Several major publishers have moved to block the Internet Archive, a valuable resource for journalists and researchers, out of concern that AI companies' bots are using its collections as a workaround. They have restricted the non-profit digital library's access to their content, fearing that AI businesses could use it to scrape articles indirectly.
The head of business affairs and licensing for The Guardian, Robert Hahn, stated that the Internet Archive's API would have been an obvious place for AI companies to plug in and suck out the IP. Similarly, The New York Times has blocked the Internet Archive's bot from accessing its content, citing the Wayback Machine's unfettered access to its articles without authorization.
The subscription-focused Financial Times and the social forum Reddit have also moved to selectively limit how the Internet Archive catalogs their material. These publishers are among those that have sued AI businesses over the methods used to obtain content for training large language models.
For instance, The New York Times has sued OpenAI and Microsoft, while the Center for Investigative Reporting has also taken action against these companies. Similarly, The Wall Street Journal and New York Post have sued Perplexity, and a group of publishers including The Atlantic, The Guardian, and Politico have sued Cohere.
While some media outlets have sought financial deals before offering up their libraries as training material, the issue remains complex because of copyright and piracy concerns in other creative fields such as fiction, visual art, and music.