Artificial intelligence (AI) companies have reached the far borders of the “data frontier.”
That’s according to a report Saturday (Aug. 31) by the Financial Times (FT), which noted that the shortage compounds the pressure on AI firms already facing a wave of copyright lawsuits and allegations that they aggressively scrape data from the internet.
For example, AI startup Anthropic was recently sued by a trio of authors who accused the company of “stealing hundreds of thousands of copyrighted books,” saying the company “never sought — let alone paid for — a license to copy and exploit the protected expression contained in the copyrighted works fed into its models.”
(An Anthropic spokesperson has told PYMNTS it was aware of the suit and was assessing the complaint, but declined to comment further.)
As the FT reported, this suit is among a host of copyright cases against AI companies, most notably The New York Times’ suit against OpenAI and Microsoft, which claims the two companies are benefiting “from the massive copyright infringement, commercial exploitation and misappropriation of The Times’s intellectual property.” OpenAI has called the suit “without merit.”
While AI companies have made major advances over the past year and a half, they now face a dearth of fresh training data, the so-called “data frontier,” forcing them to dig deeper into the web, strike deals for access to private datasets or generate synthetic data.
“There’s no more free lunch. You can’t scrape a web-scale dataset any more. You have to go and purchase it or produce it. That’s the frontier we’re at now,” said Alex Ratner, co-founder of Snorkel AI, which creates and labels datasets for businesses.
The report also pointed to another case involving Anthropic, which was recently accused by website owners of “egregious” data scraping. The company has said it tries not to be “intrusive or disruptive.”
As PYMNTS wrote earlier this summer, the financial implications of content scraping are substantial. Companies invest significant resources in researching, writing and publishing website content. Experts say that letting bots scrape this material freely undermines these efforts and can produce derivative content that outranks the original in search results.
“When their information is scraped, especially in near real-time, it can be summarized and posted by an AI over which they have no control, which in turn deprives the content creator of getting its own clicks — and the attendant revenue,” HP Newquist, executive director of The Relayer Group and author of “The Brain Makers,” said in an interview with PYMNTS.
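For context on how such scraping is, at least nominally, controlled: the web’s standard opt-out mechanism is a site’s robots.txt file, which well-behaved crawlers are expected to consult before fetching pages. Compliance is voluntary, which is why it cannot by itself prevent the uncontrolled summarization Newquist describes. Below is a minimal sketch of that check using Python’s standard library; the bot name and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity and target site, for illustration only.
USER_AGENT = "ExampleAIBot"
SITE = "https://example.com"

# robots.txt is the voluntary standard site owners use to tell crawlers
# which paths they may or may not fetch.
parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # downloads and parses the site's robots.txt

url = f"{SITE}/articles/some-report"
if parser.can_fetch(USER_AGENT, url):
    print(f"{USER_AGENT} may fetch {url}")
else:
    print(f"{USER_AGENT} is disallowed from {url}")
```

A crawler that skips this check, or ignores its answer, faces no technical barrier, which is part of why publishers increasingly pair robots.txt with server-side blocking or the kinds of paid licensing deals described above.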