Meta Platforms (META.O) ignored warnings from its own lawyers about the legal risks of using thousands of pirated books to train its AI models, according to the latest complaint in a copyright infringement case initially filed this summer.
In a new filing made late Monday night, several notable authors, including Pulitzer Prize winner Michael Chabon and comedian Sarah Silverman, allege that Meta, the owner of Facebook and Instagram, used their works without authorization to train its artificial intelligence language model, Llama. Last month, a California court dismissed part of the Silverman complaint but gave the writers permission to amend their claims. Meta has not yet responded to the allegations.
The new complaint filed on Monday includes evidence suggesting Meta knew its use of the books might not be protected by U.S. copyright law: chat logs in which a Meta-affiliated researcher discusses the dataset’s procurement in a Discord server.
In the chat logs quoted in the lawsuit, researcher Tim Dettmers recounts his dispute with Meta’s legal department over whether using the book files as training data would be “legally ok.”
“At Facebook, there are a lot of people interested in working with (T)he (P)ile, including myself, but in its current form, we are unable to use it for legal reasons,” Dettmers wrote in 2021, referring to a dataset that Meta has admitted using to train its first version of Llama, according to the complaint. A month earlier, Dettmers had written that Meta’s attorneys told him “the data cannot be used or models cannot be published if they are trained on that data,” according to the lawsuit.
Although Dettmers does not elaborate on the attorneys’ concerns, his chat partners identify “books with active copyrights” as the most likely source of worry. They argue that training on the data should “fall under fair use,” a U.S. legal doctrine that protects certain unlicensed uses of copyrighted works. Dettmers, a PhD candidate at the University of Washington, declined to comment on the allegations.
Content creators have sued tech companies in droves this year, accusing them of stealing copyrighted works to build generative AI models, which have gone viral and sparked an investment frenzy. If such lawsuits succeed, they could dampen the generative AI boom by requiring AI companies to pay creators for using their works, making it more expensive to build data-hungry models.
At the same time, new provisional regulations in Europe could expose companies to more legal risk by requiring them to disclose the data used to train their AI models. Among the datasets used to train the Llama large language model, which Meta released in February, was “the Books3 section of ThePile.” According to the lawsuit, the person who compiled the dataset has previously said it includes 196,640 books.
Llama 2, the most recent version of the model, was released for commercial use this summer, but the company chose not to reveal its training data. Companies with fewer than 700 million monthly active users can use Llama 2 for free. When it was announced, the tech industry expected it to shake up the generative AI software market and challenge the dominance of companies like OpenAI and Google, which charge for access to their models.