Meta Trained Its Llama AI Models Using 81.7 TB of Books Stolen From Torrent Shadow Libraries

Meta Platforms, Inc. is facing serious allegations in a copyright infringement lawsuit, with plaintiffs claiming the tech giant used 81.7 terabytes of pirated books from shadow libraries to train its Llama AI models.

The lawsuit, filed in the U.S. District Court for the Northern District of California, accuses Meta of illegally torrenting copyrighted material from sources such as Z-Library and LibGen, despite internal concerns over the legality and ethics of such actions.

The plaintiffs, led by author Richard Kadrey and others representing a proposed class, have filed a motion objecting to a pretrial discovery ruling that they argue limits their ability to gather critical evidence against Meta.

They claim that Meta’s last-minute disclosure of more than 2,000 documents on December 13, 2024, mere hours before the close of fact discovery, revealed damning admissions by employees about using pirated materials for AI training.

Newly unsealed emails reportedly reveal the strongest evidence yet against Meta in a copyright lawsuit filed by book authors, who claim the company unlawfully trained its AI models using pirated books.

Among the disclosed documents are internal communications acknowledging that databases like LibGen are “pirated” and expressing ethical concerns about their use.

One employee reportedly stated, “I feel that using pirated material should be beyond our ethical threshold.” Another document indicates that Meta’s decision to use LibGen was escalated to CEO Mark Zuckerberg.

The authors claim that internal emails about torrenting prove Meta knew the practice was illegal. They point to warnings from employee Bashlykov, which they say were ignored.

Instead of stopping, Meta allegedly tried to cover its tracks, secretly downloading and sharing terabytes of data from shadow libraries as recently as April 2024.

Massive Data Acquisition

The plaintiffs allege that Meta torrented at least 81.7 terabytes of data from shadow libraries in recent years, including 35.7 terabytes from Z-Library and LibGen via Anna’s Archive.

This data reportedly includes tens of millions of copyrighted works used to train Llama models. The scale of this alleged piracy dwarfs many previous cases involving intellectual property theft.

The plaintiffs are challenging several aspects of a recent discovery ruling:

  • Reopening Depositions: They argue that the late-disclosed documents contradict prior testimony from key Meta witnesses and justify reopening depositions to question them about these revelations.
  • Torrenting Data: Plaintiffs are seeking access to Meta’s torrenting logs and peer-sharing records to demonstrate how much pirated material was downloaded and redistributed.
  • Llama 4 and 5 Training Datasets: The plaintiffs claim that datasets used for upcoming versions of Llama are relevant to their case and should be produced.
  • Crime-Fraud Exception: They allege that Meta’s attorneys were involved in decisions to use pirated materials despite knowing it was illegal, warranting an in-camera review of privileged communications under the crime-fraud exception.

This case could have far-reaching consequences for the tech industry, particularly regarding the ethical and legal standards for using copyrighted materials in AI development.

If the plaintiffs succeed, the ruling could set a precedent for holding companies accountable for using unauthorized content in machine learning models.

Meta has not yet responded publicly to these latest allegations. A hearing date for the court to consider the plaintiffs’ objections has not been scheduled.
