OpenAI's legal troubles are mounting after a U.S. court ordered the company to release internal communications about its deletion of two massive book datasets. The ruling forces OpenAI to disclose Slack messages discussing the removal of the "books1" and "books2" datasets, collections that authors claim were filled with pirated material used to train the company's AI models. The situation underscores how serious the legal environment has become for the AI giant.
This ruling is a significant win for the plaintiffs and expands the scope of what they can investigate. They have already obtained Slack messages about dataset cleanup efforts, including conversations in channels tied to "project clear" and "excise libgen." OpenAI initially told the court the datasets were removed simply because they were no longer in use, but later claimed that communications about the deletion were protected by attorney-client privilege. Judge Ona Wang rejected that reversal, ruling that changing the explanation waived the privilege. The company must now turn over additional documents and make members of its internal legal team available for depositions, giving authors the opportunity to examine exactly why OpenAI decided to delete those datasets.
The stakes could rise sharply from here. If the newly disclosed messages show OpenAI knew the book data came from illegal sources and tried to quietly bury the evidence, the authors will push to have the conduct classified as willful copyright infringement. That classification would substantially increase statutory damages per book and multiply OpenAI's financial exposure. The company has appealed, but the expanded discovery obligations create major uncertainty as multiple lawsuits move forward.
This development spotlights the broader risks facing AI companies that train large models on scraped digital content. The court's decision makes clear that internal discussions about data sources, shadow-library materials, and dataset cleanup will be central to determining liability going forward. The outcome could reshape compliance standards, documentation practices, and risk management across the rapidly expanding AI industry.
Eseandre Mordi