OpenAI has been accused by many parties of training its AI on copyrighted content without permission. Now, a new paper from an AI watchdog organization makes the serious accusation that the company increasingly relied on non-public books it didn't license to train its more sophisticated models.
AI models are essentially complex prediction engines. Trained on lots of data (books, movies, TV shows, and so on), they learn patterns and novel ways to extrapolate from a simple prompt. When a model “writes” an essay on a Greek tragedy or “draws” Ghibli-style images, it is simply pulling from its vast knowledge to approximate. It isn't arriving at anything new.
While several AI labs, including OpenAI, have begun embracing AI-generated data to train AI as real-world sources (mainly the public web) run dry, few have shunned real-world data entirely. That's likely because training on purely synthetic data comes with risks, such as worsening a model's performance.
The new paper, from the AI Disclosures Project, a nonprofit co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss, concludes that OpenAI likely trained its GPT-4o model on paywalled books from O'Reilly Media. (O'Reilly is the CEO of O'Reilly Media.)
GPT-4o is the default model in ChatGPT. O'Reilly doesn't have a licensing agreement with OpenAI, the paper says.
“GPT-4o, OpenAI's more recent and capable model, demonstrates strong recognition of paywalled O'Reilly book content … compared to OpenAI's earlier model GPT-3.5 Turbo,” wrote the paper's co-authors. “In contrast, GPT-3.5 Turbo shows greater relative recognition of publicly accessible O'Reilly book samples.”
The paper used a method called DE-COP, first introduced in an academic paper in 2024, designed to detect copyrighted content in language models' training data. Also known as a “membership inference attack,” the method tests whether a model can reliably distinguish human-authored texts from paraphrased versions of the same text. If it can, it suggests that the model might have prior knowledge of the text from its training data.
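To make the idea concrete, here is a minimal sketch of a DE-COP-style multiple-choice probe, written against the official openai Python client. The prompt wording, the guesses_verbatim helper, and the chance-baseline comparison are illustrative assumptions on my part, not the paper's exact protocol.

```python
# A minimal, illustrative sketch of a DE-COP-style membership inference probe.
# Assumes the official openai Python client; the prompt wording, scoring, and
# multiple-choice setup below are simplified assumptions, not the authors' code.
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def guesses_verbatim(model: str, original: str, paraphrases: list[str]) -> bool:
    """Ask the model to pick the verbatim passage out of a lineup of one
    original excerpt and several paraphrases. Consistently correct picks,
    well above chance, hint that the passage may have been in training data."""
    options = paraphrases + [original]
    random.shuffle(options)
    answer_index = options.index(original)
    lettered = "\n".join(f"{chr(65 + i)}. {text}" for i, text in enumerate(options))
    prompt = (
        "One of the following passages is quoted verbatim from a published book; "
        "the others are paraphrases. Reply with only the letter of the verbatim passage.\n\n"
        + lettered
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    choice = reply.choices[0].message.content.strip()[:1].upper()
    return choice == chr(65 + answer_index)

# Repeated over thousands of excerpts, the interesting signal is whether a
# model beats the 1-in-N chance baseline more often for paywalled books than
# for publicly accessible ones (the paper aggregates this with AUROC scores).
```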
The paper's co-authors, O'Reilly, Strauss, and AI researcher Sruly Rosenblat, say they probed GPT-4o's, GPT-3.5 Turbo's, and other OpenAI models' knowledge of O'Reilly Media books published before and after their training cutoff dates. They used 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that a particular excerpt had been included in a model's training dataset.
According to the paper's results, GPT-4o “recognized” far more paywalled O'Reilly book content than OpenAI's older models, including GPT-3.5 Turbo. That's even after accounting for potential confounding factors, the authors said, such as improvements in newer models' ability to figure out whether text was human-authored.
“GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published before its training cutoff date,” the co-authors wrote.
It isn't a smoking gun, the co-authors are careful to note. They acknowledge that their experimental method isn't foolproof, and that OpenAI could have collected the paywalled book excerpts from users copying and pasting them into ChatGPT.
Muddying the waters somewhat, the co-authors didn't evaluate OpenAI's most recent crop of models, which includes GPT-4.5 and “reasoning” models such as o3-mini and o1. It's possible that these models weren't trained on paywalled O'Reilly book data, or were trained on a smaller amount of it than GPT-4o.
That said, it's no secret that OpenAI, which has advocated for looser restrictions around developing models using copyrighted data, has been seeking higher-quality training data for some time. The company has gone as far as <a target="_blank" rel="nofollow" href="https://www.niemanlab.org/2025/02/meet-the-journalists-training-ai-models-for-meta-and-openai/">hiring journalists to help fine-tune its models' outputs</a>. That's a trend across the broader industry: AI companies recruiting experts in domains like science and physics to <a target="_blank" rel="noreferrer noopener nofollow" href="https://www.theinformation.com/articles/why-a-14-billion-startup-is-now-hiring-phds-to-train-ai-from-their-living-rooms">effectively have them feed their knowledge into AI systems</a>.
It should be noted that OpenAI pays for at least some of its training data. The company has licensing deals in place with news publishers, social networks, stock media libraries, and others. OpenAI also offers opt-out mechanisms, albeit imperfect ones, that allow copyright owners to flag content they'd prefer the company not use for training purposes.
Still, as OpenAI battles several lawsuits over its training data practices and its treatment of copyright law in U.S. courts, the O'Reilly paper isn't the most flattering look.
OpenAI didn't respond to a request for comment.