TechyMag.com is an online magazine where you can find news and updates on modern technologies



OpenAI destroyed 100,000 books on which GPT-3 was trained. The people involved also disappeared somewhere


OpenAI has deleted two huge datasets, "books1" and "books2", which were used to train the GPT-3 model.

This was reported by Business Insider, citing materials from the Authors Guild lawsuit.

Essence of the lawsuit

Lawyers for the Authors Guild stated that the GPT-3 datasets likely contained "over 100,000 published books". If so, OpenAI used copyrighted material to train its AI models.

For reference: the Authors Guild, established in 1912, is the oldest and most authoritative professional organization of writers in the United States. It defends freedom of speech and copyright.

For several months, the Authors Guild asked OpenAI to provide information about the datasets it had used. The company initially refused, citing confidentiality provisions. It later emerged that OpenAI had deleted all copies of the data.

High-quality training data is an important part of powerful AI models. To build these models, OpenAI and other companies use data from the internet, including books.

Many of the companies and authors who created this content want to be compensated when it is used in these new AI products. Technology companies do not want to be forced to pay. The dispute is currently being fought out in court in several lawsuits.

100,000 books - 16% of GPT-3 training data

In a technical document from 2020, OpenAI described the datasets books1 and books2 as "a corpus of books from the internet" and stated that they constituted a total of 16% of the training data used in creating GPT-3.

The document also mentioned that "books1" and "books2" together contained 67 billion tokens, or approximately 50 billion words.
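The relationship between the two figures can be checked with a little arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the only inputs are the two numbers quoted above. It derives the words-per-token ratio those numbers imply, which lands close to the roughly 0.75 words per token often quoted as a rule of thumb for English text.

```python
# Figures quoted in the article (books1 + books2 combined).
BOOKS_TOKENS = 67e9   # 67 billion tokens
BOOKS_WORDS = 50e9    # approximately 50 billion words


def words_per_token(tokens: float, words: float) -> float:
    """Return the average number of words represented by one token."""
    return words / tokens


def estimate_words(tokens: float, ratio: float) -> float:
    """Estimate a word count from a token count and a words-per-token ratio."""
    return tokens * ratio


# Ratio implied by the article's own numbers.
ratio = words_per_token(BOOKS_TOKENS, BOOKS_WORDS)
print(f"Implied words per token: {ratio:.2f}")

# Assumption: 0.75 words per token is a rough rule of thumb, not an exact constant.
approx_words = estimate_words(BOOKS_TOKENS, 0.75)
print(f"Estimate at 0.75 words/token: {approx_words / 1e9:.2f} billion words")
```

The exact ratio depends on the tokenizer and on the text itself, so the result should be read as an order-of-magnitude check rather than a precise conversion.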

OpenAI stopped using "books1" and "books2" for model training at the end of 2021. In mid-2022 the datasets were deleted as "unsuitable for use".

The court documents also state that the two researchers who created "books1" and "books2" no longer work at OpenAI. OpenAI refuses to disclose information about them, although the Authors Guild insists on it.

OpenAI has asked the court to keep the names of the employees, as well as information about the datasets, confidential.

"The models that are currently being used by ChatGPT and our API were not created using these datasets," OpenAI stated on Tuesday.

It is worth recalling a related case in which AI researcher and former Amazon manager Viviane Ghaderi accused her former employer of copyright infringement.

In March, the director of her team tasked her with finding out why Amazon was not meeting its goals for Alexa search quality. During the conversation, he recommended ignoring copyright policy to improve results. The director pointed to competitors, saying that "everyone does it."
