OpenAI deleted the 100,000 books on which GPT-3 was trained. The people involved are gone too

OpenAI has deleted two huge datasets, "books1" and "books2", that were used to train the GPT-3 model.

This was reported by Business Insider, citing materials from the Authors Guild lawsuit.

Essence of the lawsuit

Authors Guild lawyers stated that the GPT-3 datasets likely contained "over 100,000 published books". According to the lawyers, this means OpenAI used copyrighted material to train its AI models.

Reference. The Authors Guild is the oldest (founded in 1912) and most authoritative professional organization of writers in the United States. It defends freedom of speech and copyright.

For several months, the Authors Guild asked OpenAI to provide information about the datasets it had used. The company initially refused, citing confidentiality provisions. It later emerged that OpenAI had deleted all copies of the data.

High-quality training data is an important part of powerful AI models. To build these models, OpenAI and other companies use data from the internet, including books.

Many of the companies that created this content want to be compensated for supplying it to these new AI products. Technology companies do not want to be forced to pay. The dispute is currently playing out in court across several lawsuits.

100,000 books - 16% of GPT-3 training data

In a 2020 technical document, OpenAI described the books1 and books2 datasets as "a corpus of books from the internet" and stated that together they made up 16% of the training data used to create GPT-3.

The document also mentioned that "books1" and "books2" together contained 67 billion tokens, or approximately 50 billion words.
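The two figures can be checked against each other with a quick back-of-the-envelope calculation. The sketch below assumes a ratio of about 0.75 English words per token, a common rule of thumb that is not stated in the document itself; the constant and function names are illustrative only.

```python
# Token count for "books1" + "books2" as reported in the article.
BOOKS_TOKENS = 67e9

# Assumption (not from the source): roughly 0.75 English words per token.
WORDS_PER_TOKEN = 0.75


def tokens_to_words(tokens, words_per_token=WORDS_PER_TOKEN):
    """Estimate a word count from a token count using a fixed ratio."""
    return tokens * words_per_token


estimated_words = tokens_to_words(BOOKS_TOKENS)
print(f"Estimated size: {estimated_words / 1e9:.2f} billion words")
```

Under that assumed ratio, the estimate lands close to the "approximately 50 billion words" figure cited above.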

OpenAI stopped using "books1" and "books2" for model training at the end of 2021. They were deleted in mid-2022 for being "unsuitable for use".

The documents also state that the two researchers who created the "books1" and "books2" datasets no longer work at OpenAI. OpenAI refuses to disclose information about them, although the Authors Guild insists on it.

OpenAI has asked the court to keep the employees' names, as well as information about the datasets, confidential.

"The models that are currently being used by ChatGPT and our API were not created using these datasets," OpenAI stated on Tuesday.

It is worth recalling the case in which AI researcher and former Amazon manager Viviane Ghaderi accused her former employer of copyright infringement.

In March, the director of her team tasked her with finding out why Amazon was not meeting its goals for Alexa search quality. During the conversation, he recommended ignoring the copyright policy to improve results. The director pointed to competitors, saying that "everyone does it."
