Common Corpus: A Large Public Domain Dataset for LLM Training

In the fast-moving field of artificial intelligence, a long-standing debate concerns whether copyrighted materials are needed to train the best AI models. OpenAI's claim to the UK Parliament in 2023 that training such models without copyrighted content was “impossible” drew wide attention across the industry and fed into ongoing legal battles and ethical dilemmas. However, recent developments challenge this conventional wisdom, offering evidence that large language models can be trained without the controversial use of copyrighted materials.

The Common Corpus initiative has become the largest public domain dataset for LLM training. This international collaboration, led by Pleias and involving researchers specializing in LLM pretraining, AI ethics, and cultural heritage, challenges the status quo. The diverse, multilingual dataset shows that LLMs can be trained without copyright concerns, marking a significant shift in the AI landscape.

Fairly Trained, an AI industry nonprofit, has awarded its first certification for an LLM built without copyright infringement, a model known as KL3M. KL3M was developed by 273 Ventures, a Chicago-based legal technology consulting startup. Fairly Trained CEO Ed Newton-Rex, who oversees the certification process, has stated that “there is no fundamental reason why someone cannot train an LLM fairly.”

Kelvin Legal DataPack, the training dataset created by 273 Ventures, includes thousands of legal documents reviewed for compliance with copyright law. At around 350 billion tokens, it is smaller than the datasets compiled by OpenAI and others who have scraped the internet, yet the resulting model reportedly performs well. Jillian Bommarito, co-founder of 273 Ventures, attributes the success of the KL3M model to the rigorous vetting applied to the data. Curated datasets like this could be used to tailor AI models precisely to their designated tasks. 273 Ventures is now offering spots on a waiting list for clients who want access to the data.

The researchers behind Common Corpus assembled a text collection comparable in size to the data used to train OpenAI's GPT-3 model and made it available on the open source AI platform Hugging Face. While Fairly Trained has so far certified only 273 Ventures' LLM, the emergence of Common Corpus and KL3M signals a shift in the AI landscape. Advocates for fairer AI, particularly on behalf of artists affected by data mining, see these initiatives as critical to challenging the norm. Fairly Trained's other recent certifications, including Spanish voice modulation startup VoiceMod and AI heavy metal band Frostbite Orckings, extend beyond LLMs, suggesting a broader scope for AI certification.

While Kelvin Legal DataPack, the training dataset created by 273 Ventures, is a valuable resource of copyright-reviewed legal documents, datasets built this way also have limitations. Much of the available public domain data is outdated, especially in regions such as the United States, where copyright protection often extends to 70 years or more after the author's death. A model trained on such data may therefore not be well informed about current events.

All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final-year student at IIT Kharagpur. As a technology enthusiast, he explores the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. His goal is to explain complex AI concepts in a clear and accessible way.
