In the rapidly developing fields of artificial intelligence and data science, the volume and accessibility of training data are critical factors in determining the capabilities and potential of large language models (LLMs). These models rely on vast amounts of textual data to train and improve their language comprehension abilities.
A recent tweet from Mark Cummins (twitter.com/mark_cummins/status/1788949889495245013?s=43&t=wGmgWgsrh94uK2vjEBUYDA) examines how close we are to exhausting the global pool of text data needed to train these models, given the exponential growth in data consumption and the demanding requirements of next-generation LLMs. To explore this question, we summarize the textual sources currently available across different media and compare them to the growing needs of sophisticated AI models.
- Web Data: The English text portion of the FineWeb dataset alone, a subset of the Common Crawl web data, contains a staggering 15 trillion tokens. The corpus can roughly double in size when premium non-English web content is added.
- Code repositories: Publicly available code, such as that compiled in the Stack v2 dataset, contributes approximately 0.78 trillion tokens. While this may seem modest compared to other sources, the total amount of code in existence is far larger, estimated at tens of trillions of tokens.
- Academic publications and patents: Academic publications and patents together amount to roughly 1 trillion tokens, a sizeable and distinctive subset of textual data.
- Books: With over 21 trillion tokens, digital book collections from sources like Google Books and Anna's Archive constitute a huge amount of textual content. When every unique book in the world is taken into account, the total token count rises to roughly 400 trillion tokens.
- Social media archives: User-generated material hosted on platforms such as Weibo and Twitter together represents a token count of approximately 49 trillion. Facebook stands out in particular with roughly 140 trillion tokens. This is an important resource, but it is largely inaccessible due to ethical and privacy concerns.
- Audio transcription: Publicly accessible audio sources such as YouTube and TikTok contribute around 12 trillion tokens to the training corpus.
- Private communications: Emails and stored instant messages add up to an enormous amount of text, approximately 1,800 trillion tokens combined. Access to this data is restricted, raising ethical and privacy concerns. A rough tally of all these estimates appears after this list.
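Taken together, these sources suggest a rough upper bound on the text that is realistically reachable today versus what exists but is off limits. The sketch below is a minimal back-of-the-envelope tally of the per-source figures listed above; the dictionary names and the split between "accessible" and "restricted" are illustrative assumptions, not authoritative measurements.

```python
# Rough tally of the token estimates listed above (values in trillions of tokens).
# These are back-of-the-envelope figures, not authoritative measurements.
token_estimates_trillions = {
    "web_english_fineweb": 15,      # FineWeb English subset of Common Crawl
    "web_non_english_premium": 15,  # premium non-English web text roughly doubles the web corpus
    "public_code_stack_v2": 0.78,   # publicly available code
    "academic_and_patents": 1,
    "digitized_books": 21,          # Google Books, Anna's Archive, etc.
    "audio_transcription": 12,      # YouTube, TikTok, and similar sources
}

accessible_total = sum(token_estimates_trillions.values())
print(f"Roughly accessible text: ~{accessible_total:.0f}T tokens")

# Sources that exist but are largely off limits for ethical and privacy reasons.
restricted_trillions = {
    "social_media_weibo_twitter": 49,
    "facebook": 140,
    "private_communications": 1800,
}
print(f"Largely inaccessible text: ~{sum(restricted_trillions.values()):.0f}T tokens")
```

The accessible total lands in the mid-60-trillion range, which is consistent with the ceiling discussed in the next paragraph, while the restricted sources dwarf it by more than an order of magnitude.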
There are ethical and logistical obstacles to future growth, as current LLM training datasets are approaching the 15 trillion token mark, roughly the amount of high-quality English text that is available. Drawing on additional resources such as books, audio transcriptions, and corpora in other languages could yield modest gains, perhaps raising the ceiling for high-quality, readable text to around 60 trillion tokens.
However, token counts in private data stores run by Google and Facebook run into the quadrillions, far beyond what ethical business practice allows anyone to use. Given the constraints of limited and ethically acceptable textual sources, the future course of LLM development depends on the creation of synthetic data. With access to private data repositories off the table, data synthesis appears to be a key future direction for AI research.
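As a concrete illustration of what data synthesis can look like in practice, the sketch below uses a Hugging Face text-generation pipeline to produce paraphrased training text from a handful of seed passages. This is a minimal sketch under assumed choices: the model, prompts, and seed passages are illustrative, not a method described in the original tweet.

```python
# Minimal sketch of synthetic text generation with an off-the-shelf model.
# Assumes the `transformers` library is installed; model and prompt are illustrative choices.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # small model, purely for demonstration

seed_passages = [
    "Large language models are trained on vast collections of text gathered from the web.",
    "High-quality training data is becoming a scarce resource for frontier models.",
]

synthetic_corpus = []
for passage in seed_passages:
    prompt = f"Rewrite the following statement in your own words: {passage}\n"
    outputs = generator(prompt, max_new_tokens=60, num_return_sequences=2, do_sample=True)
    # Keep only the newly generated continuation, not the prompt itself.
    synthetic_corpus.extend(out["generated_text"][len(prompt):].strip() for out in outputs)

for sample in synthetic_corpus:
    print(sample)
```

In practice, production-scale synthetic data pipelines use far stronger generator models plus filtering and deduplication steps, but the basic loop of seeding, generating, and collecting text is the same.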
In conclusion, the combination of growing data needs and finite text resources creates an urgent need for new approaches to training LLMs. As existing datasets approach saturation, synthetic data becomes increasingly important for overcoming the approaching limits of LLM training data. This shift highlights how the field of AI research is changing and pushes a deliberate move toward synthetic data generation to sustain continued progress and ethical compliance.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.