You can find useful datasets on countless platforms: Kaggle, Papers with Code, GitHub, and more. But what if I told you there’s a goldmine? A repository packed with over 400 datasets, meticulously categorised across seven essential dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, Traditional NLP Datasets, Multi-modal (MLLM) Datasets, and RAG Datasets. And to top it off, the collection receives regular updates. Sounds impressive, right?
These datasets were compiled by Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin in their survey paper “Datasets for Large Language Models: A Comprehensive Survey,” released in February 2024. It offers a groundbreaking look at the backbone of large language model (LLM) development: datasets.
Note: I am providing you with a brief description of the datasets mentioned in the research paper; you can find all the datasets in the repo.
Datasets for Your GenAI/LLMs Project: Abstract Overview of the Paper
Source: Datasets for Large Language Models: A Comprehensive Survey
This paper sets out to navigate the intricate landscape of LLM datasets, which are the cornerstone behind the stellar evolution of these models. Just as the roots of a tree provide the necessary support and nutrients for growth, datasets are fundamental to LLMs. Thus, studying these datasets isn’t just relevant; it’s essential.
Given the current gaps in comprehensive analysis and overview, this survey organises and categorises the essential types of LLM datasets from five primary perspectives, which the accompanying repository extends with two further categories:
- Pre-training Corpora
- Instruction Fine-tuning Datasets
- Preference Datasets
- Evaluation Datasets
- Traditional Natural Language Processing (NLP) Datasets
- Multi-modal Large Language Models (MLLMs) Datasets
- Retrieval Augmented Generation (RAG) Datasets
The research outlines the key challenges that exist today and suggests potential directions for further exploration. It goes a step beyond mere discussion by compiling a thorough review of available dataset resources: statistics from 444 datasets spanning 32 domains and 8 language categories. This includes extensive data size metrics—more than 774.5 TB for pre-training corpora alone and 700 million instances across other dataset types.
This survey acts as a complete roadmap to guide researchers, serve as an invaluable resource, and inspire future studies in the LLM field.
Here’s the overall architecture of the survey:
LLM Text Datasets Across Seven Dimensions
Here are the key types of LLM text datasets, categorized into seven main dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, Traditional NLP Datasets, Multi-modal Large Language Models (MLLMs) Datasets, and Retrieval Augmented Generation (RAG) Datasets. These categories are regularly updated for comprehensive coverage.
Note: I am using the same structure mentioned in the repo, and you can refer to the repo for the dataset information format.
Each entry follows this format:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
  - Publisher:
  - Size:
  - License:
  - Source:
Repo Link: Awesome-LLMs-Datasets
1. Pre-training Corpora
These are extensive collections of text used during the initial training phase of LLMs.
A. General Pre-training Corpora: Large-scale datasets that include diverse text sources from various domains. They are designed to train foundational models that can perform various tasks due to their broad data coverage.
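Before diving into the catalogue, here is a minimal, hedged sketch of how you might inspect one of these web-scale corpora (FineWeb, listed below) by streaming it with the Hugging Face `datasets` library; the dataset ID and the `text` field name reflect its public Hub listing and should be verified against the repo links:

```python
# A minimal sketch: stream a web-scale pre-training corpus without
# downloading it in full. Assumes `pip install datasets` and that the
# FineWeb Hub ID below is still current.
from datasets import load_dataset

# streaming=True iterates over records lazily -- essential for TB-scale corpora
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, record in enumerate(fineweb):
    print(record["text"][:200])  # each record carries the raw page text
    if i == 2:                   # peek at the first few documents only
        break
```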
Webpages
- MADLAD-400 2023-9 | All | Multi (419) | HG | Paper | Github | Dataset
  - Publisher: Google DeepMind et al.
  - Size: 2.8 T Tokens
  - License: ODL-BY
  - Source: Common Crawl
- FineWeb 2024-4 | All | EN | CI | Dataset
  - Publisher: HuggingFaceFW
  - Size: 15 T Tokens
  - License: ODC-BY-1.0
  - Source: Common Crawl
- CCI 2.0 2024-4 | All | ZH | HG | Dataset1 | Dataset2
  - Publisher: BAAI
  - Size: 501 GB
  - License: CCI Usage Agreement
  - Source: Chinese webpages
- DCLM 2024-6 | All | EN | CI | Paper | Github | Dataset | Website
  - Publisher: University of Washington et al.
  - Size: 279.6 TB
  - License: Common Crawl Terms of Use
  - Source: Common Crawl
Language Texts
- ANC 2003-x | All | EN | HG | Website
  - Publisher: The US National Science Foundation et al.
  - Size: –
  - License: –
  - Source: American English texts
- BNC 1994-x | All | EN | HG | Website
  - Publisher: Oxford University Press et al.
  - Size: 4,124 Texts
  - License: –
  - Source: British English texts
- News-crawl 2019-1 | All | Multi (59) | HG | Dataset
  - Publisher: UKRI et al.
  - Size: 110 GB
  - License: CC0
  - Source: Newspapers
Books
- Anna’s Archive 2023-x | All | Multi | HG | Website
  - Publisher: Anna
  - Size: 586.3 TB
  - License: –
  - Source: Sci-Hub, Library Genesis, Z-Library, etc.
- BookCorpusOpen 2021-5 | All | EN | CI | Paper | Github | Dataset
  - Publisher: Jack Bandy et al.
  - Size: 17,868 Books
  - License: Smashwords Terms of Service
  - Source: Toronto Book Corpus
- PG-19 2019-11 | All | EN | HG | Paper | Github | Dataset
  - Publisher: DeepMind
  - Size: 11.74 GB
  - License: Apache-2.0
  - Source: Project Gutenberg
- Project Gutenberg 1971-x | All | Multi | HG | Website
  - Publisher: Ibiblio et al.
  - Size: –
  - License: The Project Gutenberg License
  - Source: Ebook data
You can find more categories in this dimension here: General Pre-training Corpora
B. Domain-specific Pre-training Corpora: Customized datasets focused on specific fields or topics, used for targeted, incremental pre-training to enhance performance in specialized domains.
Financial
- BBT-FinCorpus 2023-2 | Partial | ZH | HG | Paper | Github | Website
  - Publisher: Fudan University et al.
  - Size: 256 GB
  - License: –
  - Source: Company announcements, research reports, financial
  - Category: Multi
  - Domain: Finance
- FinCorpus 2023-9 | All | ZH | HG | Paper | Github | Dataset
  - Publisher: Du Xiaoman
  - Size: 60.36 GB
  - License: Apache-2.0
  - Source: Company announcements, financial news, financial exam questions
  - Category: Multi
  - Domain: Finance
- FinGLM 2023-7 | All | ZH | HG | Github
  - Publisher: Knowledge Atlas et al.
  - Size: 69 GB
  - License: Apache-2.0
  - Source: Annual reports of listed companies
  - Category: Language Texts
  - Domain: Finance
Medical
- Medical-pt 2023-5 | All | ZH | CI | Github | Dataset
  - Publisher: Ming Xu
  - Size: 632.78 MB
  - License: Apache-2.0
  - Source: Medical encyclopedia data, medical textbooks
  - Category: Multi
  - Domain: Medical
- PubMed Central 2000-2 | All | EN | HG | Website
  - Publisher: NCBI
  - Size: –
  - License: PMC Copyright Notice
  - Source: Biomedical scientific literature
  - Category: Academic Materials
  - Domain: Medical
Math
- Proof-Pile-2 2023-10 | All | EN | HG & CI | Paper | Github | Dataset | Website
  - Publisher: Princeton University et al.
  - Size: 55 B Tokens
  - License: –
  - Source: ArXiv, OpenWebMath, AlgebraicStack
  - Category: Multi
  - Domain: Mathematics
- MathPile 2023-12 | All | EN | HG | Paper | Github | Dataset
  - Publisher: Shanghai Jiao Tong University et al.
  - Size: 9.5 B Tokens
  - License: CC-BY-NC-SA-4.0
  - Source: Textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, arXiv
  - Category: Multi
  - Domain: Mathematics
- OpenWebMath 2023-10 | All | EN | HG | Paper | Github | Dataset
  - Publisher: University of Toronto et al.
  - Size: 14.7 B Tokens
  - License: ODC-BY-1.0
  - Source: Common Crawl
  - Category: Webpages
  - Domain: Mathematics
You can find more categories in this dimension here: Domain-specific Pre-training Corpora
2. Instruction Fine-tuning Datasets
These datasets consist of pairs of “instruction inputs” (requests made to the model) and corresponding “answer outputs” (model-generated responses).
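To make the format concrete, here is an illustrative record in the instruction/input/output layout popularised by the Alpaca_data release (listed below); the content is invented for illustration:

```python
# Illustrative only: one instruction fine-tuning record in the Alpaca-style
# layout (instruction / optional input / target output).
record = {
    "instruction": "Summarise the following paragraph in one sentence.",
    "input": "Large language models are trained on web-scale corpora and ...",
    "output": "LLMs acquire broad language ability from massive text corpora.",
}
```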
A. General Instruction Fine-tuning Datasets: Include a variety of instruction types without domain limitations. They aim to improve the model’s ability to follow instructions across general tasks.
Human Generated Datasets (HG)
- databricks-dolly-15K 2023-4 | All | EN | HG | Dataset | Website
  - Publisher: Databricks
  - Size: 15,011 instances
  - License: CC-BY-SA-3.0
  - Source: Manually generated based on different instruction categories
  - Instruction Category: Multi
- InstructionWild_v2 2023-6 | All | EN & ZH | HG | Github
  - Publisher: National University of Singapore
  - Size: 110K instances
  - License: –
  - Source: Collected on the web
  - Instruction Category: Multi
- LCCC 2020-8 | All | ZH | HG | Paper | Github
  - Publisher: Tsinghua University et al.
  - Size: 12M instances
  - License: MIT
  - Source: Crawled user interactions on social media
  - Instruction Category: Multi
Model Constructed Datasets (MC)
- Alpaca_data 2023-3 | All | EN | MC | Github
  - Publisher: Stanford Alpaca
  - Size: 52K instances
  - License: Apache-2.0
  - Source: Generated by Text-Davinci-003 with Alpaca_data prompts
  - Instruction Category: Multi
- BELLE_Generated_Chat 2023-5 | All | ZH | MC | Github | Dataset
  - Publisher: BELLE
  - Size: 396,004 instances
  - License: GPL-3.0
  - Source: Generated by ChatGPT
  - Instruction Category: Generation
- BELLE_Multiturn_Chat 2023-5 | All | ZH | MC | Github | Dataset
  - Publisher: BELLE
  - Size: 831,036 instances
  - License: GPL-3.0
  - Source: Generated by ChatGPT
  - Instruction Category: Multi
You can find more categories in this dimension here: General Instruction Fine-tuning Datasets
B. Domain-specific Instruction Fine-tuning Datasets: Tailored for specific domains, containing instructions relevant to particular knowledge areas or task types.
Medical
- ChatDoctor 2023-3 | All | EN | HG & MC | Paper | Github | Dataset
  - Publisher: University of Texas Southwestern Medical Center et al.
  - Size: 115K instances
  - License: Apache-2.0
  - Source: Real conversations between doctors and patients & Generated by ChatGPT
  - Instruction Category: Multi
  - Domain: Medical
- ChatMed_Consult_Dataset 2023-5 | All | ZH | MC | Github | Dataset
  - Publisher: michael-wzhu
  - Size: 549,326 instances
  - License: CC-BY-NC-4.0
  - Source: Generated by GPT-3.5-Turbo
  - Instruction Category: Multi
  - Domain: Medical
- CMtMedQA 2023-8 | All | ZH | HG | Paper | Github | Dataset
  - Publisher: Zhengzhou University
  - Size: 68,023 instances
  - License: MIT
  - Source: Real conversations between doctors and patients
  - Instruction Category: Multi
  - Domain: Medical
Code
- Code_Alpaca_20K 2023-3 | All | EN & PL | MC | Github | Dataset
  - Publisher: Sahil Chaudhary
  - Size: 20K instances
  - License: Apache-2.0
  - Source: Generated by Text-Davinci-003
  - Instruction Category: Code
  - Domain: Code
- CodeContest 2022-3 | All | EN & PL | CI | Paper | Github
  - Publisher: DeepMind
  - Size: 13,610 instances
  - License: Apache-2.0
  - Source: Collection and improvement of various datasets
  - Instruction Category: Code
  - Domain: Code
- CommitPackFT 2023-8 | All | EN & PL (277) | HG | Paper | Github | Dataset
  - Publisher: Bigcode
  - Size: 702,062 instances
  - License: MIT
  - Source: GitHub Action dump
  - Instruction Category: Code
  - Domain: Code
You can find more categories in this dimension here: Domain-specific Instruction Fine-tuning Datasets
3. Preference Datasets
Preference datasets evaluate and refine model responses by providing comparative feedback on multiple outputs for the same input.
A. Preference Evaluation Methods: These can include methods such as voting, sorting, and scoring to establish how model responses align with human preferences.
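As a sketch, a pairwise (vote-style) preference record looks roughly like this; the layout is simplified from datasets such as hh-rlhf (listed below), and the field names are illustrative rather than any dataset's exact schema:

```python
# Illustrative only: a simplified pairwise preference record, the kind of
# comparison data used to train reward models (datasets like hh-rlhf store
# comparable chosen/rejected pairs).
preference = {
    "prompt": "How do I start learning to code?",
    "chosen": "Pick one beginner-friendly language, such as Python, "
              "and build small projects.",
    "rejected": "Just read compiler source code until it makes sense.",
}
# A reward model is then trained so that it scores `chosen` above `rejected`
# for the same prompt.
```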
Vote
- Chatbot_arena_conversations 2023-6 | All | Multi | HG & MC | Paper | Dataset
  - Publisher: UC Berkeley et al.
  - Size: 33,000 instances
  - License: CC-BY-4.0 & CC-BY-NC-4.0
  - Domain: General
  - Instruction Category: Multi
  - Preference Evaluation Method: VO-H
  - Source: Generated by twenty LLMs & Manual judgment
- hh-rlhf 2022-4 | All | EN | HG & MC | Paper1 | Paper2 | Github | Dataset
  - Publisher: Anthropic
  - Size: 169,352 instances
  - License: MIT
  - Domain: General
  - Instruction Category: Multi
  - Preference Evaluation Method: VO-H
  - Source: Generated by LLMs & Manual judgment
- MT-Bench_human_judgments 2023-6 | All | EN | HG & MC | Paper | Github | Dataset | Website
  - Publisher: UC Berkeley et al.
  - Size: 3.3K instances
  - License: CC-BY-4.0
  - Domain: General
  - Instruction Category: Multi
  - Preference Evaluation Method: VO-H
  - Source: Generated by LLMs & Manual judgment
You can find more categories in this dimension here: Preference Evaluation Methods
4. Evaluation Datasets
These datasets are meticulously curated and annotated to measure the performance of LLMs on various tasks. They are categorized based on the domains they are used to evaluate.
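As a rough illustration of how evaluation datasets are consumed, the loop below runs a model over question/answer pairs and reports exact-match accuracy; `my_model` and the field names are hypothetical stand-ins, not any benchmark's official harness (AlpacaEval, for instance, uses model-based evaluation instead):

```python
# A minimal, hypothetical evaluation loop: exact-match accuracy over a list
# of {"question": ..., "answer": ...} items. Real benchmarks ship their own
# harnesses and metrics; this only shows the basic pattern.
def exact_match_accuracy(dataset, my_model):
    hits = 0
    for item in dataset:
        prediction = my_model(item["question"]).strip().lower()
        hits += prediction == item["answer"].strip().lower()
    return hits / len(dataset)
```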
General
- AlpacaEval 2023-5 | All | EN | CI & MC | Paper | Github | Dataset | Website
  - Publisher: Stanford et al.
  - Size: 805 instances
  - License: Apache-2.0
  - Question Type: SQ
  - Evaluation Method: ME
  - Focus: Performance on open-ended question answering
  - Numbers of Evaluation Categories/Subcategories: 1/-
  - Evaluation Category: Open-ended question answering
- BayLing-80 2023-6 | All | EN & ZH | HG & CI | Paper | Github | Dataset
  - Publisher: Chinese Academy of Sciences
  - Size: 320 instances
  - License: GPL-3.0
  - Question Type: SQ
  - Evaluation Method: ME
  - Focus: Chinese-English language proficiency and multimodal interaction skills
  - Numbers of Evaluation Categories/Subcategories: 9/-
  - Evaluation Category: Writing, Roleplay, Common-sense, Fermi, Counterfactual, Coding, Math, Generic, Knowledge
- BELLE_eval 2023-4 | All | ZH | HG & MC | Paper | Github
  - Publisher: BELLE
  - Size: 1,000 instances
  - License: Apache-2.0
  - Question Type: SQ
  - Evaluation Method: ME
  - Focus: Performance of Chinese language models in following instructions
  - Numbers of Evaluation Categories/Subcategories: 9/-
  - Evaluation Category: Extract, Closed QA, Rewrite, Summarization, Generation, Classification, Brainstorming, Open QA, Others
You can find more categories in this dimension here: Evaluation Datasets
5. Traditional NLP Datasets
These datasets cover text used for natural language processing tasks prior to the era of LLMs. They are essential for tasks like language modelling, translation, and sentiment analysis in traditional NLP workflows.
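Many of these benchmarks are mirrored on the Hugging Face Hub; below is a hedged sketch of loading BoolQ (the `"boolq"` ID and field names reflect its Hub listing at the time of writing, where only train and validation splits are public; check the repo links for the canonical source):

```python
# Hedged sketch: load a traditional NLP benchmark from the Hugging Face Hub.
from datasets import load_dataset

boolq = load_dataset("boolq")       # splits: train / validation

sample = boolq["train"][0]
print(sample["question"])           # natural yes/no question
print(sample["passage"][:200])      # supporting passage
print(sample["answer"])             # boolean label
```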
Selection & Judgment
- BoolQ 2019-5 | EN | Paper | Github
  - Publisher: University of Washington et al.
  - Train/Dev/Test/All Size: 9427/3270/3245/15942
  - License: CC-SA-3.0
- CosmosQA 2019-9 | EN | Paper | Github | Dataset | Website
  - Publisher: University of Illinois Urbana-Champaign et al.
  - Train/Dev/Test/All Size: 25588/3000/7000/35588
  - License: CC-BY-4.0
- CondaQA 2022-11 | EN | Paper | Github | Dataset
  - Publisher: Carnegie Mellon University et al.
  - Train/Dev/Test/All Size: 5832/1110/7240/14182
  - License: Apache-2.0
- PubMedQA 2019-9 | EN | Paper | Github | Dataset | Website
  - Publisher: University of Pittsburgh et al.
  - Train/Dev/Test/All Size: -/-/-/273.5K
  - License: MIT
- MultiRC 2018-6 | EN | Paper | Github | Dataset
  - Publisher: University of Pennsylvania et al.
  - Train/Dev/Test/All Size: -/-/-/9872
  - License: MultiRC License
You can find more categories in this dimension here: Traditional NLP Datasets
6. Multi-modal Large Language Models (MLLMs) Datasets
Datasets in this category integrate multiple data types, such as text and images, to train models capable of processing and generating responses across different modalities.
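For intuition, interleaved image-text corpora such as OBELISC store each document as aligned lists of text segments and image references in reading order; the schematic record below is illustrative, not the exact published schema:

```python
# Schematic only: an interleaved image-text document. Position i holds either
# a text segment or an image reference; None keeps the two lists aligned so
# the original page order is preserved.
document = {
    "texts":  ["Opening paragraph ...", None, "Discussion of the figure ..."],
    "images": [None, "https://example.com/figure1.png", None],
}
```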
Documents
- mOSCAR: A large-scale multilingual and multimodal document-level corpus
- OBELISC: An open web-scale filtered dataset of interleaved image-text documents
Instruction Fine-tuning Datasets:
Remote Sensing
- MMRS-1M: Multi-sensor remote sensing instruction dataset
Images + Videos
- VideoChat2-IT: Instruction fine-tuning dataset for images/videos
You can find more categories in this dimension here: Multi-modal Large Language Models (MLLMs) Datasets
7. Retrieval Augmented Generation (RAG) Datasets
These datasets enhance LLMs with retrieval capabilities, enabling models to access and integrate external data sources for more informed and contextually relevant responses.
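To ground what these benchmarks measure, here is a toy retrieve-then-generate loop; scikit-learn's TF-IDF stands in for a production retriever, and `my_llm` is a hypothetical model call, not part of any of the benchmarks below:

```python
# A toy retrieve-then-generate pipeline: the pattern RAG benchmarks evaluate.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny document store; real RAG systems index far larger corpora.
docs = [
    "Paris is the capital of France.",
    "The Nile is the longest river in Africa.",
]

vectorizer = TfidfVectorizer().fit(docs)
doc_matrix = vectorizer.transform(docs)

def retrieve(query, k=1):
    """Return the k documents most similar to the query (TF-IDF cosine)."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

query = "What is the capital of France?"
context = retrieve(query)[0]
prompt = f"Answer using the context.\nContext: {context}\nQuestion: {query}"
# answer = my_llm(prompt)  # hypothetical LLM call; benchmarks score `answer`
```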
- CRUD-RAG: A comprehensive Chinese benchmark for RAG
- WikiEval: Used for correlation analysis of the different metrics proposed in RAGAS
- RGB: A benchmark for RAG
- RAG-Instruct-Benchmark-Tester: An updated benchmarking test dataset for RAG use cases in the enterprise
You can find more categories in this dimension here: Retrieval Augmented Generation (RAG) Datasets
Conclusion
In conclusion, “Datasets for Large Language Models: A Comprehensive Survey” provides an invaluable roadmap for navigating the diverse and complex world of LLM datasets. This extensive review by Liu, Cao, Liu, Ding, and Jin catalogues over 400 datasets, meticulously organised into critical dimensions such as Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, and Evaluation Datasets, covering more than 774.5 TB of data and 700 million instances. By breaking down these datasets and their uses, from broad foundational pre-training corpora to highly specialized domain-specific collections, the survey highlights existing resources and maps out current challenges and future research directions in developing and optimising LLMs. It serves both as a guide for researchers entering the field and as a reference for those aiming to extend generative AI’s capabilities and application scope.
Frequently Asked Questions
Q1. What types of datasets are used for LLMs?
Ans. Datasets for LLMs can be broadly categorized into structured data (e.g., tables, databases), unstructured data (e.g., text documents, books, articles), and semi-structured data (e.g., HTML, JSON). The most common are large-scale, diverse text datasets compiled from sources like websites, encyclopedias, and academic papers.

Q2. How does the training dataset affect an LLM’s performance?
Ans. The training dataset’s quality, diversity, and size heavily impact an LLM’s performance. A well-curated dataset improves the model’s generalizability, comprehension, and bias reduction, while a poorly curated one can lead to inaccuracies and biased outputs.

Q3. What are common sources of LLM training data?
Ans. Common sources include web scrapes from platforms like Wikipedia, news sites, books, research journals, and large-scale repositories like Common Crawl. Publicly available datasets such as The Pile or OpenWebText are also frequently used.

Q4. How can bias in LLM datasets be mitigated?
Ans. Mitigating data bias involves diversifying data sources, implementing fairness-aware data collection strategies, filtering content to reduce bias, and post-training fine-tuning. Regular audits and ethical reviews help identify and minimize biases during dataset creation.