You can find useful datasets on countless platforms: Kaggle, Papers with Code, GitHub, and more. But what if I told you there’s a goldmine? A repository packed with over 400 datasets, meticulously categorised across seven essential dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, Traditional NLP Datasets, Multi-modal (MLLM) Datasets, and RAG Datasets. And to top it off, the collection receives regular updates. Sounds impressive, right?
These datasets were compiled by Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, and Lianwen Jin in their survey paper “Datasets for Large Language Models: A Comprehensive Survey,” released in February 2024. It offers a groundbreaking look at the backbone of large language model (LLM) development: datasets.
Note: I am providing you with a brief description of the datasets mentioned in the research paper; you can find all the datasets in the repo.
Datasets for Your GenAI/LLMs Project: Abstract Overview of the Paper
Source: Datasets for Large Language Models: A Comprehensive Survey
This paper sets out to navigate the intricate landscape of LLM datasets, which are the cornerstone behind the stellar evolution of these models. Just as the roots of a tree provide the necessary support and nutrients for growth, datasets are fundamental to LLMs. Thus, studying these datasets isn’t just relevant; it’s essential.
Given the current gaps in comprehensive analysis and overview, this survey organises and categorises the essential types of LLM datasets from five primary perspectives, which the accompanying repository extends with two further categories:
- Pre-training Corpora
- Instruction Fine-tuning Datasets
- Preference Datasets
- Evaluation Datasets
- Traditional Natural Language Processing (NLP) Datasets
- Multi-modal Large Language Models (MLLMs) Datasets
- Retrieval Augmented Generation (RAG) Datasets
The research outlines the key challenges that exist today and suggests potential directions for further exploration. It goes a step beyond mere discussion by compiling a thorough review of available dataset resources: statistics from 444 datasets spanning 32 domains and 8 language categories. This includes extensive data size metrics—more than 774.5 TB for pre-training corpora alone and 700 million instances across other dataset types.
This survey acts as a complete roadmap to guide researchers, serve as an invaluable resource, and inspire future studies in the LLM field.
Here’s the overall architecture of the survey:
LLM Text Datasets Across Seven Dimensions
Here are the key types of LLM text datasets, categorized into seven main dimensions: Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, Evaluation Datasets, Traditional NLP Datasets, Multi-modal Large Language Models (MLLMs) Datasets, and Retrieval Augmented Generation (RAG) Datasets. These categories are regularly updated for comprehensive coverage.
Note: I am using the same structure mentioned in the repo, and you can refer to the repo for the dataset information format.
Each entry follows this format:

- Dataset name  Release Time | Public or Not | Language | Construction Method | Paper | Github | Dataset | Website
  - Publisher:
  - Size:
  - License:
  - Source:
Repo Link: Awesome-LLMs-Datasets
1. Pre-training Corpora
These are extensive collections of text used during the initial training phase of LLMs.
A. General Pre-training Corpora: Large-scale datasets that include diverse text sources from various domains. They are designed to train foundational models that can perform various tasks due to their broad data coverage.
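Before diving into the catalogue, here is a minimal, hedged sketch of how you might inspect one of these web-scale corpora (FineWeb, listed below) by streaming it with the Hugging Face `datasets` library; the dataset ID and the `text` field name reflect its public Hub listing and should be verified against the repo links:

```python
# A minimal sketch: stream a web-scale pre-training corpus without
# downloading it in full. Assumes `pip install datasets` and that the
# FineWeb Hub ID below is still current.
from datasets import load_dataset

# streaming=True iterates over records lazily -- essential for TB-scale corpora
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, record in enumerate(fineweb):
    print(record["text"][:200])  # each record carries the raw page text
    if i == 2:                   # peek at the first few documents only
        break
```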
Webpages
- MADLAD-400 2023-9 | All | Multi (419) | HG | Paper | Github | Dataset
  - Publisher: Google DeepMind et al.
  - Size: 2.8 T Tokens
  - License: ODL-BY
  - Source: Common Crawl
- FineWeb 2024-4 | All | EN | CI | Dataset
  - Publisher: HuggingFaceFW
  - Size: 15 T Tokens
  - License: ODC-BY-1.0
  - Source: Common Crawl
- CCI 2.0 2024-4 | All | ZH | HG | Dataset1 | Dataset2
  - Publisher: BAAI
  - Size: 501 GB
  - License: CCI Usage Agreement
  - Source: Chinese webpages
- DCLM 2024-6 | All | EN | CI | Paper | Github | Dataset | Website
  - Publisher: University of Washington et al.
  - Size: 279.6 TB
  - License: Common Crawl Terms of Use
  - Source: Common Crawl
Language Texts
- ANC 2003-x | All | EN | HG | Website
  - Publisher: The US National Science Foundation et al.
  - Size: –
  - License: –
  - Source: American English texts
- BNC 1994-x | All | EN | HG | Website
  - Publisher: Oxford University Press et al.
  - Size: 4,124 Texts
  - License: –
  - Source: British English texts
- News-crawl 2019-1 | All | Multi (59) | HG | Dataset
  - Publisher: UKRI et al.
  - Size: 110 GB
  - License: CC0
  - Source: Newspapers
Books
- Anna’s Archive 2023-x | All | Multi | HG | Website
  - Publisher: Anna
  - Size: 586.3 TB
  - License: –
  - Source: Sci-Hub, Library Genesis, Z-Library, etc.
- BookCorpusOpen 2021-5 | All | EN | CI | Paper | Github | Dataset
  - Publisher: Jack Bandy et al.
  - Size: 17,868 Books
  - License: Smashwords Terms of Service
  - Source: Toronto Book Corpus
- PG-19 2019-11 | All | EN | HG | Paper | Github | Dataset
  - Publisher: DeepMind
  - Size: 11.74 GB
  - License: Apache-2.0
  - Source: Project Gutenberg
- Project Gutenberg 1971-x | All | Multi | HG | Website
  - Publisher: Ibiblio et al.
  - Size: –
  - License: The Project Gutenberg License
  - Source: Ebook data
You can find more categories in this dimension here: General Pre-training Corpora
B. Domain-specific Pre-training Corpora: Customized datasets focused on specific fields or topics, used for targeted, incremental pre-training to enhance performance in specialized domains.
Financial
- BBT-FinCorpus 2023-2 | Partial | ZH | HG | Paper | Github | Website
  - Publisher: Fudan University et al.
  - Size: 256 GB
  - License: –
  - Source: Company announcements, research reports, financial
  - Category: Multi
  - Domain: Finance
- FinCorpus 2023-9 | All | ZH | HG | Paper | Github | Dataset
  - Publisher: Du Xiaoman
  - Size: 60.36 GB
  - License: Apache-2.0
  - Source: Company announcements, financial news, financial exam questions
  - Category: Multi
  - Domain: Finance
- FinGLM 2023-7 | All | ZH | HG | Github
  - Publisher: Knowledge Atlas et al.
  - Size: 69 GB
  - License: Apache-2.0
  - Source: Annual reports of listed companies
  - Category: Language Texts
  - Domain: Finance
Medical
- Medical-pt 2023-5 | All | ZH | CI | Github | Dataset
  - Publisher: Ming Xu
  - Size: 632.78 MB
  - License: Apache-2.0
  - Source: Medical encyclopedia data, medical textbooks
  - Category: Multi
  - Domain: Medical
- PubMed Central 2000-2 | All | EN | HG | Website
  - Publisher: NCBI
  - Size: –
  - License: PMC Copyright Notice
  - Source: Biomedical scientific literature
  - Category: Academic Materials
  - Domain: Medical
Math
- Proof-Pile-2 2023-10 | All | EN | HG & CI | Paper | Github | Dataset | Website
  - Publisher: Princeton University et al.
  - Size: 55 B Tokens
  - License: –
  - Source: ArXiv, OpenWebMath, AlgebraicStack
  - Category: Multi
  - Domain: Mathematics
- MathPile 2023-12 | All | EN | HG | Paper | Github | Dataset
  - Publisher: Shanghai Jiao Tong University et al.
  - Size: 9.5 B Tokens
  - License: CC-BY-NC-SA-4.0
  - Source: Textbooks, Wikipedia, ProofWiki, CommonCrawl, StackExchange, arXiv
  - Category: Multi
  - Domain: Mathematics
- OpenWebMath 2023-10 | All | EN | HG | Paper | Github | Dataset
  - Publisher: University of Toronto et al.
  - Size: 14.7 B Tokens
  - License: ODC-BY-1.0
  - Source: Common Crawl
  - Category: Webpages
  - Domain: Mathematics
You can find more categories in this dimension here: Domain-specific Pre-training Corpora
2. Instruction Fine-tuning Datasets
These datasets consist of pairs of “instruction inputs” (requests made to the model) and corresponding “answer outputs” (model-generated responses).
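To make the format concrete, here is an illustrative record in the instruction/input/output layout popularised by the Alpaca_data release (listed below); the content is invented for illustration:

```python
# Illustrative only: one instruction fine-tuning record in the Alpaca-style
# layout (instruction / optional input / target output).
record = {
    "instruction": "Summarise the following paragraph in one sentence.",
    "input": "Large language models are trained on web-scale corpora and ...",
    "output": "LLMs acquire broad language ability from massive text corpora.",
}
```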
A. General Instruction Fine-tuning Datasets: Include a variety of instruction types without domain limitations. They aim to improve the model’s ability to follow instructions across general tasks.
Human Generated Datasets (HG)
- databricks-dolly-15K 2023-4 | All | EN | HG | Dataset | Website
  - Publisher: Databricks
  - Size: 15,011 instances
  - License: CC-BY-SA-3.0
  - Source: Manually generated based on different instruction categories
  - Instruction Category: Multi
- InstructionWild_v2 2023-6 | All | EN & ZH | HG | Github
  - Publisher: National University of Singapore
  - Size: 110K instances
  - License: –
  - Source: Collected on the web
  - Instruction Category: Multi
- LCCC 2020-8 | All | ZH | HG | Paper | Github
  - Publisher: Tsinghua University et al.
  - Size: 12M instances
  - License: MIT
  - Source: Crawled user interactions on social media
  - Instruction Category: Multi
Model Constructed Datasets (MC)
- Alpaca_data 2023-3 | All | EN | MC | Github
  - Publisher: Stanford Alpaca
  - Size: 52K instances
  - License: Apache-2.0
  - Source: Generated by Text-Davinci-003 with Alpaca_data prompts
  - Instruction Category: Multi
- BELLE_Generated_Chat 2023-5 | All | ZH | MC | Github | Dataset
  - Publisher: BELLE
  - Size: 396,004 instances
  - License: GPL-3.0
  - Source: Generated by ChatGPT
  - Instruction Category: Generation
- BELLE_Multiturn_Chat 2023-5 | All | ZH | MC | Github | Dataset
  - Publisher: BELLE
  - Size: 831,036 instances
  - License: GPL-3.0
  - Source: Generated by ChatGPT
  - Instruction Category: Multi
You can find more categories in this dimension here: General Instruction Fine-tuning Datasets
B. Domain-specific Instruction Fine-tuning Datasets: Tailored for specific domains, containing instructions relevant to particular knowledge areas or task types.
Medical
- ChatDoctor 2023-3 | All | EN | HG & MC | Paper | Github | Dataset
  - Publisher: University of Texas Southwestern Medical Center et al.
  - Size: 115K instances
  - License: Apache-2.0
  - Source: Real conversations between doctors and patients & Generated by ChatGPT
  - Instruction Category: Multi
  - Domain: Medical
- ChatMed_Consult_Dataset 2023-5 | All | ZH | MC | Github | Dataset
  - Publisher: michael-wzhu
  - Size: 549,326 instances
  - License: CC-BY-NC-4.0
  - Source: Generated by GPT-3.5-Turbo
  - Instruction Category: Multi
  - Domain: Medical
- CMtMedQA 2023-8 | All | ZH | HG | Paper | Github | Dataset
  - Publisher: Zhengzhou University
  - Size: 68,023 instances
  - License: MIT
  - Source: Real conversations between doctors and patients
  - Instruction Category: Multi
  - Domain: Medical
Code
- Code_Alpaca_20K 2023-3 | All | EN & PL | MC | Github | Dataset
  - Publisher: Sahil Chaudhary
  - Size: 20K instances
  - License: Apache-2.0
  - Source: Generated by Text-Davinci-003
  - Instruction Category: Code
  - Domain: Code
- CodeContest 2022-3 | All | EN & PL | CI | Paper | Github
  - Publisher: DeepMind
  - Size: 13,610 instances
  - License: Apache-2.0
  - Source: Collection and improvement of various datasets
  - Instruction Category: Code
  - Domain: Code
- CommitPackFT 2023-8 | All | EN & PL (277) | HG | Paper | Github | Dataset
  - Publisher: Bigcode
  - Size: 702,062 instances
  - License: MIT
  - Source: GitHub Action dump
  - Instruction Category: Code
  - Domain: Code
You can find more categories in this dimension here: Domain-specific Instruction Fine-tuning Datasets
3. Preference Datasets
Preference datasets evaluate and refine model responses by providing comparative feedback on multiple outputs for the same input.
A. Preference Evaluation Methods: These can include methods such as voting, sorting, and scoring to establish how model responses align with human preferences.
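As a sketch, a pairwise (vote-style) preference record looks roughly like this; the layout is simplified from datasets such as hh-rlhf (listed below), and the field names are illustrative rather than any dataset's exact schema:

```python
# Illustrative only: a simplified pairwise preference record, the kind of
# comparison data used to train reward models (datasets like hh-rlhf store
# comparable chosen/rejected pairs).
preference = {
    "prompt": "How do I start learning to code?",
    "chosen": "Pick one beginner-friendly language, such as Python, "
              "and build small projects.",
    "rejected": "Just read compiler source code until it makes sense.",
}
# A reward model is then trained so that it scores `chosen` above `rejected`
# for the same prompt.
```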
Vote
- Chatbot_arena_conversations 2023-6 | All | Multi | HG & MC | Paper | Dataset
  - Publisher: UC Berkeley et al.
  - Size: 33,000 instances
  - License: CC-BY-4.0 & CC-BY-NC-4.0
  - Domain: General
  - Instruction Category: Multi
  - Preference Evaluation Method: VO-H
  - Source: Generated by twenty LLMs & Manual judgment
- hh-rlhf 2022-4 | All | EN | HG & MC | Paper1 | Paper2 | Github | Dataset
  - Publisher: Anthropic
  - Size: 169,352 instances
  - License: MIT
  - Domain: General
  - Instruction Category: Multi
  - Preference Evaluation Method: VO-H
  - Source: Generated by LLMs & Manual judgment
- MT-Bench_human_judgments 2023-6 | All | EN | HG & MC | Paper | Github | Dataset | Website
  - Publisher: UC Berkeley et al.
  - Size: 3.3K instances
  - License: CC-BY-4.0
  - Domain: General
  - Instruction Category: Multi
  - Preference Evaluation Method: VO-H
  - Source: Generated by LLMs & Manual judgment
You can find more categories in this dimension here: Preference Evaluation Methods
4. Evaluation Datasets
These datasets are meticulously curated and annotated to measure the performance of LLMs on various tasks. They are categorized based on the domains they are used to evaluate.
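As a rough illustration of how evaluation datasets are consumed, the loop below runs a model over question/answer pairs and reports exact-match accuracy; `my_model` and the field names are hypothetical stand-ins, not any benchmark's official harness (AlpacaEval, for instance, uses model-based evaluation instead):

```python
# A minimal, hypothetical evaluation loop: exact-match accuracy over a list
# of {"question": ..., "answer": ...} items. Real benchmarks ship their own
# harnesses and metrics; this only shows the basic pattern.
def exact_match_accuracy(dataset, my_model):
    hits = 0
    for item in dataset:
        prediction = my_model(item["question"]).strip().lower()
        hits += prediction == item["answer"].strip().lower()
    return hits / len(dataset)
```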
General
- AlpacaEval 2023-5 | All | EN | CI & MC | Paper | Github | Dataset | Website
  - Publisher: Stanford et al.
  - Size: 805 instances
  - License: Apache-2.0
  - Question Type: SQ
  - Evaluation Method: ME
  - Focus: Performance on open-ended question answering
  - Numbers of Evaluation Categories/Subcategories: 1/-
  - Evaluation Category: Open-ended question answering
- BayLing-80 2023-6 | All | EN & ZH | HG & CI | Paper | Github | Dataset
  - Publisher: Chinese Academy of Sciences
  - Size: 320 instances
  - License: GPL-3.0
  - Question Type: SQ
  - Evaluation Method: ME
  - Focus: Chinese-English language proficiency and multimodal interaction skills
  - Numbers of Evaluation Categories/Subcategories: 9/-
  - Evaluation Category: Writing, Roleplay, Common-sense, Fermi, Counterfactual, Coding, Math, Generic, Knowledge
- BELLE_eval 2023-4 | All | ZH | HG & MC | Paper | Github
  - Publisher: BELLE
  - Size: 1,000 instances
  - License: Apache-2.0
  - Question Type: SQ
  - Evaluation Method: ME
  - Focus: Performance of Chinese language models in following instructions
  - Numbers of Evaluation Categories/Subcategories: 9/-
  - Evaluation Category: Extract, Closed QA, Rewrite, Summarization, Generation, Classification, Brainstorming, Open QA, Others
You can find more categories in this dimension here: Evaluation Datasets
5. Traditional NLP Datasets
These datasets cover text used for natural language processing tasks prior to the era of LLMs. They are essential for tasks like language modelling, translation, and sentiment analysis in traditional NLP workflows.
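Many of these benchmarks are mirrored on the Hugging Face Hub; below is a hedged sketch of loading BoolQ (the `"boolq"` ID and field names reflect its Hub listing at the time of writing, where only train and validation splits are public; check the repo links for the canonical source):

```python
# Hedged sketch: load a traditional NLP benchmark from the Hugging Face Hub.
from datasets import load_dataset

boolq = load_dataset("boolq")       # splits: train / validation

sample = boolq["train"][0]
print(sample["question"])           # natural yes/no question
print(sample["passage"][:200])      # supporting passage
print(sample["answer"])             # boolean label
```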
Selection & Judgment
- BoolQ 2019-5 | EN | Paper | Github
  - Publisher: University of Washington et al.
  - Train/Dev/Test/All Size: 9427/3270/3245/15942
  - License: CC-SA-3.0
- CosmosQA 2019-9 | EN | Paper | Github | Dataset | Website
  - Publisher: University of Illinois Urbana-Champaign et al.
  - Train/Dev/Test/All Size: 25588/3000/7000/35588
  - License: CC-BY-4.0
- CondaQA 2022-11 | EN | Paper | Github | Dataset
  - Publisher: Carnegie Mellon University et al.
  - Train/Dev/Test/All Size: 5832/1110/7240/14182
  - License: Apache-2.0
- PubMedQA 2019-9 | EN | Paper | Github | Dataset | Website
  - Publisher: University of Pittsburgh et al.
  - Train/Dev/Test/All Size: -/-/-/273.5K
  - License: MIT
- MultiRC 2018-6 | EN | Paper | Github | Dataset
  - Publisher: University of Pennsylvania et al.
  - Train/Dev/Test/All Size: -/-/-/9872
  - License: MultiRC License
You can find more categories in this dimension here: Traditional NLP Datasets
6. Multi-modal Large Language Models (MLLMs) Datasets
Datasets in this category integrate multiple data types, such as text and images, to train models capable of processing and generating responses across different modalities.
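For intuition, interleaved image-text corpora such as OBELISC store each document as aligned lists of text segments and image references in reading order; the schematic record below is illustrative, not the exact published schema:

```python
# Schematic only: an interleaved image-text document. Position i holds either
# a text segment or an image reference; None keeps the two lists aligned so
# the original page order is preserved.
document = {
    "texts":  ["Opening paragraph ...", None, "Discussion of the figure ..."],
    "images": [None, "https://example.com/figure1.png", None],
}
```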
Documents
- mOSCAR: A large-scale multilingual and multimodal document-level corpus
- OBELISC: An open web-scale filtered dataset of interleaved image-text documents
Instruction Fine-tuning Datasets:
Remote Sensing
- MMRS-1M: Multi-sensor remote sensing instruction dataset
Images + Videos
- VideoChat2-IT: Instruction fine-tuning dataset for images/videos
You can find more categories in this dimension here: Multi-modal Large Language Models (MLLMs) Datasets
7. Retrieval Augmented Generation (RAG) Datasets
These datasets enhance LLMs with retrieval capabilities, enabling models to access and integrate external data sources for more informed and contextually relevant responses.
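To ground what these benchmarks measure, here is a toy retrieve-then-generate loop; scikit-learn's TF-IDF stands in for a production retriever, and `my_llm` is a hypothetical model call, not part of any of the benchmarks below:

```python
# A toy retrieve-then-generate pipeline: the pattern RAG benchmarks evaluate.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny document store; real RAG systems index far larger corpora.
docs = [
    "Paris is the capital of France.",
    "The Nile is the longest river in Africa.",
]

vectorizer = TfidfVectorizer().fit(docs)
doc_matrix = vectorizer.transform(docs)

def retrieve(query, k=1):
    """Return the k documents most similar to the query (TF-IDF cosine)."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

query = "What is the capital of France?"
context = retrieve(query)[0]
prompt = f"Answer using the context.\nContext: {context}\nQuestion: {query}"
# answer = my_llm(prompt)  # hypothetical LLM call; benchmarks score `answer`
```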
- CRUD-RAG: A comprehensive Chinese benchmark for RAG
- WikiEval: Used for correlation analysis of the different metrics proposed in RAGAS
- RGB: A benchmark for RAG
- RAG-Instruct-Benchmark-Tester: An updated benchmarking test dataset for RAG use cases in the enterprise
You can find more categories in this dimension here: Retrieval Augmented Generation (RAG) Datasets
Conclusion
In conclusion, “Datasets for Large Language Models: A Comprehensive Survey” provides an invaluable roadmap for navigating the diverse and complex world of LLM datasets. This extensive review by Liu, Cao, Liu, Ding, and Jin catalogues over 400 datasets, meticulously organised into critical dimensions such as Pre-training Corpora, Instruction Fine-tuning Datasets, Preference Datasets, and Evaluation Datasets, covering more than 774.5 TB of data and 700 million instances. By breaking down these datasets and their uses, from broad foundational pre-training corpora to highly specialized domain-specific collections, the survey highlights existing resources and maps out current challenges and future research directions in developing and optimising LLMs. It serves both as a guide for researchers entering the field and as a reference for those aiming to extend generative AI’s capabilities and application scope.
Frequently Asked Questions
Q1. What types of datasets are used for LLMs?
Ans. Datasets for LLMs can be broadly categorized into structured data (e.g., tables, databases), unstructured data (e.g., text documents, books, articles), and semi-structured data (e.g., HTML, JSON). The most common are large-scale, diverse text datasets compiled from sources like websites, encyclopedias, and academic papers.

Q2. How does the training dataset affect an LLM’s performance?
Ans. The training dataset’s quality, diversity, and size heavily impact an LLM’s performance. A well-curated dataset improves the model’s generalizability, comprehension, and bias reduction, while a poorly curated one can lead to inaccuracies and biased outputs.

Q3. What are common sources of LLM training data?
Ans. Common sources include web scrapes from platforms like Wikipedia, news sites, books, research journals, and large-scale repositories like Common Crawl. Publicly available datasets such as The Pile or OpenWebText are also frequently used.

Q4. How can bias in LLM datasets be mitigated?
Ans. Mitigating data bias involves diversifying data sources, implementing fairness-aware data collection strategies, filtering content to reduce bias, and post-training fine-tuning. Regular audits and ethical reviews help identify and minimize biases during dataset creation.