Hiring human annotators is an expensive and time-consuming process, and it was traditionally how datasets for supervised fine-tuning and instruction tuning were created. Due to the high cost, only a few influential players in the field could build such comprehensive datasets. However, things have changed in recent months: numerous world-class synthetic fine-tuning datasets have been developed, most commonly with GPT-3.5 and GPT-4.
Microsoft's Phi models pioneered this approach, relying heavily on synthetic data for training. They outperformed larger models trained for longer on web datasets. With over 617,000 downloads in the last 30 days, Phi-2 is among the 20 most popular models on the Hugging Face Hub.
Beyond the fact that very little is known about how the Phi datasets were created, another drawback is their reliance on proprietary models to produce the data. Hugging Face researchers present Cosmopedia, a dataset of synthetic textbooks, blog posts, stories, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. With over 25 billion tokens and 30 million files, it is the largest open synthetic dataset to date.
While generating synthetic data may seem simple, scaling it up while preserving diversity, which is critical for downstream performance, is very difficult. In this work, the team generated more than 30 million Cosmopedia samples covering hundreds of topics, with a duplicate-content rate of less than 1%.
Cosmopedia's ultimate goal is to provide a huge amount of comprehensive, high-quality synthetic data. To construct Cosmopedia's prompts, the researchers combined two approaches: conditioning on curated sources and conditioning on web data. They call the original set of information used to build these conditions "seed data."
Curated sources: Topics come from trusted educational resources, including OpenStax, WikiHow, Stanford courses, and Khan Academy. The main shortcoming of this approach is that it does not scale, even though it produces high-quality content.
By taking advantage of variability in audience and generation style, it is possible to generate samples on a single topic in different formats (e.g., academic textbook vs. blog post) and for different audiences (e.g., young children vs. university students).
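A minimal sketch of this idea, crossing one seed topic with several audiences and formats, might look like the following; the audience lists and template wording are illustrative assumptions, not the actual Cosmopedia prompt templates.

```python
import itertools
import random

# Illustrative audiences and templates; the actual Cosmopedia wording is not reproduced here.
AUDIENCES = ["young children", "high school students", "college students", "researchers"]
TEMPLATES = {
    "textbook": "Write a detailed textbook section about {topic} for {audience}.",
    "blog post": "Write an engaging blog post about {topic} aimed at {audience}.",
    "wikihow": "Write a WikiHow-style article about {topic} that {audience} can follow.",
}

def build_prompts(topic: str, n: int = 4, seed: int = 0) -> list[str]:
    """Sample n (format, audience) combinations for a single seed topic."""
    rng = random.Random(seed)
    combos = rng.sample(list(itertools.product(TEMPLATES, AUDIENCES)), n)
    return [TEMPLATES[fmt].format(topic=topic, audience=audience)
            for fmt, audience in combos]

for prompt in build_prompts("photosynthesis"):
    print(prompt)
```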
Web data: Since web data accounts for more than 80% of Cosmopedia's prompts, this approach was clearly the most scalable. Using a dataset similar to RefinedWeb, the researchers organized millions of web samples into 145 clusters. For each cluster, they identified its theme by giving Mixtral extracts from 10 randomly selected samples and asking it to name their common topic.
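As a rough stand-in for that clustering and labeling step (the actual pipeline uses Hugging Face's text-clustering repository; the embedding model, clustering settings, and prompt wording below are assumptions), the process could be sketched like this:

```python
from collections import defaultdict

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_web_samples(texts: list[str], n_clusters: int = 145) -> dict[int, list[str]]:
    """Embed web samples and group them into topic clusters."""
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embeddings)
    clusters = defaultdict(list)
    for text, label in zip(texts, labels):
        clusters[int(label)].append(text)
    return clusters

def theme_prompt(excerpts: list[str]) -> str:
    """Build the question shown to the LLM: name the common topic of 10 excerpts."""
    joined = "\n---\n".join(text[:500] for text in excerpts[:10])
    return (
        "Here are extracts from 10 web pages:\n"
        f"{joined}\n\n"
        "What common topic do they share, and does it have educational value?"
    )
```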
After reviewing the clusters, they eliminated those that did not meet their standard for educational value; obituaries, explicit adult content, and celebrity gossip are examples of removed content. They then constructed prompts instructing the model to generate a textbook related to the topic of a web sample, based on its cluster.
To promote diversity and account for any incompleteness in the topic labeling, the team conditioned the prompts on the topic only half of the time and varied the audience and generation style. In the end, this method produced 23 million prompts.
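A hedged sketch of this web-seeded prompt construction, with assumed audience, format, and wording choices, could look like this:

```python
import random

# Illustrative choices; the real prompt templates and lists differ.
AUDIENCES = ["young children", "high school students", "college students", "professionals"]
FORMATS = ["textbook chapter", "blog post", "WikiHow-style article", "story"]

def web_seeded_prompt(extract: str, topic: str, rng: random.Random) -> str:
    """Build one prompt from a web extract, conditioning on the topic only half the time."""
    audience = rng.choice(AUDIENCES)
    fmt = rng.choice(FORMATS)
    prompt = f"Write a {fmt} for {audience}"
    if rng.random() < 0.5:  # topic labels may be noisy, so use them only half the time
        prompt += f" about {topic}"
    prompt += f", drawing inspiration from this web extract:\n\n{extract[:1000]}"
    return prompt

rng = random.Random(0)
print(web_seeded_prompt("Mitochondria are the powerhouse of the cell...", "cell biology", rng))
```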
Preliminary evaluations of models trained on the generated textbooks revealed a lack of the basic knowledge and common sense typical of a primary-school curriculum. To address this, the researchers used texts from the UltraChat and OpenHermes2.5 instruction-tuning datasets, which cover a wide variety of topics, as seed data for prompts and generated stories that incorporate common sense and everyday knowledge.
The team used the text-clustering repository to apply topic clustering to the web data used in Cosmopedia's prompts. To generate the 25 billion tokens of synthetic content with Mixtral-8x7B-Instruct-v0.1, they used the llm-swarm library. This scalable synthetic-data-generation tool works with local LLMs or inference endpoints on the Hugging Face Hub and supports the TGI and vLLM inference libraries. Mixtral-8x7B was deployed locally with TGI on H100 GPUs in the Hugging Face Science cluster. Generating Cosmopedia took over 10,000 GPU hours.
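The snippet below is a minimal, single-endpoint sketch of querying a TGI deployment of Mixtral-8x7B-Instruct-v0.1 via huggingface_hub; llm-swarm's role is to orchestrate many such endpoints, and the endpoint URL and generation parameters here are placeholder assumptions.

```python
from huggingface_hub import InferenceClient

# Placeholder URL for a local TGI endpoint serving Mixtral-8x7B-Instruct-v0.1.
client = InferenceClient(model="http://localhost:8080")

def generate_sample(prompt: str) -> str:
    """Send one prompt to the endpoint using Mixtral-Instruct's [INST] chat template."""
    return client.text_generation(
        f"<s>[INST] {prompt} [/INST]",
        max_new_tokens=2048,
        temperature=0.8,
        do_sample=True,
    )

print(generate_sample("Write a textbook section about photosynthesis for high school students."))
```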
The team notes that, because this is synthetic data, the seed samples or the model's training data could be contaminated with benchmarks. To address this, they run a decontamination pipeline to remove benchmark test samples from their dataset.
As in Phi-1, they detect potentially contaminated samples using 10-gram overlap. After retrieving candidates, they compare the dataset sample against the benchmark using difflib.SequenceMatcher and remove the sample if the ratio of the length of the matched substrings to the length of the benchmark sample exceeds 0.5. This decontamination procedure was applied to all benchmarks evaluated with the Cosmo-1B model, including MMLU, HellaSwag, PIQA, SIQA, Winogrande, OpenBookQA, ARC-Easy, and ARC-Challenge.
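Assuming simple whitespace tokenization, the decontamination check described above could be sketched as follows; the helper names are illustrative, not the actual Cosmopedia code.

```python
from difflib import SequenceMatcher

def ten_grams(text: str) -> set[tuple[str, ...]]:
    """All 10-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 10]) for i in range(len(tokens) - 9)}

def is_contaminated(sample: str, benchmark_sample: str, threshold: float = 0.5) -> bool:
    """Flag a dataset sample that overlaps too much with a benchmark sample."""
    if not ten_grams(sample) & ten_grams(benchmark_sample):
        return False  # no shared 10-gram, keep the sample
    matcher = SequenceMatcher(None, sample, benchmark_sample)
    matched_length = sum(block.size for block in matcher.get_matching_blocks())
    return matched_length / len(benchmark_sample) > threshold
```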
For data deduplication and tokenization, they used the datatrove package. Model training was carried out using nanotron and evaluation was performed using lighteval.
The model outperforms TinyLlama 1.1B on MMLU, ARC-Easy, OpenBookQA, and ARC-Challenge, and is on par with Qwen-1.5-1B on OpenBookQA and ARC-Challenge. However, Phi-1.5 still performs notably better, suggesting higher-quality synthetic generation, which could be attributed to the LLM used for generation, the topic coverage, or the prompts.
Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering Finance, Cards & Payments, and Banking, and a keen interest in AI applications. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier in today's evolving world.