Synthetic data generation has become crucial in training large language models (LLMs). This field focuses on creating artificial datasets that mimic real-world data, allowing researchers to effectively train and evaluate machine learning models without compromising privacy or requiring extensive data collection efforts. The methodology behind synthetic data creation aims to provide diverse and scalable datasets to improve the robustness and performance of LLMs in various applications.
The main challenge in generating synthetic data lies in creating diverse data at a large scale. Traditional methods often struggle to maintain both diversity and scalability. Instance-based approaches, which generate new data based on a seed corpus, are limited by the diversity of the original dataset. Keypoint-based methods attempt to diversify synthetic data by leveraging a curated list of keypoints, but this process is difficult to scale across different domains due to the exhaustive selection required. As a result, these methods often fail to produce datasets that can cover a wide range of scenarios and use cases.
Current methods for generating synthetic data typically involve instance- and keypoint-based approaches. Instance-based methods use an initial corpus to create new instances, but their diversity is limited by the initial corpus. Keypoint-based methods rely on a comprehensive list of keypoints, which is difficult to exhaustively curate and limits the scope to specific domains. These methods, while useful, often fail to produce sufficiently diverse and scalable synthetic datasets, necessary for advanced LLM training and application.
Tencent ai Lab researchers introduced Persona Hub, a novel persona-based data synthesis methodology. This approach leverages a collection of 1 billion diverse personas, automatically selected from web data, to generate synthetic data. Persona Hub enables LLMs to create data from multiple perspectives, improving diversity and scalability. By associating synthetic data cues with specific personas, this methodology can guide LLMs toward creating distinct and varied data sets, overcoming the limitations of previous methods.
Persona Hub is comprised of 1 billion people representing 13% of the world’s population, each associated with unique knowledge, experiences, interests, and professions. This collection enables the generation of synthetic data in diverse scenarios by prompting LLMs with specific personas. Personas act as distributed carriers of global knowledge, guiding LLMs to produce diverse and contextually rich synthetic data. Researchers developed scalable approaches to derive these personas from massive web data, using both text-to-persona and person-to-person methods. The text-to-persona approach infers personas from specific texts, while the person-to-person approach expands the diversity of personas through interpersonal relationships.
The persona-based approach produced impressive quantitative results. The researchers created 50,000 math problems, 50,000 logical reasoning problems, 50,000 instructions, 10,000 knowledge-rich texts, 10,000 non-player game characters, and 5,000 tools. In evaluations, a model fine-tuned with 1.07 million synthetic math problems achieved 79.4% accuracy on a distributed test set of 11,600 instances, outperforming all open-source LLMs tested. On the MATH benchmark, the model achieved 64.9% accuracy, matching the performance of gpt-4-turbo-preview, demonstrating significant improvements in LLM capabilities through persona-based data synthesis.
The researchers highlighted substantial improvements in LLM performance and the profound impact of persona-based data synthesis on LLM training and development. By leveraging the billion personas in Persona Hub, the researchers were able to create diverse synthetic data sets that significantly improve LLM capabilities. This methodology proved effective in various data synthesis scenarios, demonstrating its potential to become standard practice in synthetic data generation.
The researchers’ persona-based methodology for generating synthetic data addresses the limitations of traditional methods by introducing a scalable and diverse approach. Persona Hub’s extensive collection of personas facilitates the creation of rich and varied synthetic data, advancing the field of LLM training and applications. This innovative method promises to enhance the capabilities of LLMs and expand their real-world applicability. By providing a robust solution to the challenges of synthetic data generation, this research has the potential to drive significant advances in artificial intelligence and machine learning.
Review the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on twitter.com/Marktechpost”>twitter.
Join our Telegram Channel and LinkedIn GrAbove!.
If you like our work, you will love our Newsletter..
Don't forget to join our Subreddit with over 45 billion users
Nikhil is a Consultant Intern at Marktechpost. He is pursuing an integrated dual degree in Materials from Indian Institute of technology, Kharagpur. Nikhil is an ai and Machine Learning enthusiast who is always researching applications in fields like Biomaterials and Biomedical Science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.
<script async src="//platform.twitter.com/widgets.js” charset=”utf-8″>