Large language models (LLMs) have demonstrated impressive performance on natural language tasks such as understanding, summarization, and text generation. However, they still struggle in more challenging settings: tasks that require using tools, handling structured data, or performing complex multi-step reasoning. For example, while LLMs are adept at understanding unstructured text, they struggle to use and interpret organized data such as spreadsheets, tables, and databases. They also frequently underperform on tasks such as multi-hop question answering (MHQA), which requires combining information from multiple sources, and on tool-use tasks such as writing SQL to answer questions over tables.
To overcome these problems, researchers from Meta, the University of Oxford, and University College London have introduced a new technique called Source2Synth. Its main benefit is the ability to teach LLMs new skills without the need for expensive and time-consuming human annotation. Conventional approaches to improving LLM performance often require large amounts of manual annotation, which is costly and hard to scale, particularly for complicated tasks. Source2Synth removes this requirement by generating synthetic data that mimics real-world situations and reasoning processes.
To create synthetic instances with intermediate reasoning steps, Source2Synth starts from a real data source, such as tables from the web or relevant articles. Because the examples are grounded in real data, the synthetic data stays diverse, realistic, and factually grounded. The core of the method is to select a seed topic, which can be an entity or a factual statement, and then develop it into a complete example containing the task instruction, the reasoning steps required to solve the problem, and the solution. Through this procedure, Source2Synth can generate intricate, realistic data points that mimic the way LLMs must handle structured data or carry out multi-step tasks.
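The seed-to-example pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate` stands in for an LLM call, and the prompts and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SyntheticExample:
    """One Source2Synth-style data point grounded in a real source."""
    seed: str                   # seed topic: an entity or factual statement
    instruction: str            # the task instruction
    reasoning_steps: list[str]  # intermediate reasoning steps
    answer: str                 # the final solution

def build_example(source_text: str, generate) -> SyntheticExample:
    """Seed -> full example, where `generate` is any prompt -> text callable."""
    seed = generate(f"Pick an entity or fact from: {source_text}")
    instruction = generate(f"Write a question about: {seed}")
    steps = [generate(f"Give one reasoning step for: {instruction}")]
    answer = generate(f"Answer the question using: {source_text}")
    return SyntheticExample(seed, instruction, steps, answer)

# Toy stand-in for an LLM so the sketch runs end to end:
# it just echoes back the content after the prompt's colon.
example = build_example("Marie Curie won two Nobel Prizes.",
                        generate=lambda p: p.split(": ", 1)[1])
```

In a real pipeline, each `generate` call would be a separate prompted LLM invocation, and the reasoning would span several chained steps rather than one.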
An essential component of Source2Synth is how it curates the quality of the dataset. Low-quality examples can degrade model performance, and not all generated data points are equally valuable. To address this, Source2Synth filters synthetic instances by their answerability: if the generated data does not lead to the correct answer within a given number of trials, the example is discarded. This quality-control step ensures that only strong examples, those that actually help the LLM acquire the target skill, are retained for the final round of fine-tuning.
The technique has been evaluated in two demanding domains:
- Multi-hop Question Answering (MHQA): In this domain, the LLM must analyze and synthesize information from multiple sources to answer a single question. When Source2Synth was evaluated on HotPotQA, a dataset built for multi-hop reasoning, it outperformed baseline models tuned with conventional techniques by 22.57%.
- Tabular Question Answering (TQA): Answering questions over structured data, which often requires issuing SQL queries against tables. When Source2Synth was evaluated on WikiSQL, a dataset focused on using SQL to answer questions about tables, it achieved a 25.51% improvement over baseline models.
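To make the TQA setting concrete, here is a minimal, self-contained illustration of what the task asks of a model: given a table and a natural-language question, produce a SQL query whose result answers it. The table and question are invented for illustration; they are not drawn from WikiSQL.

```python
import sqlite3

# A tiny in-memory table, standing in for the kind found in WikiSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, goals INTEGER)")
conn.executemany("INSERT INTO players VALUES (?, ?)",
                 [("Ada", 12), ("Grace", 7), ("Alan", 9)])

# Question: "Who scored the most goals?"
# In TQA, the model's job is to emit a query like this one:
sql = "SELECT name FROM players ORDER BY goals DESC LIMIT 1"
answer = conn.execute(sql).fetchone()[0]  # executes to "Ada"
```

Generating and executing queries like this, rather than reading the table as raw text, is exactly the tool-use skill Source2Synth's synthetic examples are designed to teach.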
These results demonstrate how Source2Synth can boost LLM performance on challenging tasks without requiring large amounts of human annotation. By producing realistic, well-grounded examples and rigorously filtering the dataset for quality, Source2Synth offers a scalable method for training LLMs in domains that require sophisticated reasoning and tool use.
In conclusion, Source2Synth is a novel method for teaching LLMs new skills, particularly in situations where human annotation is not feasible. By grounding synthetic data generation in real-world sources and ensuring that only high-quality examples are used for fine-tuning, the approach addresses current LLM limitations on complex tasks such as multi-step reasoning and structured data manipulation.
Take a look at the paper. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.