Generative AI is receiving a lot of attention for its ability to create text and images. But these media represent only a fraction of the data that proliferates in our society today. Data is generated every time a patient passes through a medical system, a storm impacts a flight, or a person interacts with a software application.
Using generative AI to create realistic synthetic data around those scenarios can help organizations treat patients, divert planes, or improve software platforms more effectively, especially in scenarios where real-world data is limited or sensitive.
For the past three years, MIT spinoff DataCebo has offered a generative software system called Synthetic Data Vault to help organizations create synthetic data to do things like test software applications and train machine learning models.
Synthetic Data Vault, or SDV, has been downloaded more than 1 million times, and more than 10,000 data scientists have used the open source library to generate synthetic tabular data. The founders, principal research scientist Kalyan Veeramachaneni and alumna Neha Patki '15, SM '16, believe the company's success is due to SDV's ability to revolutionize software testing.
SDV goes viral
In 2016, Veeramachaneni's group at the Data to AI Lab introduced a set of open-source generative AI tools to help organizations create synthetic data that matched the statistical properties of real data.
Companies can use synthetic data in place of sensitive information in programs while preserving statistical relationships between data points. Companies can also use synthetic data to run new software through simulations to see how it performs before releasing it to the public.
Veeramachaneni's group ran into the problem of sensitive data because it was working with companies that wanted to share their data for research.
“MIT helps you see all these different use cases,” Patki explains. “You work with financial and healthcare companies, and all of those projects are useful in formulating solutions across industries.”
In 2020, the researchers founded DataCebo to build more SDV features for larger organizations. Since then, the use cases have been as impressive as they have been varied.
With DataCebo's new flight simulator, for example, airlines can plan for rare weather events in a way that would be impossible using historical data alone. In another application, SDV users synthesized medical records to predict health outcomes for patients with cystic fibrosis. A team in Norway recently used SDV to create synthetic student data to assess whether various admissions policies were meritocratic and free of bias.
In 2021, the data science platform Kaggle hosted a competition for data scientists that used SDV to create synthetic datasets, avoiding the need for proprietary data. Approximately 30,000 data scientists participated, building solutions and predicting outcomes based on the company's realistic data.
And as DataCebo has grown, it has stayed true to its MIT roots: all of the company's current employees are MIT alumni.
Supercharging software testing
Although its open source tools serve a variety of use cases, the company is focused on growing its traction in software testing.
“You need data to test these software applications,” says Veeramachaneni. “Traditionally, developers manually write scripts to create synthetic data. With generative models, built using SDV, you can learn from a sample of collected data and then sample a large volume of synthetic data (which has the same properties as real data), or create specific scenarios and edge cases, and use the data to test your application.”
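The learn-then-sample workflow Veeramachaneni describes can be illustrated with a toy example. The sketch below is not SDV's algorithm or API; it assumes a deliberately simple stand-in model (a two-column Gaussian fitted from means and covariance), and the function names and the made-up "collected" sample are hypothetical.

```python
import math
import random
from statistics import mean


def fit_gaussian_2d(rows):
    """Learn the means and a Cholesky factor of the covariance from (x, y) rows."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = mean(xs), mean(ys)
    n = len(rows)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    # Cholesky factor of [[var_x, cov_xy], [cov_xy, var_y]]
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    return {"mean": (mx, my), "chol": (l11, l21, l22)}


def sample_rows(model, num_rows, rng):
    """Draw synthetic rows that follow the learned means and covariance."""
    mx, my = model["mean"]
    l11, l21, l22 = model["chol"]
    rows = []
    for _ in range(num_rows):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return rows


def correlation(rows):
    """Pearson correlation between the two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in rows)
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den


# Hypothetical "collected" sample: 200 rows with two correlated columns.
rng = random.Random(42)
real_rows = []
for _ in range(200):
    x = rng.uniform(0, 10)
    real_rows.append((x, 2 * x + rng.gauss(0, 1)))

# Learn from the small sample, then sample a much larger synthetic volume.
model = fit_gaussian_2d(real_rows)
synthetic_rows = sample_rows(model, 5_000, rng)

real_corr = correlation(real_rows)
synthetic_corr = correlation(synthetic_rows)
```

Comparing `real_corr` with `synthetic_corr` shows the relationship between the columns carrying over to the larger synthetic set. Real tabular data has mixed types, constraints, and multi-table structure that a toy Gaussian like this does not handle.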
For example, if a bank wanted to test a program designed to reject transfers from accounts with no money in them, it would have to simulate many accounts transacting simultaneously. Doing that with manually created data would be time-consuming. With DataCebo's generative models, customers can create any edge case they want to test.
“It is common for industries to have data that is sensitive to some extent,” Patki says. “Often when you're in a domain with sensitive data, you're dealing with regulations, and even if there are no legal regulations, it is in companies' best interests to be diligent about who has access to what and when. Therefore, synthetic data is always better from a privacy perspective.”
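To make the bank scenario concrete, here is a minimal sketch of testing against generated accounts that deliberately include the empty-account edge case. Everything in it is an assumption for illustration: the `Account` class, the `transfer` function standing in for the bank's program, and the simple random generator, which is a placeholder for a trained generative model. Transfers run one after another rather than simultaneously to keep the example short.

```python
import random
from dataclasses import dataclass


@dataclass
class Account:
    account_id: int
    balance_cents: int


def transfer(source, target, amount_cents):
    """Hypothetical program under test: reject transfers the source cannot fund."""
    if amount_cents <= 0 or source.balance_cents < amount_cents:
        return False
    source.balance_cents -= amount_cents
    target.balance_cents += amount_cents
    return True


def make_accounts(num_accounts, empty_fraction, rng):
    """Generate test accounts, forcing a share of them into the zero-balance edge case."""
    accounts = []
    for account_id in range(num_accounts):
        if rng.random() < empty_fraction:
            balance = 0
        else:
            balance = rng.randint(1, 1_000_000)
        accounts.append(Account(account_id, balance))
    return accounts


rng = random.Random(7)
accounts = make_accounts(1_000, empty_fraction=0.3, rng=rng)
total_before = sum(a.balance_cents for a in accounts)

attempts_from_empty = 0
rejected_from_empty = 0
for _ in range(10_000):
    source, target = rng.sample(accounts, 2)
    started_empty = source.balance_cents == 0
    accepted = transfer(source, target, rng.randint(1, 50_000))
    if started_empty:
        attempts_from_empty += 1
        if not accepted:
            rejected_from_empty += 1

total_after = sum(a.balance_cents for a in accounts)
```

A test suite would then check that every transfer attempted from an empty account was rejected, that no balance went negative, and that the total amount of money in the system did not change.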
Scaling synthetic data
Veeramachaneni believes DataCebo is advancing the field of what it calls synthetic enterprise data, or data generated from user behavior in large enterprise software applications.
“Enterprise data of this type is complex and not universally available, unlike linguistic data,” says Veeramachaneni. “When people use our publicly available software and report whether it works with a certain pattern, we learn many of these unique patterns and that allows us to improve our algorithms. From one perspective, we are building a corpus of these complex patterns, which is readily available for language and images.”
DataCebo also recently released features to improve the usefulness of SDV, including tools for evaluating the “realism” of the generated data, called the SDMetrics library, as well as a way to compare models' performance, called SDGym.
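One common way to score the “realism” of a single numeric column is to compare the distribution of the real values with that of the synthetic ones. The sketch below is a hand-written illustration of that idea, not the SDMetrics API: it assumes a score defined as one minus the two-sample Kolmogorov-Smirnov statistic, and the data and function name are made up.

```python
import bisect
import random


def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    1.0 means the empirical distributions are identical; values near 0.0
    mean they barely overlap.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap between two empirical CDFs occurs at an observed value.
    for value in real_sorted + synth_sorted:
        cdf_real = bisect.bisect_right(real_sorted, value) / len(real_sorted)
        cdf_synth = bisect.bisect_right(synth_sorted, value) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


rng = random.Random(0)
real_values = [rng.gauss(50, 10) for _ in range(500)]
# A faithful generator draws from (nearly) the same distribution as the real data.
good_synthetic = [rng.gauss(50, 10) for _ in range(500)]
# A poor generator is shifted far from the real data.
bad_synthetic = [rng.gauss(80, 10) for _ in range(500)]

good_score = ks_complement(real_values, good_synthetic)
bad_score = ks_complement(real_values, bad_synthetic)
```

Here `good_score` should land close to 1 and `bad_score` much lower. A single-column score like this says nothing about relationships between columns, which is why evaluation tools typically report several complementary metrics.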
“It's about ensuring that organizations trust this new data,” says Veeramachaneni. “[Our tools offer] programmable synthetic data, meaning we enable companies to insert their specific knowledge and intuition to build more transparent models.”
As companies across industries rush to adopt AI and other data science tools, DataCebo is ultimately helping them do so in a more transparent and accountable way.
“In the coming years, synthetic data from generative models will transform all data work,” says Veeramachaneni. “We believe that 90 percent of business operations can be performed with synthetic data.”