Synthetic data for AI

The benefits of AI are concentrated in areas where data is available.
Synthetic data promises to fill the gaps.

Will Douglas Heaven

February 23, 2022

Andrea D'aquino

Key players: Synthetic Data Vault, Syntegra, Datagen, Synthesis AI

Availability: Now

Last year, researchers at Data Science Nigeria noted that engineers looking to train computer-vision algorithms could choose from a wealth of data sets featuring Western clothing, but there were none for African clothing. The team addressed the imbalance by using AI to generate artificial images of African fashion—a whole new data set from scratch.

Such synthetic data sets (computer-generated samples with the same statistical characteristics as the genuine article) are increasingly common in the data-hungry world of machine learning. They can be used to train AI models in areas where real data is scarce or too sensitive to use, as with medical records or personal financial data.
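To make "the same statistical characteristics" concrete, here is a minimal sketch in Python. It is an illustration only, not how any of the tools named in this article work: it assumes a single numeric column that is roughly bell-shaped, fits a plain Gaussian to it, and draws new values from that fit. The data values and the function names `fit_gaussian` and `sample_synthetic` are hypothetical.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n artificial values from a Gaussian with the fitted parameters."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" measurements (invented for this example).
real = [62.1, 70.4, 68.9, 75.3, 66.0, 71.8, 64.5, 73.2, 69.7, 67.4]

mean, stdev = fit_gaussian(real)
synthetic = sample_synthetic(mean, stdev, n=5000)

# The synthetic column contains no original record,
# yet its summary statistics track those of the real one.
print("real:     ", round(mean, 2), round(stdev, 2))
print("synthetic:", round(statistics.mean(synthetic), 2),
      round(statistics.stdev(synthetic), 2))
```

Real generators model far richer structure, such as correlations between columns or the pixels of an image, but the principle is the same: learn a distribution from real data, then sample from it.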

The idea of synthetic data isn’t new: driverless cars have been trained on virtual streets. But in the last year the technology has become widespread, with a raft of startups and universities offering such services. Datagen and Synthesis AI, for example, supply digital human faces on demand. Others provide synthetic data for finance and insurance. And the Synthetic Data Vault, a project launched in 2021 by MIT’s Data to AI Lab, provides open-source tools for creating a wide range of data types.

This boom in synthetic data sets is driven by generative adversarial networks (GANs), a type of AI that is adept at generating realistic but fake examples, whether of images or medical records.

Proponents claim that synthetic data avoids the bias that is rife in many data sets. But it will only be as unbiased as the real data used to generate it. A GAN trained on fewer Black faces than white, for example, may be able to create a synthetic data set with a higher proportion of Black faces, but those faces may end up being less lifelike given the limited original data.

