Disappearing data.

So sensitive. 

Funding goes plop.

 


September 24, 2024

Mocktail 

Hi there, 

 

We are running out of high-quality data to train LLMs. 

 

That scarcity is driving up demand for synthetic data — artificially generated datasets such as text and images — to supplement model training. 

 

Enterprises are also leveraging synthetic data in domains where data availability is limited or privacy is a concern. 

 

Below, we look at 4 things you need to know about the market: the funding dropoff, big tech competition, privacy needs, and international demand. 

 

1) Synthetic training data startups hit speed bumps 

 

Since 2022, nearly 30 of the 50 vendors we identified in the synthetic training data space have raised equity funding. 

 

But funding has since fallen off. What happened?

 

a) Generative AI has upended the business models of data simulation companies founded before LLMs’ mainstream arrival. 

 

For example, Datagen, which was founded in 2018 and raised $72M, shut down in 2024 after it failed to adapt to advancements in diffusion models.

 

b) Big tech and model developers like OpenAI are generating their own synthetic data — and building developer applications. As a result, young startups are finding a chillier reception from investors (more on this later).


2) Data-sensitive industries turn to synthetic data to ensure privacy and regulatory compliance 

 

For example, in a recent interview with CB Insights, the head of AI at a Fortune 500 company described synthetic data's privacy benefits as a key differentiator among model training platforms:

 

“You have synthetic data generation models or vendors that create completely new datasets off of the dataset that you have internally, but it’s statistically identical, but has completely generated and made up data, so you don’t have to use any customer or employee data, and it reduces your risk of PII.”

 

3) Data generation startups will face stiff competition from big tech

 

Tech giants like Microsoft, Google, and Meta are all generating synthetic data to train their models. 

 

Others are building applications for developers. 

 

In July 2024, IBM and Red Hat announced the release of a new open-source tool, InstructLab, which generates synthetic data for model tuning.


“We also recently launched InstructLab, a tool for more rapid model tuning through synthetic data generation, allowing our clients to more efficiently customize models using their own data and expertise.” — Arvind Krishna, CEO of IBM, Q2’24 Earnings Call


Similarly, in June 2024, Nvidia released Nemotron-4 340B, a family of models that can generate synthetic data to train LLMs for commercial applications. 

 

4) International demand creates opportunities

 

While funding has slowed and big tech is moving in, a number of global synthetic data startups — like Italy-based Aindo and UK-based Synthesized — have been growing their headcount in the last year. 

 

Overall, 74% of synthetic data companies with growing or flat headcounts in the last year (and a minimum of 10 employees) are located outside of the US. 

 

That points to international demand for privacy-preserving, local data solutions amid emerging regulations like the EU’s AI Act. 


CB Insights customers can learn more in this brief.

    I love you.

     

    Anand

    @asanwal 

    Co-Founder & Exec Chair

     

    P.S. Join our analysts for the VC outlook on October 8. Get insights into the sectors gaining the most momentum and markets to watch.


    CB Insights' emerging technology insights platform provides all the analysis and data from this newsletter. Our data is the easiest way to discover and respond to emerging tech.


    Copyright © 2024 CB Insights, All rights reserved.

    CB Insights, 498 7th Avenue, New York, NY 10018
