Raw audio data for pretraining foundation models

We are creating a dataset of unprecedented scale to unlock new model capabilities across hundreds of low-resource languages.

News

Announcing our $3.5 million seed round

We are backed by top investors including Y Combinator, IVC and RTP.

The dataset we wish we had

Our team used to train SOTA voice models for low-resource languages. Finding the right voice data was such a pain that we decided to solve it full time.

  1. Massive scale, without skew

We wanted massive scale without extreme domain imbalances, so that base models are better generalists.

  2. Real conversations, real settings

We wanted more real people having real conversations in real settings. Current datasets and online scrapes had too much 'produced' content from movies, dramas, podcasts, vlogs, news, radio and music.

  3. Daily environments, rare environments

We wanted realistic and diverse acoustic environments: people talking while driving a tractor, cooking in a kitchen, riding public transit, operating a forklift at a factory, hiking, flying in a plane, and more.

  4. Globally diverse, locally diverse

We wanted balanced representation across geography, occupations, age, income, accent, and language, so our base models would work great for more users and use cases.

About us

We are former engineers from Apple Siri, Amazon Alexa and AWS Bedrock.

We used to train voice models for low-resource languages. We feel your data pain.

Work with us to unlock new model capabilities

Buy existing data, or collaborate with us to shape how the data grows. Let's make voice the default computing interface.

Step 1: Request a sample

We will set up a quick call to understand your use case and then send you relevant data samples.

Step 2: Purchase access

Enter into a data license agreement covering the dataset and use cases your team needs.

Step 3: Receive data

For existing data, we will grant your team access within two to four days.

Step 4: Experiment with us

We partner with research teams to design pretraining data distributions for any use case.

Mission

Accelerate the transition to voice interfaces in low-resource languages

Voice interfaces in low-resource languages will be the first truly life-changing application of modern AI at scale.

13% of the global population can't read. Text interfaces like apps and websites haven't worked for them. Voice AI will give them access to digital knowledge and services for the first time.

This will increase per-person productivity, uplifting entire GDPs.

But this opportunity is still untapped. At big tech and AI labs, low-resource languages are a side project, so progress is slow. We are here to accelerate that progress by making low-resource languages our only priority.

Take the next step

Get the data researchers wish they had

2025 Uplift AI
