Troveo Accelerates AI Model Development, Expands AI Training Data Platform to Five New Categories, Announces $20 Million in Payouts

Troveo, the world’s largest provider of licensed real-world data for AI, today announced a major expansion of its platform into five new data categories. Troveo is accelerating AI development by providing training-ready datasets at scale.

Troveo has now paid out more than $20 million to content owners, underscoring strong demand from AI labs and model builders for licensed, rights-cleared training data not available on the public internet.

Access to training data is a key bottleneck in developing next-generation frontier models. Troveo’s labeled datasets allow labs to train more efficiently and achieve significantly higher model performance. Troveo also helps labs meet aggressive deadlines by delivering the data immediately.

This data is difficult to access and often lives in broadcast archives, studio vaults, enterprise systems and private collections. Accessing it requires relationships, infrastructure and legal groundwork.

Troveo has proven this model with video, building a library of more than 8 million hours of licensed video footage.

Now Troveo is expanding into five new domains:

  • Audio: Four million hours of single- and multi-channel audio across dozens of languages and dialects, used to develop voice-based models including automatic speech recognition, voice assistance and conversational AI.

  • Text: Billions of words sourced from publishers and other rights holders, structured for training, fine-tuning and evaluation.

  • Agentic trajectories (enterprise workflow traces): Real-world business data sourced directly from companies across a range of industries that captures actual enterprise workflows.

  • Gameplay: Video game data, including time-synced keystroke and character progression metadata used for frontier world models.

  • Egocentric robotics: Real-world, first-person perspective data from operating environments that powers the robotics world.

“Beyond access to compute and top-tier talent, training data remains the biggest bottleneck for building the next generation of AI models. The most valuable data for solving that is real-world, meaning it captures the complexity of how people actually live and work,” said Marty Pesis, founder and CEO of Troveo. “It is clean, accurately labeled and ready to train on. And it’s non-public, meaning it has not been incorporated into a prior training run. It lives in archives, hard drives and operating environments that nobody has indexed or packaged for AI. Troveo delivers this data directly into the training environments of the world’s top labs.”

Legal and competitive pressures around training data have intensified across the AI industry. More model builders are seeking data pipelines that are legally defensible and traceable to rights holders. Every dataset in Troveo’s library is sourced and licensed from content owners. Troveo works with thousands of content owners and has active relationships with AI labs and model builders across the industry, including some of the largest technology companies in the world.

Troveo will continue to release new datasets on a regular cadence across all six categories. The full catalog is available at troveo.ai.

About Troveo

Troveo builds the data infrastructure that AI labs and model builders need to train the next generation of models. The company has built the world’s largest licensed platform of scarce, proprietary data for AI, spanning video, audio, text and business workflows. Troveo indexes, enriches and packages that data into formats ready for training, fine-tuning, evaluation and agentic use cases. Learn more at troveo.ai.
