Protege launches DataLab, a research institution treating AI training data as a scientific discipline — with five of the world’s top AI firms already on board.
For years, the story of AI progress has been told in two currencies: model size and compute power. More parameters, more chips, faster training runs. The assumption was that scale, applied relentlessly, would keep driving results.
That assumption is running into its limits. And the constraint isn’t the model anymore. It’s the data.
On Thursday, Protege, an AI data platform, announced the launch of DataLab, a research institution built to bring scientific rigor to the AI data layer. At launch, a majority of the so-called “Magnificent 7” tech giants and several major frontier AI labs are already collaborating with DataLab on training and evaluation data projects.
The Problem DataLab Is Built to Solve
As AI models grow more advanced, progress depends not only on scale but on access to high-quality, carefully curated training data. Yet the standards, methodologies, and reproducible practices that govern how that data is built, validated, and measured have lagged far behind the sophistication of the models being trained on it.
The result is a structural gap. Enormous investment flows into model architecture and chip development. Far less flows into the rigorous, scientific treatment of the data that feeds those systems — even though the quality of that data increasingly determines whether a model performs reliably in the real world or fails in ways that are difficult to trace and harder to correct.
“We understand the three core pillars driving AI: models, chips, and data,” said Bobby Samuels, Protege’s chief executive. “We are convinced that with the right datasets — the third, underdeveloped pillar — you can push the entire frontier forward. We created DataLab to treat data as infrastructure, not exhaust.”
What DataLab Actually Does
DataLab operates across three areas. It engages directly with leading AI researchers to navigate frontier-level technical challenges and identify commercially viable pathways. It develops high-value datasets and data products through methodological discipline and rigorous process. And it maintains an active presence in the academic community — publishing data research, designing evaluations and benchmarks, and identifying gaps in today’s training and evaluation data.
The institution is led by Engy Ziedan, Protege’s co-founder and chief scientific officer, and brings together machine learning researchers, economists, and domain experts with experience in evaluation, dataset design, and applied AI systems.
“Advancing AI requires thinking at the margin,” Ziedan said: weighing what each additional data point contributes to learning against the cost of choosing the wrong dataset. “This requires disciplined dataset design, careful evaluation, and a deep understanding of real-world complexity.”
Since its launch, DataLab has released multimodal healthcare benchmark datasets designed to reflect diagnostic ambiguity and longitudinal clinical context, and has co-designed two such benchmarks, MedScribe and Medcode. It is also collaborating with frontier AI organizations on high-stakes data challenges ranging from advanced cancer research to agentic task selection, audio de-identification, and international healthcare representation.
Why the Timing Matters
The launch arrives at a moment when AI systems are moving out of research environments and into high-stakes, real-world applications, a transition that makes the quality of the underlying data decisive. A model that performs well on benchmarks but was trained on poorly curated, unrepresentative, or methodologically inconsistent data will fail in ways that are difficult to predict and costly to fix.
Nikhil Basu Trivedi, co-founder and general partner at Footwork, one of Protege’s backers, framed the gap plainly. “Data quality has become the defining constraint in frontier AI development, yet investment and innovation have lagged,” he said. “DataLab brings the same level of rigor and expertise to AI data that we have for AI chips and models.”
That comparison is instructive. The semiconductor industry has decades of accumulated standards, testing methodologies, and quality controls. The AI data layer, despite underpinning every model trained on it, has had almost none of that. DataLab is betting that the industry is ready to change that, and the labs already at the table suggest the bet is well-timed.
DataLab is inviting collaboration from frontier labs, academic researchers, and domain experts. More information is available at datalab.withprotege.com.