Thomson Reuters Partners with DatologyAI to Break Through the Post-training Ceiling
DatologyAI enabled Thomson Reuters to significantly boost legal AI performance and post-training efficiency by applying mid-training data curation to its proprietary data with minimal additional compute.

"DatologyAI delivered clear, measurable improvements across both public and our proprietary legal evaluations, outperforming baseline models in legal reasoning, retrieval, and downstream tasks. What’s particularly compelling is that these gains were achieved with minimal data budget and with minimal information about our evaluation suite - demonstrating the strength and generalizability of DatologyAI's approach. As we scale this collaboration with deeper integration of Thomson Reuters’ proprietary data and workflows, we are very excited to continue to elevate model performance, efficiency, and real-world impact." - Jonathan Schwarz, Head of AI Research, Thomson Reuters
Results
- +5% on LegalBench, +2.5% on general-purpose evals after mid-training
- >2.5x improvement in post-training gains on Thomson Reuters’ private legal evals after mid-training on DatologyAI-curated legal data mix
- Mid-training token budget <1% of base model pre-training token budget
Thomson Reuters holds one of the largest proprietary legal databases in the world. The challenge was unlocking the full value of that data to build models that could genuinely outperform public alternatives and their current internal best.
About the Client:
Industry: Legal & Information Services
Domain: Legal and Professional Services
DatologyAI Product: Legal Domain Adaptation Mid-training

Leveraging Competitive Advantages
Thomson Reuters entered the AI era from a position of genuine strength. They had proprietary data that no competitor could replicate, a sophisticated in-house ML team, and an established post-training pipeline that used their proprietary data and domain expertise.
Post-training was producing meaningful gains, but in today’s competitive landscape, the team wanted to explore how they could further improve their models’ legal capabilities while fully leveraging their data. To continue building broadly capable foundation models that could outperform both large-lab generic models and emerging legal AI startups, they needed to extract more signal from their data beyond the gains that post-training alone produced.
The practical constraints to improve their legal capabilities were tight: any solution had to excel at real-world legal use cases without sacrificing general capabilities, and meaningfully outperform their existing systems. The path forward required getting the most out of their proprietary data, and doing so at a scale that post-training could not achieve on its own.
When Thomson Reuters’ ML leadership set out to build the best legal reasoning model, they realized they had to leverage their proprietary data beyond post-training alone. They faced a choice familiar to anyone in this position: invest engineering time building data curation infrastructure in-house, or find a partner who had already solved this problem at scale. They chose the latter, and the evaluation was rigorous.
Data Curation Strategy
Thomson Reuters turned to DatologyAI with the goal of curating a legal domain adaptation mid-training dataset that could propel their in-house foundation model far beyond what was achievable with post-training alone. DatologyAI uses a suite of algorithms to identify and extract the highest-quality, most relevant data, and to synthetically augment and enhance that data to improve both general and domain-specific model capabilities.
DatologyAI applied their data curation pipeline and research expertise to Thomson Reuters’ legal data to generate a 100B-token dataset tailored to Thomson Reuters’ use cases. The testbed was an 8B model, which had been pre-trained on 15T tokens. As in Thomson Reuters’ previous approach, the mid-trained model was then post-trained on Thomson Reuters’ proprietary data via SFT and DPO.
Improving Proprietary Legal Performance with Better Data
Despite mid-training for less than 1% of the pre-training budget of the base model (100B tokens vs 15T tokens), the performance improvements were substantial:
- +5% on LegalBench, +2.5% on general-purpose evals after mid-training
- >2.5x improvement in post-training gains on Thomson Reuters’ private legal evals after mid-training on DatologyAI-curated legal data mix
- Mid-training token budget <1% of base model pre-training token budget
The amplification effect on post-training is worth pausing on. The same post-training procedure applied to the mid-trained model produced more than 2.5x the gains it had produced on the base model. This suggests that well-curated data changes what post-training can achieve. For organizations that have already invested in post-training pipelines and proprietary training data, this is a significant leverage point: the value of existing post-training infrastructure scales with the quality of the model it operates on.
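The headline numbers above can be sanity-checked with simple arithmetic. The sketch below uses the reported figures (100B mid-training tokens, 15T pre-training tokens, >2.5x amplification); the baseline post-training gain of +2 points is a hypothetical value chosen purely for illustration.

```python
# Sanity-check the reported results with back-of-the-envelope arithmetic.

pretrain_tokens = 15e12   # 15T tokens used to pre-train the 8B base model
midtrain_tokens = 100e9   # 100B tokens of DatologyAI-curated legal data

# Mid-training budget as a fraction of the pre-training budget.
budget_ratio = midtrain_tokens / pretrain_tokens
print(f"Mid-training budget: {budget_ratio:.2%} of pre-training")  # ~0.67%, i.e. <1%

# Illustrative amplification: if post-training lifted the base model by some
# gain g, the same procedure on the mid-trained model lifted it by more than
# 2.5x that amount. (base_gain here is hypothetical, not a reported number.)
base_gain = 2.0                      # hypothetical: +2 points from post-training alone
midtrained_gain = base_gain * 2.5    # reported floor on the amplification factor
print(f"Post-training gain after mid-training: >+{midtrained_gain:.1f} points")
```

The first ratio is why the results bullets describe the mid-training budget as under 1% of pre-training: 100B of 15T tokens is roughly two-thirds of one percent.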

Looking Forward
The results from this initial partnership demonstrate something important: the combination of Thomson Reuters’ unmatched legal corpus and DatologyAI’s data curation expertise produces compounding advantages that neither could achieve alone. What began as a focused mid-training engagement has revealed the full scope of what becomes possible when proprietary data is properly leveraged across the model development pipeline.
Together, DatologyAI and Thomson Reuters are working on the next generation of domain-specific AI: models that are more efficient, more capable, and better grounded in the knowledge Thomson Reuters has built over decades, and that set a new standard for delivering the trusted, world-class services Thomson Reuters is known for.
Ready for better data?
Let’s make models better through better data, automatically.
Book a Call