DatologyAI: Train Better Models, Faster and Smaller

DatologyAI today announced a partnership with Thomson Reuters to advance domain-specific legal AI by enabling Thomson Reuters to mid-train custom models and unlock the full value of its proprietary data through data curation. The collaboration produced substantial performance gains.

Click here for the case study and additional details on the partnership.

Collaboration Results

+5% on legal, +2.5% on general-purpose evals after mid-training
>2.5x improvement in post-training gains on Thomson Reuters’ private legal evals after mid-training on DatologyAI-curated legal data mix
Mid-training token budget <1% of base model pre-training token budget

Jonathan Schwarz, Head of AI Research at Thomson Reuters, says, "DatologyAI delivered clear, measurable improvements across both public and our proprietary legal evaluations, outperforming baseline models in legal reasoning, retrieval, and downstream tasks. What’s particularly compelling is that these gains were achieved with minimal data budget and with minimal information about our evaluation suite - demonstrating the strength and generalizability of DatologyAI's approach. As we scale this collaboration with deeper integration of Thomson Reuters’ proprietary data and workflows, we are very excited to continue to elevate model performance, efficiency, and real-world impact."

As competition in legal AI accelerates, Thomson Reuters sought new ways to further leverage its proprietary corpus to build models that outperform both generic large-scale models and emerging domain-specific AI systems. To pursue this approach, Thomson Reuters partnered with DatologyAI to curate a large-scale, legal mid-training dataset tailored to Thomson Reuters’ real-world legal use cases.

DatologyAI's CEO, Ari Morcos, says, "We are thrilled to partner with Thomson Reuters and demonstrate how companies can leverage proprietary data to outperform general-purpose models available on the market. We see these initial results as just the beginning, and our team is excited for the upcoming collaboration."

The partnership highlights a powerful opportunity for organizations with large proprietary datasets: combining domain-specific data assets with advanced data curation and mid-training techniques can unlock new levels of model performance while maximizing the value of existing AI infrastructure.

DatologyAI and Thomson Reuters Partner to Advance Domain AI Models