
Arcee AI Builds Frontier Open Models with DatologyAI’s Data Curation Platform

Arcee AI partnered with DatologyAI to leverage large-scale data curation, enabling the rapid development of a frontier-level open-weight model that achieves top-tier performance at a fraction of the cost of competing systems.

“In the span of a year, we went from never having pretrained a model to releasing what is currently the strongest open weight model the West has ever seen. Datology was a key partner in that transformation. Their ability to curate and synthesize 20 trillion high-quality tokens helped us concentrate on the core challenges of building at scale: infrastructure, reliability, and efficiency.

The result: stronger performance with significantly less data than conventional approaches demand. If you're building your own models, you need to work with DatologyAI.”

Lucas Atkins, CTO, Arcee AI

Results

  • Over 3 trillion tokens served in the first two months on OpenRouter
  • Exceptional agentic performance: Among evaluated models (Opus 4.6, MiniMax-M2.7, GLM-5, Kimi-K2.5), Trinity-Large-Thinking ranks #1 on Tau2-Airline (88.0) and #2 on both PinchBench (91.9) and AIME25 (96.3), trailing only Claude Opus 4.6 on the latter two
  • Competitive with frontier models across reasoning, coding, and instruction-following benchmarks
  • ~96% lower inference cost than the top-ranked model on PinchBench ($0.90/M output tokens vs. $25/M for Opus 4.6)
  • Open-weight model competitive with systems 10–15x larger

About the Client:
Industry: Artificial Intelligence & Foundation Models
Domain: Open-weight LLMs, Enterprise AI
DatologyAI Product: Large-scale pre-training data curation

Leveraging Competitive Advantages

Arcee entered the market with several key strengths: a clear focus on open, sovereign AI; deep expertise in model infrastructure and training; and a commitment to efficient, production-ready systems. However, as they transitioned from fine-tuning existing models to training foundation models from scratch, a core challenge emerged:

How do you build models that can compete with frontier systems without access to hyperscaler-scale data pipelines or compute budgets?

While model architecture and training infrastructure were strong, data quality and curation at scale became the limiting factor. Like many teams at this stage, Arcee faced a familiar decision: invest heavily in building a data curation pipeline internally, or partner with a team that had already solved the problem at scale.

They chose to partner with DatologyAI.

Data Curation Strategy

Arcee partnered with DatologyAI to develop a high-performance pre-training dataset optimized for both general capability and downstream usability.

DatologyAI applied its full data curation stack, including:

  • Model-based quality filtering
  • Embedding-driven data selection
  • Distribution balancing and source mixing
  • Synthetic data generation and augmentation

This resulted in a three-phase, 20-trillion-token pre-training dataset designed to deliver exceptional general, programming, STEM, reasoning, and multilingual capabilities. It included over 8 trillion tokens of synthetic data generated across web, code, math, reasoning, and multilingual domains, using a breadth of state-of-the-art rephrasing approaches.
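To make one item from the curation stack above concrete, the sketch below illustrates the general idea behind embedding-driven data selection; it is a hypothetical, minimal example, not DatologyAI's actual implementation. Documents are embedded and near-duplicates are greedily pruned by cosine similarity; the sentence-transformers model name and the 0.9 similarity threshold are illustrative assumptions.

```python
# Minimal, hypothetical sketch of embedding-driven data selection.
# NOT DatologyAI's proprietary pipeline; model name and threshold are
# arbitrary choices for illustration. Requires numpy and sentence-transformers.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_documents(docs: list[str], sim_threshold: float = 0.9) -> list[str]:
    """Embed documents and greedily drop near-duplicates by cosine similarity."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # normalize_embeddings=True makes dot products equal cosine similarity.
    emb = model.encode(docs, normalize_embeddings=True)

    kept_idx: list[int] = []
    for i, vec in enumerate(emb):
        # Skip any document too similar to something already kept.
        if kept_idx and np.max(emb[kept_idx] @ vec) >= sim_threshold:
            continue
        kept_idx.append(i)
    return [docs[i] for i in kept_idx]

if __name__ == "__main__":
    sample = [
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",   # near-duplicate, likely dropped
        "Gradient descent minimizes a loss function.",
    ]
    print(select_documents(sample))
```

At real pre-training scale this greedy pairwise pass would be replaced by approximate nearest-neighbor search and clustering, but the core signal, similarity in embedding space, is the same.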

From Experiment to Frontier in Nine Months

Arcee's first foundation model, AFM-4.5B (July 2025), was a dense 4.5B-parameter model trained on 8 trillion DatologyAI-curated web tokens. AFM-4.5B validated the approach and justified Arcee's decision to double down on pre-training and shift strategy toward releasing a model that could perform at the frontier. This led to Trinity Nano (6B total parameters, ~800M active) and Trinity Mini (26B total, 3B active), released in December 2025, alongside Arcee's public commitment to training frontier models from scratch. The Trinity Nano and Mini pre-training dataset grew to 10 trillion tokens, with curated math and code integrated from the start.

The same principles scaled to Trinity-Large-Thinking (~400B total parameters, 13B active per token). DatologyAI generated over 8 trillion synthetic tokens on clusters peaking at 2,048 H100 GPUs and paired them with 12 trillion curated web tokens. Across three model generations, the pre-training corpus grew from 8T to 10T to 20T tokens, with each iteration incorporating deeper curation and broader domain coverage.

Results: World-Class Performance at Open-Weight Economics

The bet paid off. Trinity-Large-Thinking is now the most-used open model in the U.S. on OpenRouter, serving 3.37 trillion tokens in its first two months.

The model's standout results are on the benchmarks that matter most for production agentic workloads:

  • Tau2-Airline (88.0): Best overall among all models tested—ahead of Opus 4.6, GLM-5, MiniMax-M2.7, and Kimi-K2.5. Tau2 evaluates complex, multi-turn task completion in constrained domains—exactly the kind of workload enterprises deploy agents for.
  • PinchBench (91.9): Second only to Opus 4.6. PinchBench is the leading benchmark for agentic model capability, measuring multi-step tool calling and instruction following across long-running loops.
  • AIME25 (96.3): Second only to Opus 4.6, demonstrating frontier-level mathematical reasoning.
  • Tau2-Telecom (94.7): Second only to GLM-5, reinforcing strong domain-specific task performance.
[Figure: Arcee AI Trinity benchmark comparison]

Across the full benchmark suite, Trinity holds its own against the best closed models in the world.

Critically, Trinity achieves this performance at approximately 96% lower cost than the top-ranked closed model, with fully open weights under Apache 2.0. This is not a model that’s merely "near" the frontier; it’s solidly world-class, and the strongest fully open reasoning model built in the United States.

The Compounding Effect of Data Quality

One of the most important takeaways from this collaboration is how data quality compounds across the model lifecycle. Each generation of Arcee models benefited from deeper, broader, and more sophisticated data curation, and each generation delivered disproportionately better results. AFM-4.5B proved the approach, and Trinity Nano and Mini validated that adding highly curated math and code to pre-training unlocked capabilities that post-training alone could not reach. Trinity Large demonstrated that these principles hold at frontier scale.

High-quality pre-training data also increased the effectiveness of downstream post-training, reduced the need for excessive scale, and enabled faster iteration cycles. For teams building foundation models, this represents a critical leverage point: the value of every stage of the pipeline increases when the underlying data is better.

Looking Forward

The partnership between Arcee and DatologyAI demonstrates what becomes possible when diligent model design and world-class post-training meet high-quality pre-training data curation. In nine months, Arcee went from a 4.5B dense experiment to a frontier-quality, open-weight 400B MoE. 

Together, Arcee and DatologyAI are continuing to push toward a new standard where open models built in the U.S. deliver frontier performance without frontier costs.


Ready for better data?

Let’s make models better through better data, automatically.

Book a Call