
Arcee AI Builds Frontier Open Models with DatologyAI’s Data Curation Platform

Arcee AI partnered with DatologyAI to leverage large-scale data curation, enabling the rapid development of a frontier-level open-weight model that achieves top-tier performance at a fraction of the cost of competing systems.

“In the span of a year, we went from never having pretrained a model to releasing what is currently the strongest open weight model the West has ever seen. Datology was a key partner in that transformation. Their ability to curate and synthesize 20 trillion high-quality tokens helped us concentrate on the core challenges of building at scale: infrastructure, reliability, and efficiency.

The result: stronger performance with significantly less data than conventional approaches demand. If you're building your own models, you need to work with DatologyAI.”

Lucas Atkins, CTO, Arcee AI

Results

  • Over 3 trillion tokens served in the first two months on OpenRouter
  • Exceptional agentic performance: Among evaluated models (Opus 4.6, MiniMax-M2.7, GLM-5, Kimi-K2.5), Trinity-Large-Thinking ranks #1 on Tau2-Airline (88.0) and #2 on both PinchBench (91.9) and AIME25 (96.3), trailing only Claude Opus 4.6 on the latter two
  • Competitive with frontier models across reasoning, coding, and instruction-following benchmarks
  • ~96% lower inference cost than the top-ranked model on PinchBench ($0.90/M output tokens vs. $25/M for Opus 4.6)
  • Open-weight model competitive with systems 10–15x larger

About the Client:
Industry: Artificial Intelligence & Foundation Models
Domain: Open-weight LLMs, Enterprise AI
DatologyAI Product: Large-scale pre-training data curation

Leveraging Competitive Advantages

Arcee entered the market with several key strengths: a clear focus on open, sovereign AI; deep expertise in model infrastructure and training; and a commitment to efficient, production-ready systems. However, as they transitioned from fine-tuning existing models to training foundation models from scratch, a core challenge emerged:

How do you build models that can compete with frontier systems without access to hyperscaler-scale data pipelines or compute budgets?

While model architecture and training infrastructure were strong, data quality and curation at scale became the limiting factor. Like many teams at this stage, Arcee faced a familiar decision: invest heavily in building a data curation pipeline internally, or partner with a team that had already solved the problem at scale.

They chose to partner with DatologyAI.

Data Curation Strategy

Arcee partnered with DatologyAI to develop a high-performance pre-training dataset optimized for both general capability and downstream usability.

DatologyAI applied its full data curation stack, including:

  • Model-based quality filtering
  • Embedding-driven data selection
  • Distribution balancing and source mixing
  • Synthetic data generation and augmentation

This resulted in a three-phase, 20-trillion-token pre-training dataset designed to deliver exceptional general, programming, STEM, reasoning, and multilingual capabilities. It included over 8 trillion tokens of synthetic data generated across web, code, math, reasoning, and multilingual domains, using a breadth of state-of-the-art rephrasing approaches.
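To make one item from the curation stack above concrete, the sketch below illustrates the general idea behind embedding-driven data selection; it is a hypothetical, minimal example, not DatologyAI's actual implementation. Documents are embedded and near-duplicates are greedily pruned by cosine similarity; the sentence-transformers model name and the 0.9 similarity threshold are illustrative assumptions.

```python
# Minimal, hypothetical sketch of embedding-driven data selection.
# NOT DatologyAI's proprietary pipeline; model name and threshold are
# arbitrary choices for illustration. Requires numpy and sentence-transformers.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_documents(docs: list[str], sim_threshold: float = 0.9) -> list[str]:
    """Embed documents and greedily drop near-duplicates by cosine similarity."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # normalize_embeddings=True makes dot products equal cosine similarity.
    emb = model.encode(docs, normalize_embeddings=True)

    kept_idx: list[int] = []
    for i, vec in enumerate(emb):
        # Skip any document too similar to something already kept.
        if kept_idx and np.max(emb[kept_idx] @ vec) >= sim_threshold:
            continue
        kept_idx.append(i)
    return [docs[i] for i in kept_idx]

if __name__ == "__main__":
    sample = [
        "The cat sat on the mat.",
        "A cat was sitting on the mat.",   # near-duplicate, likely dropped
        "Gradient descent minimizes a loss function.",
    ]
    print(select_documents(sample))
```

At real pre-training scale this greedy pairwise pass would be replaced by approximate nearest-neighbor search and clustering, but the core signal, similarity in embedding space, is the same.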

From Experiment to Frontier in Nine Months

Arcee's first foundation model, AFM-4.5B (July 2025), was a dense 4.5B-parameter model trained on 8 trillion DatologyAI-curated web tokens. AFM-4.5B validated the approach and justified Arcee's decision to double down on pre-training and shift strategy toward releasing a model that could perform at the frontier. This led to Trinity Nano (6B total parameters, ~800M active) and Trinity Mini (26B total, 3B active), released in December 2025, alongside Arcee's public commitment to training frontier models from scratch. The Trinity Nano and Mini pre-training dataset grew to 10 trillion tokens, with curated math and code integrated from the start.

The same principles scaled to Trinity-Large-Thinking (~400B total parameters, 13B active per token). DatologyAI generated over 8 trillion synthetic tokens on clusters peaking at 2,048 H100 GPUs and paired them with 12 trillion curated web tokens. Across three model generations, the pre-training corpus grew from 8T to 10T to 20T tokens, with each iteration incorporating deeper curation and broader domain coverage.

Results: World-Class Performance at Open-Weight Economics

The bet paid off. Trinity-Large-Thinking is now the most-used open model in the U.S. on OpenRouter, serving 3.37 trillion tokens in its first two months.

The model's standout results are on the benchmarks that matter most for production agentic workloads:

  • Tau2-Airline (88.0): Best overall among all models tested—ahead of Opus 4.6, GLM-5, MiniMax-M2.7, and Kimi-K2.5. Tau2 evaluates complex, multi-turn task completion in constrained domains—exactly the kind of workload enterprises deploy agents for.
  • PinchBench (91.9): Second only to Opus 4.6. PinchBench is the leading benchmark for agentic model capability, measuring multi-step tool calling and instruction following across long-running loops.
  • AIME25 (96.3): Second only to Opus 4.6, demonstrating frontier-level mathematical reasoning.
  • Tau2-Telecom (94.7): Second only to GLM-5, reinforcing strong domain-specific task performance.
[Figure: Arcee AI Trinity benchmark comparison]

Across the full benchmark suite, Trinity holds its own against the best closed models in the world.

Critically, Trinity achieves this performance at approximately 96% lower cost than the top-ranked closed model, with fully open weights under Apache 2.0. This is not a model that’s merely "near" the frontier; it’s solidly world-class, and the strongest fully open reasoning model built in the United States.

The Compounding Effect of Data Quality

One of the most important takeaways from this collaboration is how data quality compounds across the model lifecycle. Each generation of Arcee models benefited from deeper, broader, and more sophisticated data curation, and each generation delivered disproportionately better results. AFM-4.5B proved the approach, and Trinity Nano and Mini validated that adding highly curated math and code to pre-training unlocked capabilities that post-training alone could not reach. Trinity Large demonstrated that these principles hold at frontier scale.

High-quality pre-training data also increased the effectiveness of downstream post-training, reduced the need for excessive scale, and enabled faster iteration cycles. For teams building foundation models, this represents a critical leverage point: the value of every stage of the pipeline increases when the underlying data is better.

Looking Forward

The partnership between Arcee and DatologyAI demonstrates what becomes possible when diligent model design and world-class post-training meet high-quality pre-training data curation. In nine months, Arcee went from a 4.5B dense experiment to a frontier-quality, open-weight 400B MoE. 

Together, Arcee and DatologyAI are continuing to push toward a new standard where open models built in the U.S. deliver frontier performance without frontier costs.


Ready for better data?

Let’s make models better through better data, automatically.

Book a Call