ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
Turning the curse of multilinguality into a blessing with principled data curation
NOTE: If you prefer a PDF, the arXiv version of this post can be found here.
Executive Summary
Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the “curse of multilinguality”. We study multilingual data curation across thirteen languages spanning multiple scripts, language families, and resource levels, showing that many reported regressions are not inherent to multilingual scaling: they stem from correctable deficiencies in data quality and composition, not from fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance on MMLU, ARC-Challenge, and Belebele in 12 of 13 languages (3.9% average relative gain), while curating non-English yields reciprocal improvements in English (1.2% average gain). Bespoke per-language curation produces substantially larger within-language improvements, with up to 16.9% relative gains over uncurated baselines. Extending these findings to large-scale general-purpose training mixtures, we show that curated multilingual allocations comprising under 8% of total tokens remain remarkably effective. We operationalize this approach within a broader large-scale effort that produced a 20T-token pretraining corpus derived entirely from public sources. Models with 3B and 8B parameters trained on a 1T-token random subset achieve competitive multilingual accuracy with 4–10× fewer training FLOPs than strong public baselines, establishing a new Pareto frontier in multilingual performance versus compute (Figure 1). Moreover, these benefits extend to frontier model scale: the 20T-token corpus served as part of the pretraining dataset for Trinity Large (400B/A13B), which exhibits strong multilingual performance relative to its training FLOPs. Together, these results show that targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling.
The future is already here – it's just not evenly distributed.
— William Gibson
1. Introduction
Large language models (LLMs) have fundamentally reshaped the landscape of artificial intelligence, yet their benefits remain unevenly distributed across languages. Although modern models have demonstrated remarkable capabilities in English, these capabilities often degrade substantially when applied to non-English settings [48, 37]. Bridging this gap is not merely an architectural challenge, but fundamentally a data-centric one: training on large volumes of high-quality data is essential for achieving frontier-level model capabilities. English benefits from multiple large-scale, carefully curated public corpora [44, 69, 38, 43], whereas multilingual corpora are far more fragmented. Many non-English languages occupy a long tail characterized by limited, noisy, or inconsistently curated data, constraining multilingual model performance regardless of architectural capacity.
Beyond uneven data availability across languages, multilingual modeling faces an additional distinct challenge: the so-called "curse of multilinguality" [23, 36]. This term refers to the empirical observation that training a single model across an increasing number of languages often leads to degraded per-language performance, even under comparable training budgets. Historically, the consensus view attributed this phenomenon to a capacity bottleneck, framing multilingual modeling as a zero-sum game in which distinct languages compete for finite parameters or model capacity [2, 23, 36]. Under this paradigm, the primary solutions have been to scale model size [14, 35] or increase the number of training tokens [30], both of which substantially increase the computational cost of multilingual training. Such strategies, however, assume access to abundant, high-quality multilingual text, bringing them into direct tension with the uneven data availability across languages discussed above.
Recent evidence suggests this capacity-centric view is incomplete. Emerging research indicates that the "curse" may stem less from parameter scarcity and more from the interference caused by suboptimal data quality. Both [42] and [21] demonstrate that the trade-off between English and multilingual performance is not inevitable; they find that replacing significant portions of English data with high-quality multilingual text need not degrade English capabilities. Similarly, advances in model-based data selection [50] and systematic filtering pipelines [12] reveal that when data quality is rigorously controlled, models can accommodate significantly more linguistic diversity without quality degradation. Taken together, these findings suggest that the apparent capacity constraints in multilingual scaling are often induced by low-quality data. This motivates a shift in emphasis: optimal scaling of multilingual capabilities requires intentional, multilingual-targeted data curation.
In this work, we study multilingual foundation model training through the lens of data curation, arguing that careful curation can simultaneously address the two central challenges of multilingual modeling: limited high-quality data for many languages and performance interference arising from joint multilingual training. By improving data quality, curation enhances cross-lingual transfer, reducing the amount of language-specific data required to achieve strong performance. The complement to this phenomenon is that targeted multilingual curation mitigates interference effects, alleviating the curse of multilinguality without relying solely on increasing compute. We validate these claims through controlled 60B-token bilingual studies, large-scale 1T-token pretraining, and frontier-scale pretraining at multi-tens-of-trillions of tokens.
We summarize the key contributions of this work as follows:
- Cross-lingual transfer improves with data quality. We demonstrate that refining data quality drives significant cross-lingual performance gains through controlled bilingual experiments with 3B-parameter models trained on 60B tokens. Crucially, we find this relationship is bidirectional: enhancing the quality of English data improves non-English performance in 12 out of 13 examined languages, yielding an average relative improvement of 3.91% across multilingual evaluations, while improving the quality of non-English data benefits English capabilities in 12 out of 13 languages, with an average relative improvement of 1.21% on English evaluations.
- Optimal performance requires bespoke multilingual curation. While English data curation does improve multilingual capabilities, we find that the best performance is obtained when tailored curation pipelines are built for each language. Our findings highlight that English-centric curation strategies cannot be applied blindly to other languages. Instead, it is imperative to construct tailored pipelines designed for each language's specific needs. For our 3B-parameter models trained on 60B tokens, while English curation alone drove the aforementioned 3.91% relative improvement, applying bespoke language-specific curation yielded a significantly higher 16.87% relative improvement over the uncurated baseline.
- Data quality persists through translation. Building on findings of recent large-scale translation efforts [51, 22], we explore various strategies to translate English data into non-English languages. Large-scale translation provides a mechanism for expanding training data across languages, but we find that the choice of source data critically determines its effectiveness. We observe that prioritizing high-quality English documents for translation can significantly boost performance over translations of arbitrary English documents. In experiments with 3B-parameter models, we find that augmenting the uncurated baseline with translations of random English data yields marginal gains, whereas translating high-quality, score-filtered English data leads to an average relative improvement of 5.09% over an uncurated baseline. Moreover, we find that translation is most effective when embedded within a holistic, per-language curation framework, which yields the strongest overall performance.
- Curation makes multilingual scaling remarkably compute-efficient. Under a 1T-token training budget drawn from a curated general-purpose pretraining corpus, we find that allocating approximately 8% of tokens to high-quality multilingual data (~80B tokens across 13 languages) is sufficient to achieve very strong multilingual performance, in many cases comparable to or exceeding competitive open-weight models. Our 3B and 8B models trained for 1T tokens achieve 4–10× greater training-FLOP efficiency than strong public baselines. For example, a DatologyAI 3B model trained for 1T tokens (1.8 × 10^22 FLOPs) outperforms LFM-2.5-1.2B [141], a 1.2B model trained for 28T tokens (1.9 × 10^23 FLOPs). Similarly, a DatologyAI 8B model trained for 1T tokens (4.8 × 10^22 FLOPs) outperforms SmolLM3-3B [6] and Granite-4.0-3B [5], both trained with an order of magnitude more compute. Importantly, these efficiency gains persist at frontier scale: our multilingual curation framework forms part of the 17T-token pretraining corpus for Trinity Large Base (400B-parameter MoE, 13B active; [39]), which exhibits exceptionally strong multilingual performance for its training FLOPs budget.
Our results demonstrate the critical role of multilingual data curation for multilingual model capability. From controlled 60B-token studies to 1T-token training mixtures to 17T-token frontier-scale pretraining, we show that language-aware improvements in data quality systematically enhance cross-lingual transfer, mitigate multilingual interference, and substantially improve within-language performance. These effects collectively shift the performance-compute Pareto frontier (see Figure 1). Together, these findings position high-quality, per-language curation as a practical and scalable mechanism for compute-efficient multilingual foundation model training, advancing progress toward more language-inclusive foundation models and a more evenly-distributed future.
2. Related Work
Multilingual Data Curation. The field has moved from scale-focused web ingestion [2, 4] to systematic, reproducible data curation that prioritizes quality, auditing, and strong filtering to reduce noise and contamination. Recent efforts such as FineWeb [69] and FineWeb2 [12] have formalized the curation process, releasing reproducible pipelines that scale high-quality filtering across thousands of languages. In addition to these advances in pre-training, the Aya initiative [17] presents a multilingual post-training dataset focused on instruction-following across 65 languages.
Beyond the construction of large-scale multilingual corpora, there have also been recent efforts to improve the quality of these corpora via multilingual curation; both [17] and [145] propose general-purpose model-based filtering solutions to improve multilingual data quality. There are also examples of highly specialized, language-specific curation efforts such as [146] for German and [16] for Indic languages. Our focus in this work is closely aligned with these multilingual curation efforts, and our results further emphasize the performance improvements that can be unlocked via data curation.
Data Mixing and Interference. Although the existence of cross-lingual transfer is well established [148], a central challenge remains: how to drive positive transfer across languages while minimizing negative interference [23, 20]. Temperature-based sampling is a standard heuristic [46], but it often leads to overfitting in low-resource regimes; strategies like UniMax [1] address this by capping repetition to ensure more representative coverage. A further area of research is the use of dynamic curricula to drive multilingual performance; [149] advocate a two-stage training paradigm that first pre-trains on high-resource languages and subsequently fine-tunes on lower-resource languages. Conversely, [21] conclude that staging the introduction of languages does not yield tangible improvements. While this work contains some examination of curricula and multilingual mixture proportions, its central theme is demonstrating that careful curation can significantly improve cross-lingual transfer dynamics, thus reducing interference.
Multilingual Scaling Laws. While scaling behaviors for English-centric models are well-characterized [15, 31, 32], extending these laws to the multilingual setting introduces significant complexity due to cross-lingual transfer dynamics. Early attempts primarily focused on machine translation [33], but recent work has targeted general-purpose decoder-only architectures. [25] propose a "family-based" scaling law, demonstrating that the test loss for a language family is primarily determined by its own sampling ratio, largely independent of other families in the mixture. This simplifies the analysis of inter-language competition but does not fully account for the "curse of multilinguality" phenomena observed when scaling to many languages. Addressing this, the ATLAS project recently conducted the largest study to date, covering over 400 languages and exploring cross-lingual transfer across 38 languages [30]. Their work derives an adaptive transfer scaling law that explicitly models the trade-off between adding languages and maintaining performance per parameter. This provides a first-principles guide for optimal capacity allocation in massively multilingual settings. While these laws focus on parameter-based trade-offs, our results instead demonstrate that careful, language-specific curation allows us to significantly improve on current scaling laws by shifting the bottleneck from model capacity to data quality. In this way, we are able to demonstrate that the "curse of multilinguality" is the result of correctable deficiencies in the training data.
3. Experimental Setup and Methodology
Pretraining Data. In this work we curate exclusively on top of open-source corpora. For English, we leverage the DCLM corpus [38], FineWeb [69], and the non-synthetic components of Nemotron CC v1 [43]. For non-English data, we rely on the FineWeb2 corpus [12]. While FineWeb2 supports over 1,000 languages, in this work we focus on a set of 13 diverse non-English languages spanning multiple writing systems and language families (see Table 1).
The languages above also span a wide range of resource levels in publicly available web text: Spanish is high-resource (with hundreds of billions of available tokens), whereas Hindi, Bengali, and Arabic are comparatively low-resource, making them particularly sensitive to data scarcity and quality.
Data curation. Building on our work at DatologyAI, we develop language-specific data curation pipelines for each of the languages above. For English, we build on our state-of-the-art web curation pipeline [128], which integrates complementary strategies including model-based filtering, embedding-based selection, and targeted synthetic data generation [136]. For each non-English language, we tailor our curation pipeline to that language's linguistic and distributional characteristics rather than directly applying the English recipe. Concretely, this includes selecting, validating, and/or training language-appropriate models for 1) filtering, 2) embedding for geometry-based curation, and 3) synthetic rephrasing. We also adapt filtering and mixing strategies to account for script- and language-specific artifacts and varying token scarcity across languages. To quantify the impact of these interventions, we compare to uncurated baselines, defined throughout this work as samples drawn at random from DCLM for English and from FineWeb2 for non-English data (we note that DCLM and FineWeb2 were heavily curated as part of their development, and we use the term "uncurated" to mean "not subject to DatologyAI curation").
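To make the shape of these per-language choices concrete, below is a minimal, purely illustrative sketch of how such a configuration could be expressed. All field values and model names are hypothetical placeholders, not the actual DatologyAI pipeline components.

```python
from dataclasses import dataclass

@dataclass
class LanguageCurationConfig:
    """Illustrative per-language curation settings (all values are placeholders)."""
    lang: str                 # ISO 639-1 code of the target language
    quality_filter: str       # model used for model-based quality filtering
    embedding_model: str      # encoder used for geometry-based selection
    rephrase_model: str       # generator used for synthetic rephrasing
    min_quality_score: float  # threshold below which documents are dropped

# Example: a hypothetical configuration for Hindi.
hindi_config = LanguageCurationConfig(
    lang="hi",
    quality_filter="hi-web-quality-classifier",  # placeholder name
    embedding_model="multilingual-encoder",      # placeholder name
    rephrase_model="hi-rephraser",               # placeholder name
    min_quality_score=0.5,
)
```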
Model. We present results on both 3B and 8B parameter models using a Llama-based architecture [45]. Throughout this work we use the Llama-3.2 tokenizer. All models are trained and evaluated with a context window of 4096 tokens. We note that because the focus of this work is on the effects of data curation, we did not attempt to optimize model quality via any means other than data curation. All models of a given size used identical training configurations in every way except the dataset.
Evaluation. We evaluate our models on three complementary multilingual benchmarks:
- Multilingual MMLU [29]: measures broad knowledge and academic-style reasoning across diverse subject areas, including STEM, humanities, and social sciences.
- Multilingual ARC Challenge [34]: measures multi-step reasoning on grade-school science questions; it covers a narrower domain than MMLU but places greater emphasis on compositional reasoning.
- Belebele [24]: measures multilingual reading comprehension and semantic reasoning over aligned passages, with minimal dependence on memorized factual knowledge.
We complement these multilingual evaluations with English MMLU and ARC evaluations. Note that we restrict our English-language evaluations to MMLU and ARC-Challenge for parity with the multilingual evaluations, and reserve comprehensive English and quantitative benchmarking for forthcoming companion releases. Throughout this work, we rely on the lighteval [26] framework for all our evaluations. We report zero-shot performance. For large-scale experiments (e.g., Figure 1), we adopt the multiple-choice formulation (MCF), following common practice [139, 38]. For smaller runs (e.g., our 3B, 60B-token setting), we instead use the cloze formulation. This choice is motivated by statistical efficiency: cloze-style scoring yields a denser learning signal and typically reduces variance relative to discrete option selection, making it better suited for low-resource or early-training regimes where multiple-choice accuracy can be dominated by near-random guessing [139, 38]. Finally, we note that all models in this manuscript are base (pretrained) models and were evaluated without any post-training or fine-tuning. A full list of evaluation datasets by language is provided in Appendix A.1.
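To make the distinction between the two formulations concrete, here is a minimal sketch of both scoring schemes. It assumes a hypothetical `loglikelihood(context, continuation)` helper returning the summed log-probability of a continuation under the model being evaluated; the actual lighteval implementations differ in details such as prompt templates and normalization.

```python
def score_cloze(question: str, options: list[str], loglikelihood) -> int:
    """Cloze formulation: score each answer text as a continuation of the question."""
    scores = [loglikelihood(question, " " + option) for option in options]
    return max(range(len(options)), key=lambda i: scores[i])

def score_mcf(question: str, options: list[str], loglikelihood) -> int:
    """Multiple-choice formulation: present lettered options and score only the letters."""
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"{letters[i]}. {option}" for i, option in enumerate(options)
    ) + "\nAnswer:"
    scores = [loglikelihood(prompt, " " + letters[i]) for i in range(len(options))]
    return max(range(len(options)), key=lambda i: scores[i])
```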
4. Main Findings
4.1 The impact of curation on multilingual transfer dynamics and language-specific performance
Cross-lingual transfer refers to the observation that improving representations in one language can benefit performance in other languages. As models scale and language coverage increases, such transfer becomes increasingly important. In this section, we investigate the impact of data quality on cross-lingual transfer and identify opportunities to improve downstream performance via data curation interventions on both English and non-English data.
4.1.1 Improving English data quality improves cross-lingual performance
Most multilingual language models are trained on predominantly English corpora, making English data quality a central determinant of multilingual performance. Yet the extent to which English data quality governs cross-lingual transfer remains insufficiently characterized. In a series of controlled experiments, we show that English-to-non-English transfer is strongly mediated by the quality of the English training data. In particular, improving the quality of the English portion of training data mixtures yields consistent performance gains across almost all non-English languages considered. We train a suite of 3B-parameter models for 60B tokens under a range of dataset compositions, focusing on bilingual settings consisting of English paired with a single "target" language. This design yields 13 language pairs (e.g., English–Spanish, English–German, etc.), each trained with a fixed 50/50 mixture ratio. For every pair, we compare three curation regimes:
- Uncurated English DCLM and uncurated FineWeb2 non-English data (i.e., random samples from DCLM and FineWeb2).
- DatologyAI-curated English and uncurated FineWeb2 non-English data.
- DatologyAI-curated English and DatologyAI-curated non-English data.
Cross-lingual impact of English curation. Figure 2 summarizes performance across 13 languages, reporting average scores over multilingual MMLU, ARC Challenge, and Belebele. English-only curation (light blue bars) yields consistent gains over the uncurated baseline (dark blue bars) in every language except Bengali, indicating that improving English data quality alone can measurably strengthen multilingual capabilities in otherwise uncurated languages. Averaged across languages, English curation yields a 3.91% relative improvement in non-English performance compared to an uncurated baseline.
4.1.2 Cross-lingual curation gains correlate with language similarity
While English curation improves performance in the uncurated language for 12 out of 13 examined languages, the magnitude of these benefits is not uniform. Languages such as Spanish, French, and German, which are linguistically more similar to English, exhibit more pronounced uplifts than languages such as Hindi and Arabic (8.56% compared to 3.94% relative gains, respectively). This finding is in line with [30], who report that bilingual transfer (as measured by cross-entropy loss) is predicted by language similarity across varying sampling ratios. We ask a similar, though distinct, question here: what is the relationship between language similarity and the impact of English curation on non-English model capabilities? We consider two distinct heuristic approaches to quantify linguistic distance: similarity in embedding space and perplexity. Crucially, to ensure these metrics capture linguistic divergence rather than topical shifts in the underlying text, we compute both measures on parallel samples from the FLoRes dataset [52]. For embedding distance, we report the average cosine distance between English and the target language across three distinct models: LaBSE [8], e5-small [10], and sentence-transformers [9]. Our perplexity proxy is defined as the average log perplexity per word as measured on the target-language samples under a model trained exclusively on curated English data. We explicitly do not normalize by word length, allowing this metric to serve as a raw measure of how well English-centric patterns generalize to the target distribution.
Figure 3 illustrates a significant negative correlation between proxy distance metrics and the relative improvement gained through English curation alone. Specifically, embedding distance yields a Pearson correlation of -0.62 (p=0.024), while perplexity shows an even stronger correlation of -0.70 (p=0.018). These results provide evidence that language similarity, quantified using two distinct approaches, is significantly correlated with the cross-lingual gains from English-only curation.
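The embedding-distance metric and the correlation test above can be reproduced with standard tooling. The sketch below shows one of the three encoders (LaBSE) applied to FLoRes parallel sentences; the `flores_pairs` and `relative_gains` inputs are assumed to be loaded elsewhere, and this is illustrative rather than the exact evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

# LaBSE is one of the three encoders used in the post; the other two are
# handled identically and the distances are averaged.
model = SentenceTransformer("sentence-transformers/LaBSE")

def avg_cosine_distance(english_sents: list[str], target_sents: list[str]) -> float:
    """Mean cosine distance between parallel English/target sentence pairs."""
    en = model.encode(english_sents, normalize_embeddings=True)
    tgt = model.encode(target_sents, normalize_embeddings=True)
    # With unit-normalized embeddings, cosine similarity is a row-wise dot product.
    return float(np.mean(1.0 - np.sum(en * tgt, axis=1)))

def similarity_gain_correlation(flores_pairs: dict, relative_gains: dict):
    """Correlate per-language distance with the relative gain from English-only curation.

    flores_pairs: lang -> (english_sentences, target_sentences) from FLoRes.
    relative_gains: lang -> relative improvement from English-only curation.
    """
    langs = sorted(flores_pairs)
    distances = [avg_cosine_distance(*flores_pairs[lang]) for lang in langs]
    gains = [relative_gains[lang] for lang in langs]
    return pearsonr(distances, gains)  # the post reports r ≈ -0.62 (p = 0.024)
```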
4.1.3 Optimal multilingual performance requires bespoke multilingual curation
While curating English consistently improves cross-lingual performance, it is not sufficient to reach optimal performance in any given target language. Figure 2 shows a persistent gap in performance when non-English data is curated (purple) versus when it is not (light and dark blue). Moreover, we note that the magnitude of improvements from curating for each language individually far exceeds the benefits associated with only curating English data. This finding, while not necessarily surprising, further emphasizes the need to pay careful attention to data quality and curation for each language individually. This observation reinforces the argument by [50] that generic, English-centric heuristics will not generalize across diverse alphabets and scripts. These results are also consistent with [21], who posited that it is the presence of noisy, uncurated data which harms multilingual models rather than an issue of model capacity. That is, the issue "resembles a curse of data quality" rather than a curse of multilinguality.
4.1.4 Improved non-English data curation also benefits English capabilities
Prior sections highlighted that while English curation improves non-English performance, optimal non-English performance comes from careful curation of both English and non-English data. In this section we study the effect of multilingual data curation on English performance. In contrast to the concern that adding non-English data must come at the expense of English capabilities, we observe that the benefits of data curation are bidirectional: our results demonstrate that improving the quality of the non-English data component also yields consistent gains on English benchmarks. Figure 4 compares the performance on English tasks (MMLU and ARC Average) between models trained with uncurated versus curated non-English data, while keeping the English data constant (we use curated English data throughout). The results demonstrate an average relative improvement of 1.2%. Concretely, in 12 out of the 13 languages considered, the bilingual model trained with fully curated data (i.e., both English and non-English data curated) outperformed the version with uncurated non-English data.
At a high level, these findings suggest that high-quality data acts as a globally beneficial signal in model training, providing a means to mitigate the "curse of multilinguality" by systematically improving the quality of data for each individual language.
4.2 The efficacy of translation as augmentation is determined by source quality
Machine translation has increasingly surfaced as a viable strategy for enhancing multilingual model performance, serving as a scalable source of synthetic data [42, 51]. Prior efforts, such as FineTranslations [22], have successfully utilized large-scale translation pipelines to map multilingual content into English with an explicit focus on improving translation capabilities. In this work, we instead investigate the effectiveness of translation as a general tool to drive overall multilingual capabilities. We demonstrate that this strategy can indeed drive performance gains; however, its effectiveness is heavily contingent on source data quality. Our results indicate that translating only high-quality documents, selected via score filters, leads to markedly better improvements.
To understand how best to leverage translation as a tool within multilingual data curation, we conducted controlled experiments on three languages: Hindi, Bengali, and Arabic. We trained 3B-parameter models for 60B tokens, keeping the English component fixed and curated, while varying the non-English strategy. We compared three approaches: (1) an uncurated baseline, (2) augmenting the baseline with translations of randomly selected English data (i.e., blind translations), and (3) augmenting with translations of high-quality, scored English data. When scoring English data, we used a fastText classifier similar to [69].
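As an illustration of the score-filtering step, the sketch below shows how a fastText quality classifier could be used to select English documents as translation sources. The model path, label name, and threshold are hypothetical placeholders, not the exact classifier or cutoff used in our experiments.

```python
import fasttext

# Hypothetical quality classifier in fastText format; path, label, and
# threshold below are illustrative placeholders.
clf = fasttext.load_model("english_quality_classifier.bin")

def quality_score(document: str) -> float:
    """Probability the classifier assigns to the (placeholder) high-quality label."""
    text = document.replace("\n", " ")      # fastText expects single-line input
    labels, probs = clf.predict(text, k=2)  # top-2 labels with their probabilities
    return dict(zip(labels, probs)).get("__label__high_quality", 0.0)

def select_translation_sources(english_docs: list[str], threshold: float = 0.9) -> list[str]:
    """Keep only high-scoring English documents as sources for translation."""
    return [doc for doc in english_docs if quality_score(doc) > threshold]
```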
The results, illustrated in Figure 5, reveal a clear hierarchy of performance. Augmenting the uncurated baseline with arbitrarily translated English data yields only marginal performance improvements over a purely uncurated baseline. However, the magnitude of improvements grows when translating high-quality, score-filtered data. This result mirrors findings in [136], which showed that the quality of input documents for synthetic rephrasing is crucial to obtaining strong performance.
However, a significant performance gap remains. Our bespoke curation strategy (dark blue line) substantially outperforms both the uncurated baseline and the translation-augmented models. These findings imply that while translation can be a valuable component of multilingual data curation, as reported by [51], its efficacy is ultimately determined by source data quality, and the best performance is obtained with a holistic curation approach across all target languages.
4.3 Integrating multilingual curation into a general pretraining mix
Sections 4.1 and 4.2 presented smaller-scale, controlled experiments intended to dissociate the impact of different multilingual curation choices. However, an open question is how such curation strategies scale to larger token budgets and models, and how multilingual curation interacts with general-purpose curation. To that end, we curated a 20T-token dataset intended for foundation model training and frontier capabilities across English, multilingual, code, STEM, and reasoning skills. The curation included generating over 8T tokens of synthetic data spanning English and non-English web text, code, and STEM. A random 1T-token subset of this dataset was used to train both 3B and 8B parameter models following a Llama architecture.
Multi-Phase Data Curriculum. To balance multiple diverse data streams, we implemented a multi-phase training curriculum that progressively increases the density of multilingual tokens. The mixture of tokens followed that used in the Trinity Large model [39]. The training process was divided into three distinct phases:
- Phase 1: 650B tokens with 5% multilingual data
- Phase 2: 250B tokens with 10% multilingual data
- Phase 3: 100B tokens with 20% multilingual data
Across the full training duration, this schedule resulted in an overall allocation of 7.75% of tokens to our multilingual curation pipeline, supporting 13 languages and thus resulting in an average of roughly 6B tokens per language. Despite this seemingly modest multilingual budget, our bespoke curation strategy allows models to obtain competitive performance across diverse languages spanning Latin, Cyrillic, Arabic, Indic, and CJK scripts.
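The 7.75% figure follows directly from the phase schedule; as a quick check:

```python
# Token-weighted multilingual fraction across the three curriculum phases.
phases = [(650e9, 0.05), (250e9, 0.10), (100e9, 0.20)]  # (tokens, multilingual share)

total_tokens = sum(tokens for tokens, _ in phases)                     # 1.0e12 (1T)
multilingual_tokens = sum(tokens * share for tokens, share in phases)  # 7.75e10 (77.5B)

print(multilingual_tokens / total_tokens)  # 0.0775  -> 7.75% of training tokens
print(multilingual_tokens / 13)            # ~5.96e9 -> roughly 6B tokens per language
```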
Establishing a New Pareto Frontier. Figure 1 positions DatologyAI models against several open-weights models across English (MMLU + ARC Average) and three multilingual benchmarks (Multilingual MMLU, ARC, and Belebele). The y-axis denotes the error rate (1 - Average Accuracy) on a log scale, where lower values indicate superior capabilities. The shaded gray region encapsulates the performance-compute trade-off established by leading open-weights baselines, including Qwen3 [129], Granite [5], SmolLM3 [6], and LFM-2.5 [141]. Our results demonstrate a marked shift in efficiency: the DatologyAI models consistently improve upon the established Pareto frontier. By achieving error rates comparable to significantly larger and/or more compute-intensive baselines, we effectively redefine the Pareto frontier for multilingual foundation models.
Data curation unlocks token efficient data mixtures. We emphasize that the DatologyAI models in Figure 1 use the smallest multilingual data mixture among all models that report mixture composition. Specifically, DatologyAI allocates only 7.75% of training tokens to multilingual data while supporting a substantially broader set of languages. This proportion is markedly lower than those reported by comparable baselines, including LFM, which uses 20% multilingual tokens [141], and SmolLM3, which employs a 12% multilingual mixture [6].
In Appendix A.3, we report per-language performance breakdowns for each evaluation. We present results across three groups of languages: Latin-script languages (Figure 6), Indic and Arabic (Figure 7), and Chinese, Japanese, Korean, and Russian (Figure 8). Across all 13 languages, DatologyAI’s multilingual curation consistently improves the performance–compute Pareto frontier, yielding higher accuracy at a given FLOP budget (or comparable accuracy with less compute). We also present results comparing to various language-specialized base models, i.e., models that focus on achieving strong performance on particular languages. Examples include the Sarvam-1 model [96], focused on Indic languages, in Figure 7; Trillion Labs Tri-7B [140], focused on Korean, Japanese, and Chinese, in Figure 8; and SEA-LION-v3-9B [97], focused on Southeast Asian languages, also in Figure 8. The specialized models also significantly improve upon the Pareto frontier, but their performance is comparable to that of the DatologyAI models on the particular languages they focus on: for example, Figures 7 and 8 show that DatologyAI models can meet or exceed the performance of specialized models such as Sarvam-1 and Tri-7B, which are trained using similar or larger FLOP budgets.
The rightmost columns in Figures 6-8 illustrate the relationship between language-specific performance and aggregate multilingual proficiency. DatologyAI models consistently align with the line of unity, reflecting a data curation strategy that prioritizes broad multilingual parity over individual language optimization. In contrast, specialized models like Sarvam-1 and Tri-7B exhibit a clear departure from this trend, appearing above the line of unity for their target languages. However, their aggregate multilingual performance (shown along the x-axis) reveals a substantial degradation in overall capabilities. This highlights that these models have traded general multilingualism for localized expertise. Notably, models curated with DatologyAI achieve competitive results without necessitating such performance tradeoffs. Finally, Figure 9 visualizes performance on individual languages as a function of the number of training tokens in that language, for DatologyAI models and for the subset of evaluated models for which we could obtain reasonable estimates of per-language training tokens (we describe our methodology in Appendix A.4). This figure clearly visualizes the orders-of-magnitude improvements in per-language data efficiency obtained with DatologyAI curation.
Taken together, the results in Figure 1 and Figures 6-9 demonstrate that DatologyAI multilingual data curation is both highly effective and scales to the frontier model training regime. The latter point is reinforced by results from Trinity Large, which was pretrained on 17T tokens drawn from the broader DatologyAI-curated corpus and exhibits exceptionally strong multilingual performance.
5. Conclusion
Multilinguality is an essential capability for modern foundation models, yet achieving high-quality multilingual performance with broad coverage remains challenging due to uneven data availability across languages and the so-called "curse of multilinguality". In this work, we revisit multilingual pretraining from a data-centric perspective and show that many of the observed constraints and regressions are not inherent to multilinguality, but instead reflect deficiencies in data quality and curation.
Through controlled bilingual experiments, we demonstrate that cross-lingual transfer is strongly mediated by data quality: improving English curation alone yields consistent gains across nearly all non-English languages examined (3.91% average relative improvement across multilingual MMLU, ARC-Challenge, and Belebele), while improving non-English curation reciprocally benefits English performance (1.21% average relative improvement). These findings challenge a purely zero-sum framing of multilingual modeling: higher-quality training data can strengthen multilingual capability without requiring commensurate sacrifices elsewhere in the mixture.
English curation alone, however, is insufficient for optimal performance. Bespoke, per-language pipelines tailored to linguistic and distributional properties deliver substantially larger gains, reaching 16.87% relative improvement in controlled settings. We further show that translation is most effective when it preserves source quality: translating score-filtered English documents yields materially larger gains than translating arbitrary text, and integrating high-quality document translation as part of a holistic multilingual curation strategy yields far superior results overall.
We productionized these principles through a 20T-token general-purpose pretraining corpus, whose multilingual component was constructed using the curation strategies explored here. Under a controlled 1T-token training budget, 3B and 8B models achieve comparable or stronger multilingual performance than competitive open-weight baselines at 4–10× lower training compute, redefining the multilingual performance–compute Pareto frontier. These efficiency gains persist at frontier scale: Trinity Large Base (400B/A13B), trained on 17T tokens of this corpus, exhibits exceptionally strong multilingual performance relative to its FLOPs budget, validating that the curation principles described here remain effective in the multi–tens-of-trillions regime. We emphasize that for both the 1T-token training budget experiments as well as for Trinity Large, the multilingual performance is obtained using a comparatively minor multilingual token budget of 7.75% of total training tokens.
Several avenues for future work follow. Our results motivate more systematic, compute-aware mixture design, including per-language sampling strategies and phased curricula that balance improvements in one language against interference in others while ensuring adequate support for low-resource languages. Scaling this agenda will likely require more robust multilingual evaluation frameworks [98]. Finally, extending these data-centric principles to multimodal and vision–language model (VLM) training remains an important direction, where evaluation quality and coverage are also central bottlenecks [40].
In conclusion, viewed through the data-centric lens advanced in this work, multilinguality need not be a curse of scale, but instead an opportunity to leverage language-aware curation to achieve inclusive, capable foundation models.
Get in Touch!
If you're interested in pushing the bounds of what's possible with data curation, we're looking for talented Members of Technical Staff who have experience doing data research, building research tooling, translating science into products, and building scalable data products.
If you're interested in training multimodal and/or text models faster, better, or smaller, Get in touch!
Follow us on twitter for insights (and memes) about data!
Contributions and Acknowledgements
Core and technical contributors listed alphabetically.
- Core Contributors: Aldo Gael Carranza, Kaleigh Mentzer, and Ricardo Pio Monti
- Technical Contributors: Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Haoli Yin, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Luke Merrick, Maximilian Böther, Parth Doshi, Paul Burstein, Pratyush Maini, Rishabh Adaiga, Sid Joshi, Spandan Das, Tony Jiang, Vineeth Dorna, and Zhengping Wang
- Leadership: Bogdan Gaza, Ari Morcos, and Matthew Leavitt
- Acknowledgements: Liz Gatapia, Jacqueline Liu, Tiffanie Pham, Sylvia Hoang, Kylie Clement, Elise Clark
Appendix
A.1 Evaluation datasets per language
In the table below, Global MMLU is the dataset provided by [29], while Indic MMLU refers to the translation into Indic languages (available here). For ARC evaluations, we rely on evaluation datasets released as part of the Okapi framework [28]. This contains evaluations for the majority of the languages we consider, with the exception of Korean, Portuguese, Hindi, and Bengali. For Korean, we use the Ko-ARC evaluation [27], and for Portuguese we use the translated version provided by LumiOpen (available here). Finally, for Indic ARC evaluations we use Indic ARC (available here). In the case of Belebele, the original evaluation dataset supports all of our languages [24] (available here).
A.2 Details of FLOP budget computations for open-source models
We summarize the training compute (in FLOPs) for each open-source baseline reported in Figure 1. Throughout this work, we estimate training FLOPs using the standard approximation:
Total FLOPs ≈ 6 × N × D,
where N is the number of (trainable) parameters and D is the number of training tokens. In the table, B = 10^9 and T = 10^12. For MoE models, we use the active parameter count per token as N.
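As a minimal sketch, the helper below reproduces the two DatologyAI estimates quoted in Section 4 using this approximation:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6 * N * D estimate of training compute.

    n_params: number of trainable parameters (for MoE models, use the
    active parameter count per token); n_tokens: number of training tokens.
    """
    return 6.0 * n_params * n_tokens

print(f"{train_flops(3e9, 1e12):.1e}")  # DatologyAI 3B, 1T tokens -> 1.8e+22 FLOPs
print(f"{train_flops(8e9, 1e12):.1e}")  # DatologyAI 8B, 1T tokens -> 4.8e+22 FLOPs
```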
A.3 Per-language evaluation performance
Figure 6: Per-language performance vs. training compute for Latin-script languages. Rows correspond to Spanish, Portuguese, French, German, Vietnamese, and Indonesian. Columns 1–3 report performance on MMLU, ARC, and Belebele as a function of training FLOPs (x-axis). The rightmost column compares each model’s language-specific average score (y-axis) to its all-language average across multilingual evaluations (x-axis); the dashed line indicates parity (y = x).
Figure 7: Per-language performance vs. training compute for Indic and Arabic-script languages. Rows correspond to Hindi, Bengali, and Arabic. Columns 1–3 report performance on MMLU, ARC, and Belebele as a function of training FLOPs (x-axis). The rightmost column compares each model’s language-specific average score (y-axis) to its all-language average across multilingual evaluations (x-axis); the dashed line indicates parity (y = x).
Figure 8: Per-language performance vs. training compute for CJK languages and Russian. Rows correspond to Chinese, Japanese, Korean, and Russian. Columns 1–3 report performance on MMLU, ARC, and Belebele as a function of training FLOPs (x-axis). The rightmost column compares each model’s language-specific average score (y-axis) to its all-language average across multilingual evaluations (x-axis); the dashed line indicates parity (y = x). We note that there is no ARC Challenge evaluation available for Japanese.
A.4 Multilingual data efficiency gains
In this section, we quantify multilingual performance as a function of the training token count dedicated to each language. Conducting this analysis is challenging for several reasons. First, several of the evaluated models use their own tokenizers, which makes token counts imperfectly comparable across models. Second, precise per-language token counts are often unavailable for open-source models, and so we aim to estimate them as best we can. We nevertheless include this analysis because DatologyAI curation yields improvements in multilingual data efficiency that are large, often by orders of magnitude, so the qualitative conclusion is robust even under reasonable uncertainty in these estimates.
In this section we only include models for which we could obtain a reasonably reliable estimate of per-language tokens using public information (a rough worked tally follows the list). These models are:
- SmolLM3: This model used 12% multilingual data over 11T tokens, supporting a range of languages including Spanish, German, French, and Portuguese. We compute the number of tokens per language directly from the configurations that were publicly shared (available here).
- Llama-3.2: This model used 8% multilingual data over 9T tokens, supporting seven languages. This is approximately 100B tokens per language.
- Sarvam-1: This model was trained on a 2T-token Indic-language corpus, which contained 20% Hindi tokens and 10% Bengali tokens. This corresponds to 200B and 100B tokens for each language respectively.
- Trillion Labs 7B: This model was trained on a 2T-token dataset, 10% of which was multilingual with a primary focus on Korean. As such, we estimate this model was trained with approximately 200B Korean tokens.
- DatologyAI: As referenced in Section 4, DatologyAI models were trained for 1T tokens with a 7.75% multilingual component. This corresponds to a total of roughly 77.5B multilingual tokens across thirteen languages, or approximately 6B tokens per language.
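For the models above with roughly even per-language splits, the estimate reduces to a one-line calculation; a minimal sketch (all figures are approximations based on public information; SmolLM3 and Sarvam-1 are omitted because their per-language shares are uneven and taken from published configurations instead):

```python
# Rough per-language token estimates for models with (approximately) even splits.
def tokens_per_language(total_tokens: float, multilingual_fraction: float, num_languages: int) -> float:
    return total_tokens * multilingual_fraction / num_languages

llama_3_2  = tokens_per_language(9e12, 0.08, 7)     # ~1.0e11 -> ~100B tokens per language
trillion7b = tokens_per_language(2e12, 0.10, 1)     # ~2.0e11 -> ~200B tokens, treated as Korean
datologyai = tokens_per_language(1e12, 0.0775, 13)  # ~6.0e9  -> ~6B tokens per language

print(f"Llama-3.2 uses ~{llama_3_2 / datologyai:.0f}x more tokens per language than DatologyAI")
```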
Figure 9 visualizes performance across various languages as a function of the estimated number of training tokens in that language. We observe that DatologyAI curation yields orders-of-magnitude gains in token efficiency.
Figure 9: Per-Language Performance vs. Multilingual Training Tokens. We visualize the number of language-specific training tokens (x-axis, billions) and the average downstream performance across a range of models. We only include models where the number of tokens per language could be reasonably estimated based on public information. DatologyAI models were trained with only 6B tokens per language (7.75% multilingual overall). The plots demonstrate significant data efficiency improvements from DatologyAI curation compared to open-source baselines such as Llama-3.2 and SmolLM3, and language-specific models like Sarvam-1. We add an asterisk beside the name of all non-DatologyAI models to highlight that we estimated the number of tokens per language to the best of our ability based on publicly available information.
References