ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
Turning the curse of multilinguality into a blessing with principled data curation
NOTE: If you prefer a PDF, the arXiv version of this post can be found here.
Executive Summary
Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the “curse of multilinguality”. We study multilingual data curation across thirteen languages spanning multiple scripts, language families, and resource levels, showing that many reported regressions are not inherent to multilingual scaling: they stem from correctable deficiencies in data quality and composition, not from fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance on MMLU, ARC-Challenge, and Belebele in 12 of 13 languages (3.9% average relative gain), while curating non-English yields reciprocal improvements in English (1.2% average gain). Bespoke per-language curation produces substantially larger within-language improvements, with up to 16.9% relative gains over uncurated baselines. Extending these findings to large-scale general-purpose training mixtures, we show that curated multilingual allocations comprising under 8% of total tokens remain remarkably effective. We operationalize this approach within a broader large-scale effort that produced a 20T-token pretraining corpus derived entirely from public sources. Models with 3B and 8B parameters trained on a 1T-token random subset achieve competitive multilingual accuracy with 4–10× fewer training FLOPs than strong public baselines, establishing a new Pareto frontier in multilingual performance versus compute (Figure 1). Moreover, these benefits extend to frontier model scale: the 20T-token corpus served as part of the pretraining dataset for Trinity Large (400B/A13B), which exhibits strong multilingual performance relative to its training FLOPs. Together, these results show that targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling.
The future is already here – it's just not evenly distributed.
— William Gibson
1. Introduction
Large language models (LLMs) have fundamentally reshaped the landscape of artificial intelligence, yet their benefits remain unevenly distributed across languages. Although modern models have demonstrated remarkable capabilities in English, these capabilities often degrade substantially when applied to non-English settings [48, 37]. Bridging this gap is not merely an architectural challenge, but fundamentally a data-centric one: training on large volumes of high-quality data is essential for achieving frontier-level model capabilities. English benefits from multiple large-scale, carefully curated public corpora [44, 69, 38, 43], whereas multilingual corpora are far more fragmented. Many non-English languages occupy a long tail characterized by limited, noisy, or inconsistently curated data, constraining multilingual model performance regardless of architectural capacity.
Beyond uneven data availability across languages, multilingual modeling faces an additional distinct challenge: the so-called "curse of multilinguality" [23, 36]. This term refers to the empirical observation that training a single model across an increasing number of languages often leads to degraded per-language performance, even under comparable training budgets. Historically, the consensus view attributed this phenomenon to a capacity bottleneck, framing multilingual modeling as a zero-sum game in which distinct languages compete for finite parameters or model capacity [2, 23, 36]. Under this paradigm, the primary solutions have been to scale model size [14, 35] or increase the number of training tokens [30], both of which substantially increase the computational cost of multilingual training. Such strategies, however, assume access to abundant, high-quality multilingual text, bringing them into direct tension with the uneven data availability across languages discussed above.
Recent evidence suggests this capacity-centric view is incomplete. Emerging research indicates that the "curse" may stem less from parameter scarcity and more from the interference caused by suboptimal data quality. Both [42] and [21] demonstrate that the trade-off between English and multilingual performance is not inevitable; they find that replacing significant portions of English data with high-quality multilingual text need not degrade English capabilities. Similarly, advances in model-based data selection [50] and systematic filtering pipelines [12] reveal that when data quality is rigorously controlled, models can accommodate significantly more linguistic diversity without quality degradation. Taken together, these findings suggest that the apparent capacity constraints in multilingual scaling are often induced by low-quality data. This motivates a shift in emphasis: optimal scaling of multilingual capabilities requires intentional, multilingual-targeted data curation.
In this work, we study multilingual foundation model training through the lens of data curation, arguing that careful curation can simultaneously address the two central challenges of multilingual modeling: limited high-quality data for many languages and performance interference arising from joint multilingual training. By improving data quality, curation enhances cross-lingual transfer, reducing the amount of language-specific data required to achieve strong performance. The complement to this phenomenon is that targeted multilingual curation mitigates interference effects, alleviating the curse of multilinguality without relying solely on increasing compute. We validate these claims through controlled 60B-token bilingual studies, large-scale 1T-token pretraining, and frontier-scale pretraining at multi-tens-of-trillions of tokens.
We summarize the key contributions of this work as follows:
- Cross-lingual transfer improves with data quality. We demonstrate that refining data quality drives significant cross-lingual performance gains through controlled bilingual experiments with 3B-parameter models trained on 60B tokens. Crucially, we find this relationship is bidirectional: enhancing the quality of English data improves non-English performance in 12 out of 13 examined languages, yielding an average relative improvement of 3.91% across multilingual evaluations, while improving the quality of non-English data benefits English capabilities in 12 out of 13 languages, with an average relative improvement of 1.21% on English evaluations.
- Optimal performance requires bespoke multilingual curation. While English data curation does improve multilingual capabilities, we find that the best performance is obtained when tailored curation pipelines are built for each language. Our findings highlight that English-centric curation strategies cannot be applied blindly to other languages. Instead, it is imperative to construct tailored pipelines designed for each language's specific needs. For our 3B-parameter models trained on 60B tokens, while English curation alone drove the aforementioned 3.91% relative improvement, applying bespoke language-specific curation yielded a significantly higher 16.87% relative improvement over the uncurated baseline.
- Data quality persists through translation. Building on findings of recent large-scale translation efforts [51, 22], we explore various strategies to translate English data into non-English languages. Large-scale translation provides a mechanism for expanding training data across languages, but we find that the choice of source data critically determines its effectiveness. We observe that prioritizing high-quality English documents for translation can significantly boost performance over translations of arbitrary English documents. In experiments with 3B-parameter models, we find that augmenting the uncurated baseline with translations of random English data yields marginal gains, whereas translating high-quality, score-filtered English data leads to an average relative improvement of 5.09% over an uncurated baseline. Moreover, we find that translation is most effective when embedded within a holistic, per-language curation framework, which yields the strongest overall performance.
- Curation makes multilingual scaling remarkably compute-efficient. Under a 1T-token training budget drawn from a curated general-purpose pretraining corpus, we find that allocating approximately 8% of tokens to high-quality multilingual data (~80B tokens across 13 languages) is sufficient to achieve very strong multilingual performance, in many cases comparable to or exceeding competitive open-weight models. Our 3B and 8B models trained for 1T tokens achieve 4–10× greater training-FLOP efficiency than strong public baselines. For example, a DatologyAI 3B model trained for 1T tokens (1.8 × 10^22 FLOPs) outperforms LFM-2.5-1.2B [141], a 1.2B model trained for 28T tokens (1.9 × 10^23 FLOPs). Similarly, a DatologyAI 8B model trained for 1T tokens (4.8 × 10^22 FLOPs) outperforms SmolLM3-3B [6] and Granite-4.0-3B [5], both trained with an order of magnitude more compute. Importantly, these efficiency gains persist at frontier scale: our multilingual curation framework forms part of the 17T-token pretraining corpus for Trinity Large Base (400B-parameter MoE, 13B active; [39]), which exhibits exceptionally strong multilingual performance for its training FLOPs budget.
Our results demonstrate the critical role of multilingual data curation for multilingual model capability. From controlled 60B-token studies to 1T-token training mixtures to 17T-token frontier-scale pretraining, we show that language-aware improvements in data quality systematically enhance cross-lingual transfer, mitigate multilingual interference, and substantially improve within-language performance. These effects collectively shift the performance-compute Pareto frontier (see Figure 1). Together, these findings position high-quality, per-language curation as a practical and scalable mechanism for compute-efficient multilingual foundation model training, advancing progress toward more language-inclusive foundation models and a more evenly-distributed future.
2. Related Work
Multilingual Data Curation. The field has moved from scale-focused web ingestion [2, 4] to systematic, reproducible data curation that prioritizes quality, auditing, and strong filtering to reduce noise and contamination. Recent efforts such as FineWeb [69] and FineWeb2 [12] have formalized the curation process, releasing reproducible pipelines that scale high-quality filtering across thousands of languages. In addition to these advances in pre-training, the Aya initiative [17] presents a multilingual post-training dataset focused on instruction-following across 65 languages.
Beyond the construction of large-scale multilingual corpora, there have also been recent efforts to improve the quality of these corpora via multilingual curation; both [17] and [145] propose general-purpose model-based filtering solutions to improve multilingual data quality. There are also examples of highly specialized, language-specific curation efforts such as [146] for German and [16] for Indic languages. Our focus in this work is closely aligned with these multilingual curation efforts, and our results further emphasize the performance improvements that can be unlocked via data curation.
Data Mixing and Interference. Although the existence of cross-lingual transfer is well established [148], a central challenge remains: how to drive positive transfer across languages while minimizing negative interference [23, 20]. Temperature-based sampling is a standard heuristic [46], but it often leads to overfitting in low-resource regimes; strategies like UniMax [1] address this by capping repetition to ensure more representative coverage. A further area of research is the use of dynamic curricula to drive multilingual performance; [149] advocate a two-stage training paradigm that first pre-trains on high-resource languages and subsequently fine-tunes on lower-resource languages. Conversely, [21] conclude that staging the introduction of languages does not yield tangible improvements. While this work contains some examination of curricula and multilingual mixture proportions, its central theme is demonstrating that careful curation can significantly improve cross-lingual transfer dynamics, thus reducing interference.
Multilingual Scaling Laws. While scaling behaviors for English-centric models are well-characterized [15, 31, 32], extending these laws to the multilingual setting introduces significant complexity due to cross-lingual transfer dynamics. Early attempts primarily focused on machine translation [33], but recent work has targeted general-purpose decoder-only architectures. [25] propose a "family-based" scaling law, demonstrating that the test loss for a language family is primarily determined by its own sampling ratio, largely independent of other families in the mixture. This simplifies the analysis of inter-language competition but does not fully account for the "curse of multilinguality" phenomena observed when scaling to many languages. Addressing this, the ATLAS project recently conducted the largest study to date, covering over 400 languages and exploring cross-lingual transfer across 38 languages [30]. Their work derives an adaptive transfer scaling law that explicitly models the trade-off between adding languages and maintaining performance per parameter. This provides a first-principles guide for optimal capacity allocation in massively multilingual settings. While these laws focus on parameter-based trade-offs, our results instead demonstrate that careful, language-specific curation allows us to significantly improve on current scaling laws by shifting the bottleneck from model capacity to data quality. In this way, we are able to demonstrate that the "curse of multilinguality" is the result of correctable deficiencies in the training data.
3. Experimental Setup and Methodology
Pretraining Data. In this work we curate exclusively on top of open-source corpora. For English, we leverage the DCLM corpus [38], FineWeb [69], and the non-synthetic components of Nemotron CC v1 [43]. For non-English data, we rely on the FineWeb2 corpus [12]. While FineWeb2 supports over 1,000 languages, in this work we focus on a set of 13 diverse non-English languages spanning multiple writing systems and language families (see Table 1).
The languages above also span a wide range of resource levels in publicly available web text: Spanish is high-resource (with hundreds of billions of available tokens), whereas Hindi, Bengali, and Arabic are comparatively low-resource, making them particularly sensitive to data scarcity and quality.
Data curation. Building on our work at DatologyAI, we develop language-specific data curation pipelines for each of the languages above. For English, we build on our state-of-the-art web curation pipeline [128], which integrates complementary strategies including model-based filtering, embedding-based selection, and targeted synthetic data generation [136]. For each non-English language, we tailor our curation pipeline to that language's linguistic and distributional characteristics rather than directly applying the English recipe. Concretely, this includes selecting, validating, and/or training language-appropriate models for 1) filtering, 2) embedding for geometry-based curation, and 3) synthetic rephrasing. We also adapt filtering and mixing strategies to account for script- and language-specific artifacts and varying token scarcity across languages. To quantify the impact of these interventions, we compare to uncurated baselines, defined throughout this work as samples drawn at random from DCLM for English and from FineWeb2 for non-English data (we note that DCLM and FineWeb2 were heavily curated as part of their development, and we use the term "uncurated" to mean "not subject to DatologyAI curation").
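To make the shape of these per-language choices concrete, below is a minimal, purely illustrative sketch of how such a configuration could be expressed. All field values and model names are hypothetical placeholders, not the actual DatologyAI pipeline components.

```python
from dataclasses import dataclass

@dataclass
class LanguageCurationConfig:
    """Illustrative per-language curation settings (all values are placeholders)."""
    lang: str                 # ISO 639-1 code of the target language
    quality_filter: str       # model used for model-based quality filtering
    embedding_model: str      # encoder used for geometry-based selection
    rephrase_model: str       # generator used for synthetic rephrasing
    min_quality_score: float  # threshold below which documents are dropped

# Example: a hypothetical configuration for Hindi.
hindi_config = LanguageCurationConfig(
    lang="hi",
    quality_filter="hi-web-quality-classifier",  # placeholder name
    embedding_model="multilingual-encoder",      # placeholder name
    rephrase_model="hi-rephraser",               # placeholder name
    min_quality_score=0.5,
)
```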
Model. We present results on both 3B and 8B parameter models using a Llama-based architecture [45]. Throughout this work we use the Llama-3.2 tokenizer. All models are trained and evaluated with a context window of 4096 tokens. We note that because the focus of this work is on the effects of data curation, we did not attempt to optimize model quality via any means other than data curation. All models of a given size used identical training configurations in every way except the dataset.
Evaluation. We evaluate our models on three complementary multilingual benchmarks:
- Multilingual MMLU [29]: measures broad knowledge and academic-style reasoning across diverse subject areas, including STEM, humanities, and social sciences.
- Multilingual ARC Challenge [34]: measures multi-step reasoning on grade-school science questions; it covers a narrower domain than MMLU but places greater emphasis on compositional reasoning.
- Belebele [24]: measures multilingual reading comprehension and semantic reasoning over aligned passages, with minimal dependence on memorized factual knowledge.
We complement these multilingual evaluations with English MMLU and ARC evaluations. Note that we restrict our English-language evaluations to MMLU and ARC-Challenge for parity with the multilingual evaluations, and reserve comprehensive English and quantitative benchmarking for forthcoming companion releases. Throughout this work, we rely on the lighteval [26] framework for all our evaluations. We report zero-shot performance. For large-scale experiments (e.g., Figure 1), we adopt the multiple-choice formulation (MCF), following common practice [139, 38]. For smaller runs (e.g., our 3B, 60B-token setting), we instead use the cloze formulation. This choice is motivated by statistical efficiency: cloze-style scoring yields a denser learning signal and typically reduces variance relative to discrete option selection, making it better suited for low-resource or early-training regimes where multiple-choice accuracy can be dominated by near-random guessing [139, 38]. Finally, we note that all models in this manuscript are base (pretrained) models and were evaluated without any post-training or fine-tuning. A full list of evaluation datasets by language is provided in Appendix A.1.
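To make the distinction between the two formulations concrete, here is a minimal sketch of both scoring schemes. It assumes a hypothetical `loglikelihood(context, continuation)` helper returning the summed log-probability of a continuation under the model being evaluated; the actual lighteval implementations differ in details such as prompt templates and normalization.

```python
def score_cloze(question: str, options: list[str], loglikelihood) -> int:
    """Cloze formulation: score each answer text as a continuation of the question."""
    scores = [loglikelihood(question, " " + option) for option in options]
    return max(range(len(options)), key=lambda i: scores[i])

def score_mcf(question: str, options: list[str], loglikelihood) -> int:
    """Multiple-choice formulation: present lettered options and score only the letters."""
    letters = "ABCD"
    prompt = question + "\n" + "\n".join(
        f"{letters[i]}. {option}" for i, option in enumerate(options)
    ) + "\nAnswer:"
    scores = [loglikelihood(prompt, " " + letters[i]) for i in range(len(options))]
    return max(range(len(options)), key=lambda i: scores[i])
```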
4. Main Findings
4.1 The impact of curation on multilingual transfer dynamics and language-specific performance
Cross-lingual transfer refers to the observation that improving representations in one language can benefit performance in other languages. As models scale and language coverage increases, such transfer becomes increasingly important. In this section, we investigate the impact of data quality on cross-lingual transfer and identify opportunities to improve downstream performance via data curation interventions on both English and non-English data.
4.1.1 Improving English data quality improves cross-lingual performance
Most multilingual language models are trained on predominantly English corpora, making English data quality a central determinant of multilingual performance. Yet the extent to which English data quality governs cross-lingual transfer remains insufficiently characterized. In a series of controlled experiments, we show that English-to-non-English transfer is strongly mediated by the quality of the English training data. In particular, improving the quality of the English portion of training data mixtures yields consistent performance gains across almost all non-English languages considered. We train a suite of 3B-parameter models for 60B tokens under a range of dataset compositions, focusing on bilingual settings consisting of English paired with a single "target" language. This design yields 13 language pairs (e.g., English–Spanish, English–German, etc.), each trained with a fixed 50/50 mixture ratio. For every pair, we compare three curation regimes:
- Uncurated English DCLM and uncurated FineWeb2 non-English data (i.e., random samples from DCLM and FineWeb2).
- DatologyAI-curated English and uncurated FineWeb2 non-English data.
- DatologyAI-curated English and DatologyAI-curated non-English data.
Cross-lingual impact of English curation. Figure 2 summarizes performance across 13 languages, reporting average scores over multilingual MMLU, ARC Challenge, and Belebele. English-only curation (light blue bars) yields consistent gains over the uncurated baseline (dark blue bars) in every language except Bengali, indicating that improving English data quality alone can measurably strengthen multilingual capabilities in otherwise uncurated languages. Averaged across languages, English curation yields a 3.91% relative improvement in non-English performance compared to an uncurated baseline.
4.1.2 Cross-lingual curation gains correlate with language similarity
While English curation improves performance in the uncurated language for 12 out of 13 examined languages, the magnitude of these benefits is not uniform. Languages such as Spanish, French, and German, which are linguistically more similar to English, exhibit more pronounced uplifts than languages such as Hindi and Arabic (8.56% compared to 3.94% relative gains, respectively). This finding is in line with [30], who report that bilingual transfer (as measured by cross-entropy loss) is predicted by language similarity across varying sampling ratios. We ask a similar, though distinct, question here: what is the relationship between language similarity and the impact of English curation on non-English model capabilities? We consider two distinct heuristic approaches to quantify linguistic distance: similarity in embedding space and perplexity. Crucially, to ensure these metrics capture linguistic divergence rather than topical shifts in the underlying text, we compute both measures on parallel samples from the FLoRes dataset [52]. For embedding distance, we report the average cosine distance between English and the target language across three distinct models: LaBSE [8], e5-small [10], and sentence-transformers [9]. Our perplexity proxy is defined as the average log perplexity per word as measured on the target-language samples under a model trained exclusively on curated English data. We explicitly do not normalize by word length, allowing this metric to serve as a raw measure of how well English-centric patterns generalize to the target distribution.
Figure 3 illustrates a significant negative correlation between proxy distance metrics and the relative improvement gained through English curation alone. Specifically, embedding distance yields a Pearson correlation of -0.62 (p=0.024), while perplexity shows an even stronger correlation of -0.70 (p=0.018). These results provide evidence that language similarity, quantified using two distinct approaches, is significantly correlated with the cross-lingual gains from English-only curation.
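The embedding-distance metric and the correlation test above can be reproduced with standard tooling. The sketch below shows one of the three encoders (LaBSE) applied to FLoRes parallel sentences; the `flores_pairs` and `relative_gains` inputs are assumed to be loaded elsewhere, and this is illustrative rather than the exact evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

# LaBSE is one of the three encoders used in the post; the other two are
# handled identically and the distances are averaged.
model = SentenceTransformer("sentence-transformers/LaBSE")

def avg_cosine_distance(english_sents: list[str], target_sents: list[str]) -> float:
    """Mean cosine distance between parallel English/target sentence pairs."""
    en = model.encode(english_sents, normalize_embeddings=True)
    tgt = model.encode(target_sents, normalize_embeddings=True)
    # With unit-normalized embeddings, cosine similarity is a row-wise dot product.
    return float(np.mean(1.0 - np.sum(en * tgt, axis=1)))

def similarity_gain_correlation(flores_pairs: dict, relative_gains: dict):
    """Correlate per-language distance with the relative gain from English-only curation.

    flores_pairs: lang -> (english_sentences, target_sentences) from FLoRes.
    relative_gains: lang -> relative improvement from English-only curation.
    """
    langs = sorted(flores_pairs)
    distances = [avg_cosine_distance(*flores_pairs[lang]) for lang in langs]
    gains = [relative_gains[lang] for lang in langs]
    return pearsonr(distances, gains)  # the post reports r ≈ -0.62 (p = 0.024)
```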
4.1.3 Optimal multilingual performance requires bespoke multilingual curation
While curating English consistently improves cross-lingual performance, it is not sufficient to reach optimal performance in any given target language. Figure 2 shows a persistent gap in performance when non-English data is curated (purple) versus when it is not (light and dark blue). Moreover, we note that the magnitude of improvements from curating for each language individually far exceeds the benefits associated with only curating English data. This finding, while not necessarily surprising, further emphasizes the need to pay careful attention to data quality and curation for each language individually. This observation reinforces the argument by [50] that generic, English-centric heuristics will not generalize across diverse alphabets and scripts. These results are also consistent with [21], who posited that it is the presence of noisy, uncurated data which harms multilingual models rather than an issue of model capacity. That is, the issue "resembles a curse of data quality" rather than a curse of multilinguality.
4.1.4 Improved non-English data curation also benefits English capabilities
Prior sections highlighted that while English curation improves non-English performance, optimal non-English performance comes from careful curation of both English and non-English data. In this section we study the effect of multilingual data curation on English performance. In contrast to the concern that adding non-English data must come at the expense of English capabilities, we observe that the benefits of data curation are bidirectional: our results demonstrate that improving the quality of the non-English data component also yields consistent gains on English benchmarks. Figure 4 compares the performance on English tasks (MMLU and ARC Average) between models trained with uncurated versus curated non-English data, while keeping the English data constant (we use curated English data throughout). The results demonstrate an average relative improvement of 1.2%. Concretely, in 12 out of the 13 languages considered, the bilingual model trained with fully curated data (i.e., both English and non-English data curated) outperformed the version with uncurated non-English data.
At a high level, these findings suggest that high-quality data acts as a globally beneficial signal in model training, providing a means to mitigate the "curse of multilinguality" by systematically improving the quality of data for each individual language.
4.2 The efficacy of translation as augmentation is determined by source quality
Machine translation has increasingly surfaced as a viable strategy for enhancing multilingual model performance, serving as a scalable source of synthetic data [42, 51]. Prior efforts, such as FineTranslations [22], have successfully utilized large-scale translation pipelines to map multilingual content into English with an explicit focus on improving translation capabilities. In this work, we instead investigate the effectiveness of translation as a general tool to drive overall multilingual capabilities. We demonstrate that this strategy can indeed drive performance gains; however, its effectiveness is heavily contingent on source data quality. Our results indicate that translating only high-quality documents, selected via score filters, leads to markedly better improvements.
To understand how best to leverage translation as a tool within multilingual data curation, we conducted controlled experiments on three languages: Hindi, Bengali, and Arabic. We trained 3B-parameter models for 60B tokens, keeping the English component fixed and curated, while varying the non-English strategy. We compared three approaches: (1) an uncurated baseline, (2) augmenting the baseline with translations of randomly selected English data (i.e., blind translations), and (3) augmenting with translations of high-quality, scored English data. When scoring English data, we used a fastText classifier similar to [69].
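As an illustration of the score-filtering step, the sketch below shows how a fastText quality classifier could be used to select English documents as translation sources. The model path, label name, and threshold are hypothetical placeholders, not the exact classifier or cutoff used in our experiments.

```python
import fasttext

# Hypothetical quality classifier in fastText format; path, label, and
# threshold below are illustrative placeholders.
clf = fasttext.load_model("english_quality_classifier.bin")

def quality_score(document: str) -> float:
    """Probability the classifier assigns to the (placeholder) high-quality label."""
    text = document.replace("\n", " ")      # fastText expects single-line input
    labels, probs = clf.predict(text, k=2)  # top-2 labels with their probabilities
    return dict(zip(labels, probs)).get("__label__high_quality", 0.0)

def select_translation_sources(english_docs: list[str], threshold: float = 0.9) -> list[str]:
    """Keep only high-scoring English documents as sources for translation."""
    return [doc for doc in english_docs if quality_score(doc) > threshold]
```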
The results, illustrated in Figure 5, reveal a clear hierarchy of performance. Augmenting the uncurated baseline with arbitrarily translated English data yields only marginal performance improvements over a purely uncurated baseline. However, the magnitude of improvements grows when translating high-quality, score-filtered data. This result mirrors findings in [136], which showed that the quality of input documents for synthetic rephrasing is crucial to obtaining strong performance.
However, a significant performance gap remains. Our bespoke curation strategy (dark blue line) substantially outperforms both the uncurated baseline and the translation-augmented models. These findings imply that while translation can be a valuable component of multilingual data curation, as reported by [51], its efficacy is ultimately determined by source data quality, and the best performance is obtained with a holistic curation approach across all target languages.
4.3 Integrating multilingual curation into a general pretraining mix
Sections 4.1 and 4.2 presented smaller-scale, controlled experiments intended to dissociate the impact of different multilingual curation choices. However, an open question is how such curation strategies scale to larger token budgets and models, and how multilingual curation interacts with general-purpose curation. To that end, we curated a 20T-token dataset intended for foundation model training and frontier capabilities across English, multilingual, code, STEM, and reasoning skills. The curation included generating over 8T tokens of synthetic data spanning English and non-English web text, code, and STEM. A random 1T-token subset of this dataset was used to train both 3B and 8B parameter models following a Llama architecture.
Multi-Phase Data Curriculum. To balance multiple diverse data streams, we implemented a multi-phase training curriculum that progressively increases the density of multilingual tokens. The mixture of tokens followed that used in the Trinity Large model [39]. The training process was divided into three distinct phases:
- Phase 1: 650B tokens with 5% multilingual data
- Phase 2: 250B tokens with 10% multilingual data
- Phase 3: 100B tokens with 20% multilingual data
Across the full training duration, this schedule resulted in an overall allocation of 7.75% of tokens to our multilingual curation pipeline, supporting 13 languages and thus resulting in an average of roughly 6B tokens per language. Despite this seemingly modest multilingual budget, our bespoke curation strategy allows models to obtain competitive performance across diverse languages spanning Latin, Cyrillic, Arabic, Indic, and CJK scripts.
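The 7.75% figure follows directly from the phase schedule; as a quick check:

```python
# Token-weighted multilingual fraction across the three curriculum phases.
phases = [(650e9, 0.05), (250e9, 0.10), (100e9, 0.20)]  # (tokens, multilingual share)

total_tokens = sum(tokens for tokens, _ in phases)                     # 1.0e12 (1T)
multilingual_tokens = sum(tokens * share for tokens, share in phases)  # 7.75e10 (77.5B)

print(multilingual_tokens / total_tokens)  # 0.0775  -> 7.75% of training tokens
print(multilingual_tokens / 13)            # ~5.96e9 -> roughly 6B tokens per language
```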
Establishing a New Pareto Frontier. Figure 1 positions DatologyAI models against several open-weights models across English (MMLU + ARC Average) and three multilingual benchmarks (Multilingual MMLU, ARC, and Belebele). The y-axis denotes the error rate (1 - Average Accuracy) on a log scale, where lower values indicate superior capabilities. The shaded gray region encapsulates the performance-compute trade-off established by leading open-weights baselines, including Qwen3 [129], Granite [5], SmolLM3 [6], and LFM-2.5 [141]. Our results demonstrate a marked shift in efficiency: the DatologyAI models consistently improve upon the established Pareto frontier. By achieving error rates comparable to significantly larger and/or more compute-intensive baselines, we effectively redefine the Pareto frontier for multilingual foundation models.
Data curation unlocks token efficient data mixtures. We emphasize that the DatologyAI models in Figure 1 use the smallest multilingual data mixture among all models that report mixture composition. Specifically, DatologyAI allocates only 7.75% of training tokens to multilingual data while supporting a substantially broader set of languages. This proportion is markedly lower than those reported by comparable baselines, including LFM, which uses 20% multilingual tokens [141], and SmolLM3, which employs a 12% multilingual mixture [6].
In Appendix A.3, we report per-language performance breakdowns for each evaluation. We present results across three groups of languages: Latin-script languages (Figure 6), Indic and Arabic (Figure 7), and Chinese, Japanese, Korean, and Russian (Figure 8). Across all 13 languages, DatologyAI’s multilingual curation consistently improves the performance–compute Pareto frontier, yielding higher accuracy at a given FLOP budget (or comparable accuracy with less compute). We also present results comparing to various language-specialized base models, i.e., models that focus on achieving strong performance on particular languages. Examples include the Sarvam-1 model [96], focused on Indic languages, in Figure 7; Trillion Labs Tri-7B [140], focused on Korean, Japanese, and Chinese, in Figure 8; and SEA-LION-v3-9B [97], focused on Southeast Asian languages, also in Figure 8. The specialized models also significantly improve upon the Pareto frontier, but their performance is comparable to that of the DatologyAI models on the particular languages they focus on: for example, Figures 7 and 8 show that DatologyAI models can meet or exceed the performance of specialized models such as Sarvam-1 and Tri-7B, which are trained using similar or larger FLOP budgets.
The rightmost columns in Figures 6-8 illustrate the relationship between language-specific performance and aggregate multilingual proficiency. DatologyAI models consistently align with the line of unity, reflecting a data curation strategy that prioritizes broad multilingual parity over individual language optimization. In contrast, specialized models like Sarvam-1 and Tri-7B exhibit a clear departure from this trend, appearing above the line of unity for their target languages. However, their aggregate multilingual performance (shown along the x-axis) reveals a substantial degradation in overall capabilities. This highlights that these models have traded general multilingualism for localized expertise. Notably, models curated with DatologyAI achieve competitive results without necessitating such performance tradeoffs. Finally, Figure 9 visualizes performance on individual languages as a function of the number of training tokens in that language, for DatologyAI models and for the subset of evaluated models for which we could obtain reasonable estimates of per-language training tokens (we describe our methodology in Appendix A.4). This figure clearly visualizes the orders-of-magnitude improvements in per-language data efficiency obtained with DatologyAI curation.
Taken together, the results in Figure 1 and Figures 6-9 demonstrate that DatologyAI multilingual data curation is both highly effective and scales to the frontier model training regime. The latter point is reinforced by results from Trinity Large, which was pretrained on 17T tokens drawn from the broader DatologyAI-curated corpus and exhibits exceptionally strong multilingual performance.
5. Conclusion
Multilinguality is an essential capability for modern foundation models, yet achieving high-quality multilingual performance with broad coverage remains challenging due to uneven data availability across languages and the so-called "curse of multilinguality". In this work, we revisit multilingual pretraining from a data-centric perspective and show that many of the observed constraints and regressions are not inherent to multilinguality, but instead reflect deficiencies in data quality and curation.
Through controlled bilingual experiments, we demonstrate that cross-lingual transfer is strongly mediated by data quality: improving English curation alone yields consistent gains across nearly all non-English languages examined (3.91% average relative improvement across multilingual MMLU, ARC-Challenge, and Belebele), while improving non-English curation reciprocally benefits English performance (1.21% average relative improvement). These findings challenge a purely zero-sum framing of multilingual modeling: higher-quality training data can strengthen multilingual capability without requiring commensurate sacrifices elsewhere in the mixture.
English curation alone, however, is insufficient for optimal performance. Bespoke, per-language pipelines tailored to linguistic and distributional properties deliver substantially larger gains, reaching 16.87% relative improvement in controlled settings. We further show that translation is most effective when it preserves source quality: translating score-filtered English documents yields materially larger gains than translating arbitrary text, and integrating high-quality document translation as part of a holistic multilingual curation strategy yields far superior results overall.
We productionized these principles through a 20T-token general-purpose pretraining corpus, whose multilingual component was constructed using the curation strategies explored here. Under a controlled 1T-token training budget, 3B and 8B models achieve comparable or stronger multilingual performance than competitive open-weight baselines at 4–10× lower training compute, redefining the multilingual performance–compute Pareto frontier. These efficiency gains persist at frontier scale: Trinity Large Base (400B/A13B), trained on 17T tokens of this corpus, exhibits exceptionally strong multilingual performance relative to its FLOPs budget, validating that the curation principles described here remain effective in the multi–tens-of-trillions regime. We emphasize that for both the 1T-token training budget experiments as well as for Trinity Large, the multilingual performance is obtained using a comparatively minor multilingual token budget of 7.75% of total training tokens.
Several avenues for future work follow. Our results motivate more systematic, compute-aware mixture design, including per-language sampling strategies and phased curricula that balance improvements in one language against interference in others while ensuring adequate support for low-resource languages. Scaling this agenda will likely require more robust multilingual evaluation frameworks [98]. Finally, extending these data-centric principles to multimodal and vision–language model (VLM) training remains an important direction, where evaluation quality and coverage are also central bottlenecks [40].
In conclusion, viewed through the data-centric lens advanced in this work, multilinguality need not be a curse of scale, but instead an opportunity to leverage language-aware curation to achieve inclusive, capable foundation models.
Get in Touch!
If you're interested in pushing the bounds of what's possible with data curation, we're looking for talented Members of Technical Staff who have experience doing data research, building research tooling, translating science into products, and building scalable data products.
If you're interested in training multimodal and/or text models faster, better, or smaller, Get in touch!
Follow us on twitter for insights (and memes) about data!
Contributions and Acknowledgements
Core and technical contributors listed alphabetically.
- Core Contributors: Aldo Gael Carranza, Kaleigh Mentzer, and Ricardo Pio Monti
- Technical Contributors: Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Haoli Yin, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Luke Merrick, Maximilian Böther, Parth Doshi, Paul Burstein, Pratyush Maini, Rishabh Adaiga, Sid Joshi, Spandan Das, Tony Jiang, Vineeth Dorna, and Zhengping Wang
- Leadership: Bogdan Gaza, Ari Morcos, and Matthew Leavitt
- Acknowledgements: Liz Gatapia, Jacqueline Liu, Tiffanie Pham, Sylvia Hoang, Kylie Clement, Elise Clark
Appendix
A.1 Evaluation datasets per language
In the table below, Global MMLU is the dataset provided by [29], while Indic MMLU refers to the translation into Indic languages (available here). For ARC evaluations, we rely on evaluation datasets released as part of the Okapi framework [28]. This contains evaluations for the majority of the languages we consider, with the exception of Korean, Portuguese, Hindi, and Bengali. For Korean, we use the Ko-ARC evaluation [27], and for Portuguese we use the translated version provided by LumiOpen (available here). Finally, for Indic ARC evaluations we use Indic ARC (available here). In the case of Belebele, the original evaluation dataset supports all of our languages [24] (available here).
A.2 Details of FLOP budget computations for open-source models
We summarize the training compute (in FLOPs) for each open-source baseline reported in Figure 1. Throughout this work, we estimate training FLOPs using the standard approximation:
Total FLOPs ≈ 6 × N × D,
where N is the number of (trainable) parameters and D is the number of training tokens. In the table, B = 10^9 and T = 10^12. For MoE models, we use the active parameter count per token as N.
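As a minimal sketch, the helper below reproduces the two DatologyAI estimates quoted in Section 4 using this approximation:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6 * N * D estimate of training compute.

    n_params: number of trainable parameters (for MoE models, use the
    active parameter count per token); n_tokens: number of training tokens.
    """
    return 6.0 * n_params * n_tokens

print(f"{train_flops(3e9, 1e12):.1e}")  # DatologyAI 3B, 1T tokens -> 1.8e+22 FLOPs
print(f"{train_flops(8e9, 1e12):.1e}")  # DatologyAI 8B, 1T tokens -> 4.8e+22 FLOPs
```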
A.3 Per-language evaluation performance
Figure 6: Per-language performance vs. training compute for Latin-script languages. Rows correspond to Spanish, Portuguese, French, German, Vietnamese, and Indonesian. Columns 1–3 report performance on MMLU, ARC, and Belebele as a function of training FLOPs (x-axis). The rightmost column compares each model’s language-specific average score (y-axis) to its all-language average across multilingual evaluations (x-axis); the dashed line indicates parity (y = x).
Figure 7: Per-language performance vs. training compute for Indic and Arabic-script languages. Rows correspond to Hindi, Bengali, and Arabic. Columns 1–3 report performance on MMLU, ARC, and Belebele as a function of training FLOPs (x-axis). The rightmost column compares each model’s language-specific average score (y-axis) to its all-language average across multilingual evaluations (x-axis); the dashed line indicates parity (y = x).
Figure 8: Per-language performance vs. training compute for CJK languages and Russian. Rows correspond to Chinese, Japanese, Korean, and Russian. Columns 1–3 report performance on MMLU, ARC, and Belebele as a function of training FLOPs (x-axis). The rightmost column compares each model’s language-specific average score (y-axis) to its all-language average across multilingual evaluations (x-axis); the dashed line indicates parity (y = x). We note that there is no ARC Challenge evaluation available for Japanese.
A.4 Multilingual data efficiency gains
In this section, we quantify multilingual performance as a function of the training token count dedicated to each language. Conducting this analysis is challenging for several reasons. First, several of the evaluated models use their own tokenizers, which makes token counts imperfectly comparable across models. Second, precise per-language token counts are often unavailable for open-source models, and so we aim to estimate them as best we can. We nevertheless include this analysis because DatologyAI curation yields improvements in multilingual data efficiency that are large, often by orders of magnitude, so the qualitative conclusion is robust even under reasonable uncertainty in these estimates.
In this section we only include models for which we could obtain a reasonably reliable estimate of per-language tokens using public information (a rough worked tally follows the list). These models are:
- SmolLM3: This model used 12% multilingual data over 11T tokens, supporting a range of languages including Spanish, German, French, and Portuguese. We compute the number of tokens per language directly from the configurations that were publicly shared (available here).
- Llama-3.2: This model used 8% multilingual data over 9T tokens, supporting seven languages. This is approximately 100B tokens per language.
- Sarvam-1: This model was trained on a 2T-token Indic-language corpus, which contained 20% Hindi tokens and 10% Bengali tokens. This corresponds to 200B and 100B tokens for each language respectively.
- Trillion Labs 7B: This model was trained on a 2T-token dataset, 10% of which was multilingual with a primary focus on Korean. As such, we estimate this model was trained with approximately 200B Korean tokens.
- DatologyAI: As referenced in Section 4, DatologyAI models were trained for 1T tokens with a 7.75% multilingual component. This corresponds to a total of roughly 77.5B multilingual tokens across thirteen languages, or approximately 6B tokens per language.
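For the models above with roughly even per-language splits, the estimate reduces to a one-line calculation; a minimal sketch (all figures are approximations based on public information; SmolLM3 and Sarvam-1 are omitted because their per-language shares are uneven and taken from published configurations instead):

```python
# Rough per-language token estimates for models with (approximately) even splits.
def tokens_per_language(total_tokens: float, multilingual_fraction: float, num_languages: int) -> float:
    return total_tokens * multilingual_fraction / num_languages

llama_3_2  = tokens_per_language(9e12, 0.08, 7)     # ~1.0e11 -> ~100B tokens per language
trillion7b = tokens_per_language(2e12, 0.10, 1)     # ~2.0e11 -> ~200B tokens, treated as Korean
datologyai = tokens_per_language(1e12, 0.0775, 13)  # ~6.0e9  -> ~6B tokens per language

print(f"Llama-3.2 uses ~{llama_3_2 / datologyai:.0f}x more tokens per language than DatologyAI")
```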
Figure 9 visualizes performance across various languages as a function of the estimated number of training tokens in that language. We observe that DatologyAI curation yields orders-of-magnitude gains in token efficiency.
Figure 9: Per-Language Performance vs. Multilingual Training Tokens. We visualize the number of language-specific training tokens (x-axis, billions) and the average downstream performance across a range of models. We only include models where the number of tokens per language could be reasonably estimated based on public information. DatologyAI models were trained with only 6B tokens per language (7.75% multilingual overall). The plots demonstrate significant data efficiency improvements from DatologyAI curation compared to open-source baselines such as Llama-3.2 and SmolLM3, and language-specific models like Sarvam-1. We add an asterisk beside the name of all non-DatologyAI models to highlight that we estimated the number of tokens per language to the best of our ability based on publicly available information.
References