DatBench: Discriminative, Faithful, and Efficient Vision-Language Model Evaluations
NOTE: If you prefer a PDF, the arXiv version of this post can be found here.
Executive Summary
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, do not represent downstream use-cases, and saturate early as models improve; (ii) 'blindly-solvable' questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone.
Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize their fidelity and discriminability. We find that transformations such as converting MCQs to generative tasks reveal sharp capability drops of up to 35%. In addition, filtering blindly-solvable and mislabeled samples enhances the discriminative power of these evaluations, while simultaneously reducing their computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves a 13× average speedup (up to 50×) while closely matching the discriminative power of the original datasets. Our work provides a path towards evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
Hugging Face Datasets:
- DatBench: https://huggingface.co/datasets/DatologyAI/DatBench
- DatBench-Full: https://huggingface.co/datasets/DatologyAI/DatBench-Full
Code: https://github.com/datologyai/DatBench
1. Introduction
Empirical evaluation is the primary mechanism through which progress in foundation models is recognized, compared, and acted upon. As machine learning has shifted from narrow, task-specific systems to general-purpose vision-language models (VLMs) [20], benchmarks now play an outsized role: they define what counts as progress and directly shape how substantial computational and human resources are allocated. Evaluations are no longer a passive reporting tool but an active driver of research direction.
However, modern evaluation pipelines are increasingly misaligned with the behaviors they aim to measure. As model inputs span multiple modalities and outputs become increasingly generative and stochastic, benchmarks must better disentangle genuine capabilities from superficial heuristics and inherent variance [57]. While the evaluation of language-only models has received sustained methodological attention [54, 55], VLM evaluation remains comparatively under-examined.
Recent evidence suggests that this gap has become a serious liability. Existing VLM benchmarks suffer from pervasive data quality failures, including mislabeled or ambiguous examples, questions solvable without visual input, and heavy reliance on multiple-choice formats that are not representative of downstream use-cases and are vulnerable to spurious correlations [9, 21, 48, 14, 12]. These artifacts inflate reported accuracy, introduce a substantial noise floor, and reduce the signal-to-noise ratio of the evaluations. In such a regime, small improvements, often on the order of a few percent, are more plausibly explained by overfitting to benchmark idiosyncrasies than by real capability gains, rendering the research community vulnerable to hill-climbing on noise [58, 20].
At the same time, evaluation has become a major computational bottleneck. Running comprehensive VLM evaluation suites now consumes a nontrivial fraction of total development compute [60]. For example, during the development of OLMo3, nearly 20% of the total compute budget for the post-training phase was reportedly dedicated to evaluation alone [69]. This burden is amplified for VLMs by the dense visual token sequences required to represent high-resolution images and the extended reasoning traces at inference time, which can collectively exceed tens of thousands of tokens per example [38]. Detailed analyses indicate that much of this cost is spent evaluating samples that are either trivial, noisy, or weakly discriminative [3, 24].
In this work, we argue that the design of effective evaluation should be treated as a data curation problem. Rather than repeatedly constructing new benchmarks from scratch, we propose to systematically transform and filter evaluation data to maximize faithfulness, discriminative power, and efficiency. This perspective mirrors recent successes in training data curation, in which careful data transformation and selection has produced large gains in model quality and compute efficiency [19, 17, 52, 18, 62, 51, 73, 72, 79]. We show that the same principles apply, with similar impact, to evaluation.
Guided by this view, we define three desiderata for modern VLM evaluation datasets: (i) faithfulness: examples should genuinely require visual input and reflect intended downstream use cases; (ii) discriminability: examples should reliably separate stronger models from weaker ones; and (iii) efficiency: evaluation should maximize signal per unit of compute. These criteria expose four systematic failure modes in existing benchmarks and motivate targeted interventions (Section 3).
First, multiple-choice formats are both unfaithful and weakly discriminative in generative settings. Converting MCQs to open-ended generation reveals large hidden capability gaps. On AI2D, for example, average accuracy drops from 77.56% to 40.53%, with the strongest MCQ model losing nearly 35 points. When generative conversion is infeasible, circular evaluation [21] collapses chance baselines and exposes similar inflation effects.
Second, many VLM benchmarks can be solved without vision. By evaluating models with the image removed, we find that over 70% of samples in VQA-v2 can be answered correctly using language priors alone. Such examples fundamentally fail to measure multimodal reasoning.
Third, low-resolution inputs and inaccurate or ambiguous annotations introduce substantial noise. Using a multi-stage filtering pipeline, we discard up to 43.9% of samples in benchmarks such as MME-RealWorld (Autonomous Driving). In these instances, evaluation is confounded by factual labeling errors and indeterminable ground truths, where poor image quality renders the target objects unrecognizable even to a human observer, effectively precluding reliable performance assessment.
Fourth, existing evaluation suites are inefficient. By explicitly selecting items with high discriminative power across a diverse set of 1B–10B scale models, we achieve speedups of up to 50× (13× on average) while closely matching the discriminative power of full benchmarks using a small fraction of the data (Figure 1).
Applying these interventions, we introduce DatBench (Section 4), a curated suite of VLM evaluations designed to be faithful, discriminative, and compute-efficient. To construct it, we partition the large pool of existing datasets into nine fundamental VLM capabilities and release two resulting artifacts:
- DatBench, a high-efficiency subset for rapid iteration that provides a 13× speedup on average across all capabilities while increasing signal per sample.
- DatBench-Full, the full collection of high-quality samples remaining after excluding blind-solvable or objectively low-quality data.
Beyond efficiency, our work provides empirical insights across 27 state-of-the-art VLMs, revealing structural limitations that are invisible under conventional evaluation (Section 5). We show that inference-time scaling can actively degrade perceptual performance through an overthinking penalty, that current VLMs exhibit a sharp tension between high-level reasoning and low-level perception, and that language priors systematically mask true multimodal capability across popular benchmarks. Together, these resources and findings improve evaluation quality while dramatically reducing its cost, offering a path toward evaluation practices that keep pace with the rapid advancement of vision-language models.
2. Related Work
Faithful Evaluation. Recent research has identified significant issues with the validity of VLM benchmarks, prompting various mitigation strategies. To address inflated performance caused by high chance baselines in multiple-choice evaluations, several studies propose reformulating tasks into generative answer-matching settings [1] or employing circular evaluation techniques [21]. More broadly, prior work shows that ambiguous and hard-to-solve comparative prompts can systematically induce spurious preferences in models, meaning that the evaluation prompts themselves can become a hidden source of bias when they implicitly force a choice without sufficient grounding or context [56]. This further motivates interventions like circular evaluations and other option-robust MCQ protocols. Other efforts focus on statistical refinement of evaluation metrics. For instance, [3] apply Item Response Theory (IRT)-motivated weighting to account for item difficulty and discrimination beyond simple average accuracy.
Beyond these issues, multiple-choice formats are also misaligned with real-world VLM usage, where models are typically deployed in open-ended, generative settings rather than selecting from a small, predefined set of options. As a result, strong MCQ performance may overstate practical capability by rewarding option elimination or prompt-specific biases; MCQ-based evaluations systematically misrepresent model abilities by constraining outputs and failing to probe the generative behaviors that dominate real-world LLM and VLM deployment [6].
Additional analyses suggest that many VLMs can perform well on certain benchmarks without meaningfully leveraging visual input, calling into question whether such evaluations truly measure visual understanding or multimodal reasoning [28, 29, 31, 27, 30]. In contrast to approaches that seek to recover signal through post hoc statistical modeling, our method improves evaluation reliability at the source by enhancing data quality via systematic transformation and filtering of benchmark examples, building on both prior work and newly introduced techniques.
Efficient & Discriminative Evaluation. Efforts to improve the efficiency of model evaluation largely draw from (1) psychometric modeling, and (2) exploiting semantic structure in evaluation data. IRT-based methods [3, 24] model latent capability variables in order to estimate item difficulty and discrimination. In practice, however, these approaches typically require large, dense response matrices (many models evaluated on many items) to fit parameters stably. Without this scale, estimates can become highly sensitive to hyperparameter choices.
An alternative line of work leverages semantic structure. For example, [23] employ embedding-based clustering to select representative subsets, while Scales++ [22] relies on qualitative, rubric-driven segmentation of tasks. These approaches face notable limitations. Clustering outcomes are tightly coupled to the choice of embedding model, a significant concern given the lack of unified multimodal embeddings, while rubric-based methods are inherently labor-intensive and subjective.
More broadly, approaches that optimize solely for preserving model rankings suffer from an inherent limitation. As we show in Section 3.4, rank correlation saturates quickly and can often be achieved even by random subsets whose individual samples do not reliably discriminate between weak and strong models. Consequently, prioritizing rank stability risks overfitting to a fixed set of evaluated models without guaranteeing the quality of the underlying examples. Prior work [71] has also proposed aggregating heterogeneous evaluations via Plackett-Luce models, emphasizing ordinal rankings for their robustness to metric calibration issues. While this addresses the challenge of combining diverse measurements, it operates downstream of data quality: aggregating rankings over noisy or blind-solvable samples still propagates those artifacts into the final ordering.
In contrast to these approaches, we shift the focus from preserving global rankings to the targeted curation of individual samples. First, we systematically transform and filter evaluation data to resolve quality issues such as low resolution and labeling errors. Second, we employ a discriminative subset selection strategy that, unlike rank-preservation methods, identifies high-signal samples without requiring the large-scale model response matrices necessary for stable IRT parameter fitting.
3. The Making of DatBench
In this section, we present the methodology for DatBench, a framework designed to transform noisy, large-scale VLM evaluation suites into high-quality, discriminative benchmarks. Our approach systematically addresses four critical failures in current evaluation regimes: (1) signal dilution in Multiple Choice Questions (MCQs), (2) examples solvable without visual context, (3) incorrect, ambiguous, or low-resolution samples, and (4) prohibitively high computational costs. Collectively, the first three interventions enhance the faithfulness and discriminative power of the evaluation data, while the fourth ensures the resulting benchmark is both efficient and discriminative.
Datasets & Capabilities. We define our goal as establishing a faithful, discriminative, and efficient evaluation for nine distinct VLM capabilities (see Fig. 2):
- Chart Understanding: extracting quantitative data and performing trend analysis on bar charts, pie charts, line graphs, and infographics;
- Document Understanding: parsing structured layouts and extracting key information from digital or scanned documents, with a focus on text-heavy visual processing;
- Scene OCR: recognizing and interpreting textual information found in natural environments, such as storefront names, street signs, and product labels;
- Math & Logic: solving multimodal mathematical problems, including geometry, physics mechanics diagrams, and complex logical puzzles;
- Spatial Reasoning: assessing the relative positions of objects and demonstrating a directional and physical understanding of 3D space;
- Grounding: identifying and localizing specific regions or objects referred to in text through bounding boxes or segmentation-style tasks;
- Counting: accurately enumerating specific objects across varied environments and overlapping visual contexts;
- Diagrams & Tables: interpreting grade-school diagrams and structured tables to extract data points and infer underlying relationships;
- General: performing high-level Visual Question Answering (VQA) based on holistic image descriptions and real-world scene comprehension.
To achieve this, we source a diverse pool of evaluation sets for each capability and apply our methodology to address problems (1)–(4), transforming them into refined, high-quality benchmarks. Table 1 details the specific dataset composition and selection rationale used to ensure broad coverage of image distributions across each capability.
Models. We leverage a diverse suite of 27 state-of-the-art models to evaluate and refine our benchmarks. The model families and their corresponding parameter sizes used in this study include: (1) Qwen3-VL (2B, 4B, and 8B Instruct variants, as well as 2B, 4B, and 8B Thinking models); (2) Qwen2.5-VL (3B and 7B Instruct variants); (3) Qwen2.5-Omni (3B and 7B multimodal versions); (4) InternVL3.5 (2B, 4B, and 8B Instruct variants); (5) InternVL3 (2B and 9B Instruct variants); (6) InternVL2.5 (2B, 4B, and 8B variants); (7) InternVL2 (2B, 4B, and 8B variants); and (8) Thinking & Specialist Models, comprising GLM-4.1V-9B (Base and Thinking), R-4B, SmolVLM2-2.2B, Phi-3.5-vision, and Gemma-3-4B-it. Using these models as a broad empirical base allows us to ensure our data-centric improvements generalize beyond any single model family.
For all experiments detailed in this study, model generation was standardized with a maximum output length of 4,096 tokens and the sampling configuration suggested by the corresponding model card or code repository.
3.1. MCQ Evaluations: High Noise, Low Fidelity
Problem: Chance Baselines and The Evaluation-Deployment Gap. Standard MCQ formats systematically overestimate model capability through two primary mechanisms: random guessing and task misalignment. First, multiple-choice questions introduce a non-trivial chance baseline (1/N for N options), allowing models to achieve inflated scores that add significant noise to performance metrics. This inflation is compounded when evaluating across multiple stochastic samples or models; the probability of an item appearing “solved” by at least one of M uniform random guesses grows rapidly as 1 − (1 − 1/N)^M. Second, there is a fundamental mismatch between evaluation and deployment: while most VLMs are used in generative contexts, MCQs merely test the ability to pick a candidate from a pre-defined list. This “closed-set” evaluation fails to capture the generative reasoning required for real-world tasks and allows models to rely on superficial shortcuts or linguistic priors within the options themselves [1]. As shown in Figure 3a, this creates a “perceived capability” bubble in which models appear proficient in MCQ formats while failing to produce the same answers in a fully generative regime.
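To make this inflation concrete, the short sketch below evaluates the chance-inflation term for a four-option MCQ; the specific values of M are illustrative (M = 27 mirrors the size of our model suite).

```python
# Probability that pure guessing "solves" an MCQ item at least once across
# M independent attempts (models or stochastic samples), for N answer options.
def p_solved_by_chance(n_options: int, n_attempts: int) -> float:
    return 1.0 - (1.0 - 1.0 / n_options) ** n_attempts

for m in (1, 5, 27):
    print(f"N=4, M={m}: {p_solved_by_chance(4, m):.3f}")
# N=4, M=1:  0.250
# N=4, M=5:  0.763
# N=4, M=27: ~1.000
```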
Solution: MCQ-to-Generative Transformation and Circular MCQ Evaluation. To bridge this gap, we adopt a two-pronged strategy to ensure measured performance reflects genuine visual reasoning. Wherever viable, we transform MCQs into open-ended generative tasks by removing candidate options and requiring the model to formulate a direct response. To score these free-form outputs without the brittleness of exact-string matching, we employ an LLM-as-judge (specifically Qwen3-30B [63], a cost-effective and capable judge) to perform semantic answer matching as in [1].
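As a rough sketch of the semantic answer-matching step, the snippet below shows one way such a judge could be queried; the prompt wording and the `query_judge` callable are illustrative placeholders rather than the exact protocol from [1].

```python
# Hypothetical LLM-as-judge semantic matching; the prompt and query_judge() are placeholders.
JUDGE_TEMPLATE = """You are grading a model's answer to a visual question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Does the model answer convey the same meaning as the reference? Reply "yes" or "no"."""

def semantic_match(question: str, reference: str, prediction: str, query_judge) -> bool:
    """Return True if the judge deems the free-form prediction equivalent to the reference."""
    verdict = query_judge(JUDGE_TEMPLATE.format(
        question=question, reference=reference, prediction=prediction))
    return verdict.strip().lower().startswith("yes")
```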
We illustrate the impact of this transformation in Figure 3a, which compares standard MCQ accuracy against our generative transformation across 27 models on the AI2D dataset. We observe a distinct non-linear relationship: while high-performing models (80%+ MCQ accuracy) show tighter convergence between generative and discriminative performance, lower-tier models exhibit a sharp drop-off in the generative setting. This confirms that for weaker models, traditional MCQ benchmarks often mask a fundamental lack of generative skill through random guessing and closed-set shortcuts.
For tasks where options are structurally necessary, specifically inherently discriminative questions like "Which of the following..." where generative conversion would alter the question's core intent, we implement Circular Evaluation [21]. By rotating option permutations across N passes and crediting a point only if the model identifies the correct answer across all rotations, we effectively collapse the chance baseline. As shown in Figure 3b across 27 models, circular evaluation yields a steeper-than-unity slope relative to vanilla MCQ. This slope captures the persistence of the “false floor” inherent in standard formats: while vanilla MCQs grant models a significant head start (often 20-30% accuracy) through random guessing and position bias, circular evaluation reveals that genuine reasoning capability remains near zero for these same models. The steepness of the curve illustrates that vanilla MCQ continues to significantly inflate perceived performance while true accuracy remains low (< 50%); it is only as models achieve high-level robustness that the two metrics begin to align. By stripping away this artificial inflation, we ensure the benchmark can accurately signal the transition from zero to genuine competence, a critical signal that is otherwise obscured by the noisy MCQ baseline.
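A minimal sketch of the circular scoring rule, assuming cyclic rotations of the options and a hypothetical `ask_model` callable that returns the chosen letter:

```python
from string import ascii_uppercase

def circular_mcq_correct(question: str, options: list[str], answer: str, ask_model) -> bool:
    """Credit an item only if the model selects the correct option under every cyclic
    rotation of the choices; uniform guessing then passes with probability ~(1/N)^N."""
    letters = ascii_uppercase[:len(options)]
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        prompt = question + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, rotated))
        predicted = ask_model(prompt).strip().upper()[:1]
        if predicted != letters[rotated.index(answer)]:
            return False
    return True
```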
Correcting this inflation is crucial: a benchmark is most valuable when it can accurately track the emergence of a new capability. By allowing MCQ formats to provide a “false floor” of performance, we lose the ability to signal when a model truly transitions from zero to non-zero capability. Ultimately, these stricter criteria ensure that DatBench provides a more faithful representation of genuine model competence by stripping away the artificial inflation inherent in traditional formats.
3.2. The Mirage of Visual Understanding
Problem: Language Priors are often all you need. A significant challenge in VLM evaluation is “blind solvability”, a phenomenon in which models correctly answer questions without visual input by exploiting the language priors encoded in their LM backbones. This phenomenon fundamentally decouples benchmark performance from actual multimodal reasoning, inadvertently rewarding models with stronger language priors rather than superior visual understanding. This creates a “mirage” of progress in which improvements to the vision encoder or cross-modal connector are masked by the overwhelming influence of the text-based backbone. Consequently, models with more capable LMs are often deemed to be stronger VLMs simply because they are better at guessing answers from context.
Solution: Filtering Blind-Solvable Questions. To ensure DatBench measures genuine vision-language integration, we systematically identify and remove samples that models can solve “blind.” We conduct a comprehensive evaluation where all 27 models in our suite are queried using only the text prompts from each dataset, without the corresponding image inputs. For each dataset, we visualize this in a histogram (Figure 4) where the x-axis represents the number of models answering correctly and the y-axis represents the fraction of the dataset solved at that model-frequency.
As shown in Table 2, blind-solvable questions typically fall into three categories: (1) World Knowledge, where the answer is physically or culturally standard (e.g., a mosquito’s four-stage life cycle); (2) Visual Stereotypicality, where models exploit the skewed distribution of attributes in natural images to predict answers without visual confirmation (e.g., toilets usually being white); and (3) Purely Symbolic Reasoning, where the question contains all necessary information for an LLM to solve via logic or arithmetic (e.g., counting digits in a range).
We employ a systematic thresholding strategy (τ) to define rejection criteria based on task format. For datasets with open-ended, generative answers where the probability of a model guessing the exact string is negligible, we set a strict threshold (τ = 1); any sample solved by even a single model without visual input is discarded (e.g., CharXiv-Descriptive). Conversely, for tasks with a constrained solution space—such as Multiple Choice Questions (MCQ) or specialized counting tasks—we set higher thresholds to account for the increased baseline of random guessing and language priors. This includes datasets like CountBench, where answers are concentrated at low integers, or specific questions in AI2D that feature a limited set of candidate solutions evident from the prompt (see Row 1 of Table 2).
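A sketch of this thresholding step, assuming a binary matrix of blind (text-only) correctness has already been collected for the 27-model suite; names and shapes are illustrative.

```python
import numpy as np

def filter_blind_solvable(blind_correct: np.ndarray, tau: int) -> np.ndarray:
    """blind_correct: (n_models, n_items) binary matrix of text-only correctness.
    Keep an item only if fewer than tau models solve it without the image
    (tau = 1 for open-ended answers; higher for constrained formats such as MCQ or counting)."""
    n_blind_solvers = blind_correct.sum(axis=0)  # models answering correctly with no image
    return n_blind_solvers < tau                 # boolean mask: True = retain the item
```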
As illustrated in Figure 4a for AI2D, the distribution shows a significant “tail” of questions solvable by nearly all models without an image. Even in more recent evaluations like CharXiv Descriptive (Figure 4b), a large fraction of samples are solvable through language priors alone despite the descriptive nature of the task. In the General capability, this issue is most acute: over 70% of examples can be answered without the image. By removing these samples, DatBench ensures the evaluation focuses on high-quality data where visual reasoning is mandatory for success.
3.3. Incorrect Ground Truth and Ambiguity
Problem: The Cost of Evaluative Noise. Despite significant curation efforts, many existing VLM benchmarks contain non-trivial proportions of examples with incorrect ground-truth labels, ambiguous questions, or insufficient image resolution to support the required reasoning. Such noise fundamentally compromises benchmark validity; when a dataset punishes a model for providing a correct answer that contradicts a flawed label, it obscures genuine capability gains and encourages “hill-climbing on noise”. Since we source DatBench from a massive aggregate pool of candidate datasets, we have a surplus of examples that allows us to prioritize rigorous data quality over raw quantity.
Solution: Two-Stage Quality Filtering with VLM-as-Judge. To identify and purge these artifacts, we employ a two-stage filtering pipeline. In the first stage, we flag examples that all evaluated models (1–10B parameters) answer incorrectly. Unanimous failure across a diverse suite of state-of-the-art models typically indicates either a data quality issue or a genuinely difficult frontier case, both of which warrant closer inspection. In the second stage, a strong VLM judge (GPT-5.2) verifies each flagged sample with access to the ground-truth answer as privileged information.
Our choice of a frontier model as a judge is motivated by prior work suggesting that models are significantly stronger verifiers than generators [33, 34, 35, 32]; we therefore expect the judge to accurately identify errors even in cases that are too challenging for contemporary models to solve autonomously. Given that we operate in a regime of abundant evaluation data across our 9 capabilities, we intentionally err on the side of caution. We adopt a stringent filtering policy, discarding any item flagged as (1) ambiguous, (2) incorrectly labeled, or (3) unsolvable due to insufficient resolution, ensuring that the resulting DatBench subset represents only the highest quality of evaluation data. The impact of this pipeline is most visible in the Spatial capability, which exhibits a 42.07% discard rate, primarily due to insufficient resolution in “in-the-wild” images. Similarly, complex expert-authored sets like ChartQA Pro (17.2% removed) and MMMU-Pro (24.3% removed) show significantly higher noise rates than standard benchmarks (cf. Appendix D in the paper for per-dataset and per-capability counts of filtered examples). While these high attrition rates reflect significant noise in frontier evaluations, we recognize that a judge might occasionally misinterpret specialized, valid reasoning as a data defect. To maintain evaluative headroom, we retain only the subset of these examples that the judge explicitly validates as correct and unambiguous. Our aggregate data surplus allows us to prioritize this high-fidelity subset, accepting the risk that a conservative filtering policy may sacrifice some valid frontier samples to ensure the remaining benchmark remains strictly noise-free.
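A compact sketch of this two-stage filter; the `judge_sample` callable stands in for the frontier-VLM judge (which sees the ground truth as privileged input), and its label set is an assumption.

```python
import numpy as np

def two_stage_quality_filter(correct: np.ndarray, samples: list, judge_sample) -> list:
    """Stage 1: flag items that every evaluated model answers incorrectly.
    Stage 2: keep a flagged item only if the judge verifies it as valid.
    correct: (n_models, n_items) binary matrix on the full image + text task;
    judge_sample: hypothetical callable returning "valid", "ambiguous",
    "mislabeled", or "low_resolution"."""
    flagged = np.flatnonzero(correct.sum(axis=0) == 0)
    keep = set(range(len(samples))) - set(flagged.tolist())
    keep |= {i for i in flagged if judge_sample(samples[i]) == "valid"}  # verified frontier items
    return [samples[i] for i in sorted(keep)]
```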
3.4. High Discrimination with Limited Compute
Problem: The Computational Burden of Comprehensive Evaluation. As VLMs grow in sophistication and expand their set of capabilities, comprehensive evaluation imposes a prohibitive computational burden. This is exacerbated by the emergence of “thinking” models; for instance, Bai et al. [38] utilize inference-time compute scaling, often generating chains-of-thought exceeding 32K tokens. Consequently, evaluating a single capability like OCR (often containing > 100K examples) can require generating over 3 billion tokens, an untenable cost for iterative research.
Selecting a representative subset of examples is a natural approach to reducing evaluation costs. The intuitive heuristic for such a selection is to preserve the model ranking induced by the full dataset, typically quantified using rank correlation measures such as Spearman’s ρ or Kendall’s τ [65, 66, 67]. While rank preservation is a necessary condition for a representative subset, it is theoretically insufficient: rank correlation is agnostic to which specific samples are retained. In practice, even random subsets can preserve global model rankings by retaining items that separate coarse capability tiers (e.g., small versus large models), while failing to retain the high-discrimination examples needed to distinguish models along the Pareto frontier. More broadly, methods that optimize solely for rank preservation face a fundamental limitation: rank correlation saturates rapidly and is often achieved by subsets whose individual samples are weakly or inconsistently informative about underlying capabilities [68, 66]. In such regimes, apparent ranking stability may be driven by spurious correlations or superficial artifacts rather than genuine reasoning ability.
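For reference, the rank-preservation check these methods optimize is typically computed as below; as argued above, a subset can score highly on it while its individual items remain weakly informative.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

def rank_preservation(full_scores: np.ndarray, subset_scores: np.ndarray):
    """Compare the model ranking induced by the full benchmark with that of a
    candidate subset; both arrays hold one accuracy per model, in the same order."""
    rho, _ = spearmanr(full_scores, subset_scores)
    tau, _ = kendalltau(full_scores, subset_scores)
    return rho, tau
```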
Instead, we turn to Item Response Theory (IRT), originally formalized in [36], for inspiration. IRT posits that items differ not just in difficulty, but in item discrimination, a parameter that determines how sharply an item distinguishes between subjects of varying ability levels [37]. However, directly applying standard IRT methodologies [3] to VLM evaluation is often infeasible due to the limited number of diverse observations available per sample in the current research landscape [22, 64]. Effectively fitting IRT models typically requires stable evaluations from hundreds of diverse state-of-the-art models; without this scale, IRT models become highly sensitive to hyperparameters and are notoriously difficult to fit stably.
Consequently, simply prioritizing rank stability risks overfitting to the evaluated model suite, without guaranteeing the quality or generalizability of the underlying examples. In effect, this produces a “coarse” measuring stick: it yields a subset that is discriminative enough to recover a specific ranking but lacks the resolution to generalize to unseen models or distinguish those with similar capabilities. Therefore, the core optimization problem is not merely to maintain ranking stability, but to maximize total discrimination. By ensuring every sampled example possesses high discriminative power, we can implicitly guarantee robust ranking while maximizing the information content per inference token.
Solution: Item-Discrimination Based Subset Selection. To avoid the instability of IRT models that are sensitive to hyperparameters and sample size, we operationalize item-discrimination using the point-biserial correlation (r_pb): a robust, hyperparameter-free measure of the association between a binary item response and continuous model capability. Intuitively, r_pb measures the extent to which success on a specific question acts as a proxy for global performance. An item with high r_pb is one that strong models consistently answer correctly and weak models consistently miss; conversely, a low or negative r_pb indicates a noisy item that fails to track with underlying capability. We define total discriminative power as the sum of the discrimination of each example (item).
We select subsets by prioritizing examples with the highest r_pb to maximize information density (cf. Appendix E in the paper). As demonstrated in Figure 7a, DatBench achieves approximately 90% of the total discriminative power using only 40% of the full dataset, whereas random sampling scales linearly and provides less than half that signal at the same budget. Notably, our selection curve peaks above 1.0 before the full dataset is included; this occurs because we intentionally deprioritize “anomalous items” at the end of our selection process. These are questions with negative r_pb where weaker models outperform stronger ones—likely due to spurious text-based correlations, prompt sensitivity, or test-set leakage—which effectively introduce noise into the evaluation.
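A minimal sketch of this selection criterion, using mean accuracy over the full pool as the continuous capability proxy (the exact proxy and budget handling are simplifications of our pipeline):

```python
import numpy as np

def point_biserial(item_correct: np.ndarray, capability: np.ndarray) -> float:
    """r_pb for one item: item_correct is a binary vector over models,
    capability a continuous per-model capability score."""
    p = item_correct.mean()
    if p in (0.0, 1.0):          # saturated or frontier items carry zero discrimination
        return 0.0
    m1 = capability[item_correct == 1].mean()  # mean capability of models that solved the item
    m0 = capability[item_correct == 0].mean()  # mean capability of models that missed it
    return (m1 - m0) / capability.std() * np.sqrt(p * (1.0 - p))

def select_discriminative(correct: np.ndarray, budget: int) -> np.ndarray:
    """correct: (n_models, n_items) binary matrix. Return indices of the `budget`
    highest-r_pb items; negative-r_pb (anomalous) items sort last and are excluded
    at any restricted budget."""
    capability = correct.mean(axis=1)          # simple capability proxy: overall accuracy
    rpb = np.array([point_biserial(correct[:, j], capability)
                    for j in range(correct.shape[1])])
    return np.argsort(-rpb)[:budget]
```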
While Figure 7b shows that both random and discriminative subsets rapidly achieve high rank correlation, this similarity is deceptive. Because our model suite contains distinct performance tiers (e.g., 1B vs. 8B), the global ranking is easily recovered even by uninformative samples. Rank correlation is thus a “low-bar” metric that saturates too quickly to reflect subset quality. By maximizing discrimination, DatBench provides a higher-fidelity instrument that remains sensitive to marginal capability gains and ensures that evaluation remains stable across unseen model architectures.
4. Introducing DatBench and DatBench-Full
By applying our four-stage pipeline of MCQ transformation (Section 3.1), blind-solvability filtering (Section 3.2), quality filtering (Section 3.3), and discriminative selection (Section 3.4), we transform noisy, redundant dataset aggregations into precise evaluation artifacts. These artifacts cover nine distinct capabilities: Chart Understanding, Document Understanding, Scene OCR, Grounding, Counting, Spatial Reasoning, Math & Logic, Diagrams & Tables, and General VQA. We release two versions of the benchmark to cater to varying computational budgets.
For the final DatBench subset, we execute steps 1 through 4. However, the discrimination-based selection in Step 4 naturally discards “frontier” examples—items that all evaluated models fail—as they offer near-zero discrimination by construction. To prevent benchmark saturation and ensure evaluative headroom for future models, we manually allocate up to 20% of the DatBench subsets for these valid frontier cases, specifically those verified by our VLM-as-judge as correct and unambiguous. This strategic inclusion ensures that DatBench maintains a high difficulty ceiling while remaining a robust instrument for measuring progress at the frontier of vision-language modeling.
- DatBench: Our primary, high-efficiency subset tailored for rapid iterative development. Constructed via item-wise point-biserial correlation (r_pb), this set maintains high ranking fidelity while minimizing inference costs. We explicitly retain a partition of verified, high-quality “frontier” examples—currently unsolvable by 1B–10B models—to ensure the benchmark remains an effective measuring stick as model capabilities scale.
- DatBench-Full: The complete aggregation of all high-quality samples remaining after our systematic filtering pipeline (Steps 1–3). While these sets include all examples validated as objectively high-quality, their scale varies significantly across capabilities based on the severity of the filtering required. For capabilities such as Counting and Spatial Reasoning, where high noise and blind-solvability rates resulted in massive attrition, DatBench-Full is comparable in size to the DatBench subset. However, for most capabilities, DatBench-Full evaluation sets are an order of magnitude larger, reaching up to 50× the size of their efficient counterparts. These sets are intended for extensive, fine-grained error analysis and serve as a comprehensive resource for deep-dive capability assessment.
Usage Guide. We recommend DatBench in high-iteration contexts such as training loops and ablation studies, in which compute costs for evaluation can rapidly balloon but discriminative signal should be maximized. DatBench-Full should be reserved for final model reporting where computational constraints are relaxed and maximum coverage is desired. Collectively, these artifacts transition multimodal evaluation from a regime of noisy data to one of precise measurement.
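Both artifacts can be pulled directly from the Hugging Face Hub; the config and split names below are assumptions for illustration, so consult the dataset cards for the exact identifiers.

```python
from datasets import load_dataset

# Efficient subset for rapid iteration (config/split names are assumed; see the
# dataset card at https://huggingface.co/datasets/DatologyAI/DatBench).
datbench = load_dataset("DatologyAI/DatBench", "counting", split="test")

# Full filtered pool for final reporting.
datbench_full = load_dataset("DatologyAI/DatBench-Full", "counting", split="test")

print(len(datbench), len(datbench_full))
print(datbench[0].keys())
```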
Having established these artifacts, we provide a comprehensive statistical analysis of how DatBench transforms raw benchmark data into faithful and discriminative instruments for the efficient estimation of VLM capabilities.
DatBench discards samples that are too easy / too hard. The most immediate impact of our filtering is the removal of samples that act as statistical noise. In the General capability, model performance is significantly shifted downward from the y = x diagonal (Figure 8), a direct result of Stage 2 filtering (cf. Appendix F in the paper), which discarded 72.07% of samples solvable via language priors alone. Conversely, the Spatial Reasoning capability underwent rigorous quality filtering in Stage 3 (cf. Appendix D in the paper), with 42.07% of samples removed due to ambiguity or insufficient resolution. This systematic removal of evaluative noise shifts model assessments to a more faithful performance tier, ensuring that benchmark outcomes accurately reflect genuine multimodal reasoning.
DatBench is more discriminative. Our item-selection methodology amplifies performance differences between models, increasing measurement resolution. On the original General benchmarks, models compress into a narrow 65–80% accuracy band; on DatBench, they spread across 10–65%, a nearly 4× expansion in effective score range. This “stretching” reflects our point-biserial selection criterion (Section 3.4): by retaining only items where strong models reliably succeed and weak models reliably fail, small capability differences that were previously masked now manifest as measurable gaps. The steep slopes observed in General and Document Understanding (Figure 8) confirm this effect; equivalent spacing on the original benchmarks translates to greater separation on DatBench.
DatBench preserves discrimination power with far fewer samples. Despite aggressive filtering, ranking stability is maintained. For capabilities such as Chart Understanding and Grounding, DatBench points fall almost perfectly on the y = x line (Figure 8), confirming that the subset preserves the discriminability and model rankings from the full dataset. As shown in our Stage 4 efficiency analysis (cf. Appendix E in the paper), DatBench maintains high total discriminative power even at severely restricted budgets, whereas random sampling suffers from linear signal degradation.
Limitations and Future Directions. While our methodology offers a substantial leap forward, several avenues remain for future exploration:
- Scaling to Larger Regimes: Our current analysis focuses on models in the 1B–10B parameter range and inference traces within standard context windows. While the methodology is scale-invariant, the specific subsets of highly discriminative questions will likely shift for larger models and extended inference budgets (e.g., exceeding the 4096 tokens used in our work). Future work can apply this pipeline to larger model families and longer reasoning traces to identify the discriminative frontier for state-of-the-art systems.
- Diversity Guarantees: Our current subset selection prioritizes the highest individual discrimination scores, which implicitly relies on the inherent variety of the source data rather than an explicit diversity constraint. Consequently, this objective does not formally account for redundancy between samples; in pathological cases (e.g., duplicate but highly discriminative examples), the selection could theoretically yield a degenerate or repetitive subset. While we mitigate this through rigorous initial curation, future iterations could incorporate explicit diversity-aware objectives to ensure broader coverage of the capability space.
- Expanding Capabilities: We aim to extend our capability map beyond static images to include long-form video understanding, UI/GUI grounding, and robotics perception.
- DatBench-Live: Finally, discrimination is a moving target; questions that distinguish today’s models will eventually become trivial. We envision a dynamic, “living” benchmark where subsets are recomputed periodically as new models shift the capability distribution and new datasets emerge.
5. Diagnosing VLM Pathologies with DatBench
In this section, we leverage the high-signal artifacts produced by the DatBench pipeline to diagnose the behavioral pathologies of modern VLMs. By analyzing performance across 27 state-of-the-art models spanning the 1B–10B parameter range, we uncover fundamental trade-offs between semantic reasoning and perceptual grounding, risks and rewards of inference-time scaling, and the impact of language priors on evaluation metrics.
Takeaway 1: Capability Correlations Reveal a “Reasoning vs. Perception” Trade-off. To identify hidden relationships between tasks, we calculated Pearson correlations (r) between all capability scores across our model suite (Figure 9a). We identify a tight Reasoning Cluster in which Chart Understanding, Math, and General VQA exhibit exceptionally high pairwise correlations, such as r = 0.90 between Chart and General tasks. This analysis confirms that General VQA benchmarks, such as MMBench and MMMU-Pro, primarily test abstract reasoning capabilities that are also fundamental to Math, evidenced by a strong correlation of r = 0.76 between these two domains. Furthermore, a distinct Spatial-Semantic Trade-off exists: Grounding correlates negatively with text-heavy tasks like Document Understanding (r = −0.29) and OCR (r = −0.19). These negative relationships, alongside the inverse correlation between Math and Spatial Reasoning (r = −0.19), suggest a latent conflict in current training paradigms between high-level semantic processing and low-level perceptual fidelity. Hierarchical clustering (Figure 9b) corroborates this dichotomy, revealing two distinct clusters: reasoning (Chart, Math, General) and perception (OCR, Spatial, Diagram).
Takeaway 2: Capability Profiles Reveal Specialist-Generalist Trade-offs. This trade-off manifests in distinct model archetypes (Figure 10). GLM-4.1V-9B acts as a perception specialist, leading in diagram understanding (66.4%) and spatial reasoning (36.8%) but struggling with math (17.4%). Balanced generalists are rare: Qwen3-VL-4B is a notable exception, maintaining strong document understanding (71.0%), OCR (77.9%), and reasoning (59.9%). Most tellingly, R-4B reaches the highest math score (43.4%) at the cost of the lowest spatial performance (11.4%), suggesting that (current) reasoning-focused training can degrade visual grounding.
Takeaway 3: The “Overthinking” Penalty: Inference-Time Scaling Degrades Perception at High Cost. Comparing Thinking models to standard counterparts reveals that extra test-time compute is a double-edged sword (Figure 11a). To quantify this, we define the Thinking relative advantage as the percentage gain in accuracy of the thinking model over its instruct counterpart, normalized by the instruct baseline: (Acc_thinking − Acc_instruct) / Acc_instruct × 100. Scaling helps Math (≈ +36.8%) and Charts (≈ +10.8%) but causes massive regressions in OCR (≈ −53.5%) and Document Understanding (≈ −47.8%). This regression is also extremely computationally wasteful: while correct thinking answers use ≈ 425 tokens, incorrect attempts balloon to ≈ 1196.9 tokens, a ≈ 14× increase over non-thinking models (Figure 11b). We observed that this is due to models entering unproductive thinking loops on perceptual tasks they cannot solve. Prior work has observed a similar overthinking penalty for language models [80, 81, 82, 83].
Takeaway 4: Language Priors Mask True Multimodal Performance across Capabilities. To isolate the actual visual requirement of each task, we analyze the vision delta (V∆), defined as the performance gap between standard multimodal evaluation and a blind text-only baseline (Figure 12). Our results show that reliance on language priors varies drastically by capability, often distorting perceived progress in multimodal reasoning. Capabilities such as Counting (V∆ = 60.2%) and Grounding (V∆ = 42.3%) exhibit high vision dependency, making them the most faithful indicators of true perceptual accuracy. Conversely, Math (V∆ = 13.0%) and Spatial Reasoning (V∆ = 14.9%) show significant language prior distortion, relying heavily on textual patterns that allow models to guess correctly without the image. These findings confirm that without the rigorous filtering introduced in DatBench, i.e. discarding samples that can be solved with the language prior alone, high scores in capabilities like Math may inadvertently reward stronger language models rather than superior vision-language integration.
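A small helper that makes the V∆ diagnostic explicit, assuming per-capability accuracies are available both with images and under the blind, text-only protocol; the 20-point threshold is purely illustrative.

```python
def vision_delta(acc_with_image: dict[str, float], acc_blind: dict[str, float]) -> dict[str, float]:
    """V-delta per capability: accuracy with images minus the blind text-only baseline.
    Small deltas indicate scores dominated by language priors rather than vision."""
    return {cap: acc_with_image[cap] - acc_blind[cap] for cap in acc_with_image}

def language_prior_dominated(deltas: dict[str, float], threshold: float = 20.0) -> list[str]:
    """Capabilities whose V-delta falls below an (illustrative) threshold."""
    return [cap for cap, d in deltas.items() if d < threshold]
```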
6. Conclusion
In this work, we addressed the dual challenges of data quality and computational cost in the evaluation of Vision-Language Models (VLMs). We introduced a framework of three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. We then applied this lens to expose four critical pathologies in existing benchmarks: multiple-choice formats are both unfaithful and weakly discriminative; many VLM benchmarks can be solved without vision; incorrect and ambiguous ground truth introduces substantial noise; and existing evaluation suites are inefficient. We used these insights to distill these benchmarks into high-signal evaluation suites.
Our primary contribution, DatBench, serves as a precise, psychometrically grounded instrument for measuring multimodal capability. Motivated by Item Response Theory (IRT) and operationalizing discrimination via point-biserial correlation (r_pb), we demonstrated that maximizing total test discrimination yields subsets that are not only computationally lightweight but also significantly more robust and generalizable than those derived via random sampling or simple rank correlation. Our accompanying analysis of “thinking” models and language priors further validates that DatBench is capable of surfacing nuanced behavioral insights that are often obscured in aggregate metrics. We release two versions of the benchmark, the efficiency-focused DatBench for rapid iterative development (yielding a 13× average speedup), and the comprehensive DatBench-Full for final reporting, to standardize comparison and accelerate progress at the Pareto frontier. Our work provides a path towards evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
Get in Touch!
If you're interested in pushing the bounds of what's possible with data curation, we're looking for talented Members of Technical Staff who have experience doing data research, building research tooling, translating science into products, and building scalable data products.
If you're interested in training multimodal and/or text models faster, better, or smaller, Get in touch!
Follow us on twitter for insights (and memes) about data!
Contributions and Acknowledgements
Core Contributors
Siddharth Joshi, Haoli Yin, Rishabh Adiga, and Ricardo Monti.
for fusing modalities, wrangling the datasets, and ensuring the evaluation pipeline didn’t hallucinate.
Technical Contributors
Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, and Zhengping Wang.
the ensemble of experts who maximized our few-shot performance and ablated every hyperparameter.
Leadership
Bogdan Gaza, Ari Morcos, and Matthew Leavitt.
the ground truth oracles who steered the project and prevented collective mode collapse.
Acknowledgements
Liz Gatapia (for incredible logo design), Jacqueline Liu, Tiffanie Pham, Sylvia Hoang, Jayla Lindsey, Kylie Clement, Elise Clark
the human-in-the-loop feedback that provided essential regularization and support.
References