Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

A commonly-overlooked lever on inference cost is how much the model says, and you can improve it substantially with pretraining data curation.

Written by DatologyAI

NOTE: If you prefer a PDF, the arXiv version of this post can be found here.

Abstract

Inference efficiency is typically pursued by shrinking the model: distillation, pruning, quantization, and sparse routing each lower per-token cost while treating token count as fixed. But output length has been inflating, and it is precisely the component the standard toolkit leaves untouched. Here, we argue that brevity is the missing inference-efficiency lever, and that pretraining data curation is a practical way to pull it: a model trained on concise, correct data learns to answer in fewer tokens; i.e. it has a lower Cost-of-Pass. We apply our VLM curation pipeline to the MAmmoTH-VL single-image subset, and compare models trained on our curated data, the standard MAmmoTH-VL data, and external open-weight frontier VLMs. On a controlled 20-evaluation set and 14 VLMs at 1B–4B activated parameters, we hold output length fixed with a per-model regression, separating brevity from quality, and price models in FLOPs per correct answer. Curation buys a 35× Cost-of-Pass advantage over the most verbose 4B comparator (Qwen3.5-4B) within ~1 pp of accuracy (0.41 vs 14.58 TFLOPs per correct answer; 0.691 vs 0.704 mean accuracy). Curation also buys a +17.55-percentage-point matched-length accuracy gain over the uncurated baseline, a gain that grows with model scale (from +16.7 pp at 1B to +21.2 pp at 4B). This brevity improvement concedes no quality: generic verbosity buys no accuracy at any capability or scale, and the window where reasoning-structured verbosity still earns its tokens shrinks from 4 of 8 capability groups at 2B to 1 of 8 at 4B. Per example, the concise model even reaches correct answers the verbose reasoning model misses, marking reasoning as a distinct curation target rather than something brevity gives up. Inference efficiency in this regime is a tokens-per-correct problem, and brevity is the lever that targets it directly.

Figure 1: The paper in one figure: brevity is the soul of inference efficiency, and data curation is how you enact it. On a frontier pool of open-weight 2B–4B vision-language models, our curated models (Datology, blue) answer in far fewer tokens than verbose comparators and cost far less per correct answer, with no loss of quality. Left: mean output length per response (the x-axis is broken to keep the cluster legible); the curated models emit ∼30 tokens per response where Qwen3.5-4B emits 1,284, a ∼40× gap. Right: error rate (1−accuracy) against FLOPs per correct answer; the curated models hold the low-error, low-cost corner of the frontier and reach a 35× Cost-of-Pass advantage over Qwen3.5-4B within ~1 pp of accuracy.

1. Introduction

“I have made this letter longer than usual because I have not had the time to make it shorter.”

— Blaise Pascal, Lettres provinciales (1657)

The conventional framing of inference efficiency is to shrink the model. The standard toolkit (distillation, pruning, quantization, MoE routing, speculative decoding, early-exit, and cascading) shares one implicit premise: the output is a fixed quantity, and the only question is how cheaply each token of it can be produced. Optimize FLOPs per token; accept the token count. The metric we use throughout is Cost-of-Pass: the compute, and ultimately the dollars, a model spends per correct answer (§3.1).

Shrinking the model is the usual move. A training-time gain can be cashed out along three fungible axes (speed, quality, and size) that together form the training-efficiency triangle. A result sold as an inference-efficiency win has almost always been cashed out as size: fewer activated parameters. This leaves a fourth axis untouched: output length, which has remained an outcome when it could be an objective. It also sits off the standard scaling-law axes of compute, parameters, and training tokens, so a model can ship with a 10× verbosity increase and no number on its model card will move, even as test-time compute becomes a scaling axis in its own right.

Pascal names what that premise misses. The long letter is the easy one to write; the short one takes work. Given the downward pressure on inference cost, the real question is not whether to pay for brevity but where: at decode time, where length penalties and sampling controls are bolted onto a model not trained to be brief, or once and up front, in the data the model learns from.

The inference bill is coming due regardless, because token generation is increasing quickly, both in per-generation length and total quantity: Epoch AI (2025) tracks roughly 2.2× annual growth in mean output length on non-reasoning models and roughly 5× on reasoning models, which now emit about 8× more tokens than their non-reasoning peers. OpenAI ships reasoning-effort as an explicit product dial, where moving from the default “medium” setting to “high” is a 1.6× bump in output tokens [34]. Sundar Pichai opened Google I/O 2026 with a token counter rather than a model: 9.7 trillion tokens per month in 2024, 480 trillion in 2025, 3.2 quadrillion today, roughly 330× growth in two years [26]. The per-token gains from the standard toolkit are being recycled into longer outputs and more queries, so total inference bills grow rather than shrink.¹

A meaningful slice of reported quality gains is bought with test-time compute. The “Price of Progress” analysis [12] finds that roughly half of GPQA-Diamond progress at the frontier tracks increased inference spend, not algorithmic improvement; controlling for hardware, algorithmic efficiency improves at about 3× per year, well below the headline 5–10×. [25] presses the same case from inside a frontier lab: as models get better at spending test-time compute, a single benchmark number says less each release, and capability should be reported against a token or cost budget. [1] show that in chat RLHF a purely length-based reward reproduces most of the apparent gain of RLHF over SFT: length is not a side effect of RLHF, it is most of what RLHF does in the helpfulness setting. Dubois et al. (2024) show that controlling AlpacaEval for length shifts rankings substantially, with open-source RLHF’d models dropping the most. The common thread is that output length accounts for gains usually credited to better algorithms or training, not to length itself.

So there is a second axis worth optimizing: how many tokens does the model spend to be correct? At fixed model size and fixed accuracy, halving the output length roughly halves the dollar cost and the wall-clock latency, and more than halves the cost for long outputs, once attention’s quadratic-in-length term is counted (§M.2). No quantization scheme or speculative decoder can give this, because those reduce cost per token; this reduces tokens per correct answer. And the two reductions compose: parameter count and output length are independent factors of the inference bill, so a 2× smaller model that is also 2× less verbose is 4× cheaper per correct answer, and a single curation move can drive both.
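The composition claim can be written out as a short worked equation. This is a sketch under the decode-dominated proxy the post adopts in §3.1 (cost per answer of roughly $2N\bar{n}_{\text{out}}$ FLOPs), so it ignores the attention term; the primed symbols $N'$ and $\bar{n}'_{\text{out}}$ are introduced here only for illustration, and accuracy is assumed unchanged.

```latex
\[
\text{cost per correct} \;\approx\; \frac{2\,N\,\bar{n}_{\text{out}}}{\text{accuracy}},
\qquad
N' = \tfrac{1}{2}N,\quad \bar{n}'_{\text{out}} = \tfrac{1}{2}\bar{n}_{\text{out}}
\;\Longrightarrow\;
\frac{2\,N'\,\bar{n}'_{\text{out}}}{\text{accuracy}}
= \frac{1}{4}\cdot\frac{2\,N\,\bar{n}_{\text{out}}}{\text{accuracy}}.
\]
```

Because $N$ and $\bar{n}_{\text{out}}$ enter as a product, a reduction in one multiplies with a reduction in the other instead of adding to it.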

Verbosity can be attacked at several layers: output-layer length penalties in decoding [4, 5], decoding controls and length-aware sampling [6, 7], RL-side length shaping [1, 8], and post-hoc summarization or two-pass generation (see §2 for the surveyed alternatives). This is typically where Pascal’s letter gets shortened, and where the time gets spent.

We also argue for inference efficiency through brevity, but achieved through data curation at the pretraining stage. A data intervention internalizes the inference-time intervention into the model: trained on well-curated data, the model needs no external length constraint at decode time, and any such constraint then composes with that data-internalized brevity instead of fighting a model that was never trained to be brief.

This work is a follow-up to DatologyAI Team et al. (2026), which introduced our VLM data-curation pipeline applied to the public MAmmoTH-VL single-image training corpus [22]. The models here are the same: DatologyAI models trained on the curated mixture, and models trained on the MAmmoTH-VL baseline, holding everything else fixed. We also evaluate against open-weights frontier models in the Qwen, InternVL, and Isaac families. The prior paper’s headline was the accuracy effect of curation: +11.7 pp on average across a 20-eval suite at matched compute, with a side observation that response FLOPs were also lower at every scale. What it left unresolved is where the inference-cost win comes from: does it flow from fewer tokens at fixed accuracy (brevity), more accurate tokens at fixed length (quality at matched verbosity), or both? Does verbosity convey any advantages? Are there meaningful differences between reasoning and non-reasoning verbosity? The present work addresses these questions directly, using the same 20-eval VLM pool spanning 1B–4B parameter models:

  • Curation cuts Cost-of-Pass. Datology 4B answers correctly for 0.41 TFLOPs against 14.58 for the most verbose 4B comparator, Qwen3.5-4B, within ~1 pp of accuracy (0.691 vs 0.704): a 35× gap. Per-token FLOPs are identical at fixed parameters by construction, so what separates the bills is tokens per correct answer (§3.1).
  • Accuracy holds at matched length. At the same output length on the same benchmark, the curated model is correct +17.55pp more often than its uncurated paired baseline (pooled across scales), and the effect grows with scale, from +16.7 pp at 1B to +21.2 pp at 4B (§3.2).
  • Cheaper per correct, even when less accurate. Against the verbose Qwen pool the curated model gives up 4–7 pp of length-controlled accuracy for the order-of-magnitude cost advantage; against both InternVL variants it wins on both axes. The Datology 4B vs Qwen3.5-4B pair can’t be length-matched at all (2.2% overlap in output length distributions): Qwen3.5-4B’s accuracy is bundled with its chain-of-thought verbosity (§3.3, §3.4).
  • Verbose tokens rarely earn their cost, and less so at scale. Across concise non-reasoning (Datology), verbose non-reasoning (Mammoth), and verbose reasoning (Qwen3.5) models matched at 2B and 4B, generic verbosity buys no accuracy anywhere, while reasoning-structured verbosity beats curation on a shrinking subset of the eight capability groups: 4 of 8 at 2B, 1 of 8 at 4B, where the concise model instead wins 4 of 8 (§3.4).
  • Reasoning is a separate curation target from brevity. Per example, the concise curated model reaches correct answers the verbose reasoning model misses, and vice versa. The two have different error sets, so curation does not simply compress reasoning into fewer tokens; reasoning is a separate capability to curate for (§3.4).

Together, these results demonstrate that brevity is a viable inference-efficiency lever, and that data curation at pretraining is a viable, currently-underweighted access path to it.

Pascal paid for brevity one letter at a time; curation pays once and collects on every generation thereafter.

The verbosity confound has been studied only in pieces, scattered across text-only chat, reasoning models, VLMs, and, increasingly, the industry’s macro cost picture. Five threads run through that literature, and our results plug into each: output length is climbing fast; longer responses are not reliably better ones, and in chat RLHF most of the apparent gain over SFT is length rather than content; reasoning models make the tradeoff concrete, coupling accuracy to a token bill the field is now learning to price; the same length confound almost certainly reaches VLMs, though it is barely measured there; and paying for brevity once, in the training data, is a form of amortized inference. Across all five, output length is an under-acknowledged confound in how model quality is read, and curation is a lever for it that almost no one is pulling.

2.1. Text length is inflating, fast

Model outputs are getting longer year over year, reasoning and non-reasoning models alike. Epoch AI (2025) put numbers on the trend: across benchmark questions, non-reasoning models grow their mean output roughly 2.2× per year and reasoning models roughly 5×. Their corpus is mostly text and reasoning models, but the inflation is broad enough that we take it as the backdrop for our vision-language results: cutting verbosity matters only because the field is busy inflating it.

2.2. In chat RLHF, length explains most of the apparent gain

Longer answers can look better without being better. [1] give the cleanest precedent for this length-confound: an apparent quality gain that, on inspection, is mostly a length gain rather than a content gain. In chat RLHF, a reward that depends on nothing but response length reproduces most of the downstream improvement of RLHF over SFT: the reward model inherits a length bias from the preference data it was fit on, so optimizing against it largely optimizes for length. Dubois et al. (2024)’s Length-Controlled AlpacaEval sharpens the point from the eval side: controlling for length raises agreement with the LMSYS Arena from 0.94 to 0.98, and a model’s drop under that control becomes a measure of how much its score was riding length rather than quality, a measure which affects open RLHF’d models most. [3] show the bias is both correctable and pervasive: length-residualization adds +3.11 to the RewardBench average across 33 reward models, and GPT-4-as-judge carries the same preference for verbosity, so it propagates wherever a judge is used. The shared claim, across all three, is that most of an apparent quality gap can turn out to be length. But the evidence is entirely chat-side and RLHF-shaped; whether the same artifact inflates quality numbers in vision-language models has not been tested.

2.3. Reasoning models, cost-per-correct, and the longitudinal picture

A cluster of 2025–2026 work measures the unit economics of verbosity directly. OckBench [9], which proposes per-token intelligence as an evaluation axis, finds that same-size 7B reasoning models with similar accuracy can differ 3.3× in tokens and 5× in latency, and coins the “Overthinking Tax” for the verbosity of distilled small models. “Decomposing Reasoning Efficiency” [10] separates accuracy from token cost across 25 models on CogniLoad [11], a synthetic reasoning benchmark with independently tunable chain-length, difficulty, and distractor density, and finds the two rankings only loosely aligned (Spearman ρ=0.63): verbalization overhead, the share of tokens a model spends beyond what the answer needs, varies roughly 9× across models and is only weakly tied to scale.

Nous Research’s “Measuring Thinking Efficiency in Reasoning Models” (Aug 2025) gives the clearest industry-side statement of the problem [33]: reasoning models spend hundreds of tokens on simple knowledge questions, and closed-weight frontier models are substantially more token-efficient than their open-weight peers.² The “Price of Progress” analysis [12] is the headline cost number behind this post: frontier dollars-per-benchmark fall 5–10× per year, but algorithmic efficiency improves only ~3×, and roughly half of GPQA-Diamond progress at the frontier is bought with additional inference spend rather than better algorithms. [13]’s Cost-of-Pass supplies one of the units we adopt, the expected dollars per correct solution, and shows that inference-time techniques such as majority voting and self-refine rarely justify their cost.

[25] makes the case normatively: as models get better at spending test-time compute, benchmark performance keeps climbing without a clear plateau, so a single accuracy number is increasingly uninformative and models should be reported as performance-versus-compute curves rather than scalars. He also notes that raw token counts are not comparable across models (tokenizers, speeds, and per-token costs all differ), which is precisely why we denominate cost in FLOPs-per-correct (§M.2).

Together these establish the units on which verbosity’s price can be read directly: accuracy-per-token and cost-per-correct. What none of them measures is whether pretraining data curation moves those units.

2.4. VLM-specific evidence, and the open gap

The VLM-side evidence is thinner, but it points the same way. [14] is the closest analog: as an MLLM’s caption grows, the model leans progressively more on its own generated text and less on the image, and longer captions hallucinate more. They also show that strong VQA performance does not imply strong detailed-captioning performance, because VQA evals implicitly control for length while captioning does not, the same asymmetry our quality measurement (§M.3) is built to respect. CREPE [15] makes the point adversarially: on 17 compositional VLM benchmarks, content-blind heuristics keyed on surface length rival CLIP-class models, so length alone games the benchmark. [16] show that hallucination-mitigation methods trade precision for recall on CHAIR (the standard object-hallucination metric for image captioning), so both are gameable by adjusting verbosity. And multimodal judges inherit the text-judge verbosity bias (Chen et al., 2024a), so any VLM eval routed through a judge carries the confound. Unfortunately there is no clean VLM analog to Singhal et al.’s result that a purely length-based reward reproduces most of the RLHF gain. We position §3.2’s length-controlled accuracy regression and hallucination decomposition to be that analog from the data-curation side.

2.5. Amortized inference: brevity as cached cognition

Cognitive science already has a name for paying a cost once so you needn’t re-derive an expensive answer every time: amortized inference [18]. The basic principle is that an intelligent system avoids re-deriving expensive inferences from scratch and instead learns a cheap, reusable approximation. Through that lens, a short correct answer from a well-trained model is amortized inference: the expensive deliberation happened during training, and the forward pass is the cached approximation.³ Data curation is the learning signal that decides what gets amortized: show the model concision and it learns to produce the answer directly rather than re-deriving it token by token at decode time. The everyday version of the same move is the shift from System 2 to System 1: a deliberate, multi-step computation, run often enough, compiles into a fast automatic response, the way a chess expert sees a strong move at a glance instead of searching for it.

This reframes the cost question. Test-time compute is un-amortized inference: the model paying for deliberation at every query, every user, every time. Curation amortizes that cost into the weights, paid once at training and spread across every generation the model later produces. It also turns inference cost into a training signal: where production usage shows the model repeatedly spending tokens on a recurring pattern, that pattern becomes a curation target for the next pass, and the next model is leaner still, a data flywheel in which deployment cost feeds back into the training set. And it meets the obvious Bitter Lesson objection [19] in its own terms: curation does not add hand-built deliberation, it amortizes the human-rater and prior-model compute already latent in the data, which is exactly the kind of cost a scalable method is meant to absorb.

The analogy flatters deep learning research in one respect. Biological amortization is mostly undirected: evolution compiles useful priors into reflexes over many generations, and within a lifetime the brain caches whatever the world happens to present, the way a hidden figure in a two-tone image, once resolved, is seen instantly and permanently from a single glance. Deep learning gets to amortize deliberately: we choose which inferences get compiled into the weights, across the whole training distribution, on every run. We are not waiting for evolution, or for the right experience to come along; we are running amortization on demand.

3. Results

3.1. Cost-of-Pass: thoughtful data curation is Pareto-dominant on inference cost

Inference efficiency is conventionally reported in dollars per million tokens. That prices the tokens a model emits but says nothing about how many it needs, or whether they are right. Two quantities actually drive the bill: cost per answer (the compute each response burns) and accuracy (how often that response is correct), and we report both throughout. We also summarize them with a single ratio, cost per correct answer (cost per answer divided by accuracy), which has the right units for what a correct answer costs. We treat it as a convenient aggregate, not the one true metric: dividing by accuracy prices a point of accuracy linearly, and whether that exchange rate matches an application’s value of being right is a judgment we leave to the reader.

We measure cost per answer in FLOPs, via the decode-dominated proxy of §M.2: a response of $\bar{n}_{\text{out}}$ tokens from a dense model with $N$ active parameters costs about $2N\bar{n}_{\text{out}}$ FLOPs, so cost per correct $= 2N\bar{n}_{\text{out}}/\text{accuracy}$. This proxy is faithful to the small dense VLMs we study, where decode compute scales cleanly with active parameters and output length, and a rough lower bound elsewhere: it omits attention, which is quadratic in sequence length, along with the bandwidth and memory limits that bite at longer sequences and larger or sparsely-activated models (§M.2). Because the omitted attention cost grows with output length, the proxy understates the verbose models’ cost the most, so the Cost-of-Pass gaps we report are conservative. This quantity is hardware-independent; a dollar translation under a self-hosted H100 roofline is in §M.2.
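As a concrete illustration of the proxy, the sketch below computes FLOPs per correct answer from the three inputs the formula needs. The function name and argument names are our own for illustration and are not taken from the post's pipeline; the example inputs are the Datology 1B figures quoted in §3.1 (1.0 B activated parameters, 32.4 mean output tokens, 0.638 mean accuracy). Like the proxy itself, it ignores attention and memory-bandwidth effects.

```python
def flops_per_correct(active_params: float,
                      mean_output_tokens: float,
                      accuracy: float) -> float:
    """Decode-dominated proxy: (2 * N * n_out) FLOPs per answer, divided by accuracy."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    flops_per_answer = 2.0 * active_params * mean_output_tokens
    return flops_per_answer / accuracy


# Datology 1B, using the rounded figures quoted in the text.
datology_1b_tflops = flops_per_correct(1.0e9, 32.4, 0.638) / 1e12
print(f"{datology_1b_tflops:.2f} TFLOPs per correct answer")  # prints 0.10
```

The result matches the 0.10 TFLOPs per correct answer the post reports for this checkpoint. Other rows may differ slightly in the last digit when recomputed this way, since the text quotes rounded parameter counts, lengths, and accuracies.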

Headline arithmetic: the curated model is cheapest per correct at every matched scale.

Table 1 reports FLOPs per correct across 14 open-weight VLMs at 1B–4B parameters, averaged over the 20-eval frontier suite (§M.1). Rows are ordered by FLOPs/correct ascending.

Table 1: The curated models cluster at the cheapest end of the Cost-of-Pass frontier. Datology and Mammoth rows are paired by scale (matched controls); at every scale the curated model is both cheaper per correct and more accurate than its control, and the curated checkpoints are three of the five cheapest rows. The one model that undercuts a curated checkpoint on cost, InternVL3-2B (0.21 vs Datology 2B’s 0.27 TFLOPs), does so only by giving up ~10 pp of accuracy (§3.3). FLOPs computed as $2N\bar{n}_{\text{out}}/\text{accuracy}$ per §M.2.

Both curation and verbosity are demonstrated paths to better models: curation produces Pareto improvements (cheaper and more accurate); verbosity produces accuracy gains at the cost of more tokens. The matched-scale pairs measure the magnitude of the first; the frontier comparators show what the second costs.

At every paired scale, the curated model is both cheaper per correct and more accurate than its Mammoth control. At 1B, the curated model costs 2.2× fewer FLOPs per correct (0.10 vs 0.23 TFLOPs) and is 13.3 pp more accurate. At 2B, 1.6× fewer FLOPs (0.27 vs 0.43) and 11.9 pp more accurate. At 4B, 2.3× fewer FLOPs (0.41 vs 0.96) and 14.2 pp more accurate.

Against the frontier-comparable external models, the curated 4B at 0.41 TFLOPs/correct sits below every comparator at its weight class: 2.7–3.2× below InternVL3.5-4B and Qwen3-VL-4B (both within 3 pp of its accuracy), and 35× below Qwen3.5-4B within ~1 pp of accuracy (0.691 vs 0.704). Notably, Datology 4B is cheaper per correct than Mammoth 2B (0.41 vs 0.43 TFLOPs) despite carrying more than twice the activated parameters — a direct demonstration that the brevity lever can overpower the capacity lever at this scale.

Verbosity is the bill.

The cost separation at fixed scale cannot be explained by per-token efficiency. At 4B activated parameters, the FLOPs per output token (=2N) are (near) identical across Datology, Mammoth, Qwen3.5, Qwen3-VL, and InternVL3.5. The variable that differs is the number of tokens emitted to reach a correct answer: 32.5 for Datology, 59.6 for Mammoth, 85.4 for InternVL3.5, 108.9 for Qwen3-VL, and 1,284 for Qwen3.5-4B.⁴ The 35× gap in FLOPs/correct between Datology 4B and Qwen3.5-4B is roughly the 40× ratio of their mean output lengths, attenuated by Qwen3.5-4B’s slightly higher accuracy. Verbosity, not capacity, is the cost variable at this scale.

Compounding: a curated 1B is 143× cheaper than Qwen3.5-4B.

The brevity lever and the capacity lever stack multiplicatively when both are pulled. Datology 1B at 1.0 B activated parameters and 32.4 mean output tokens lands at 0.10 TFLOPs/correct, the lowest in the table, 9.4× below Mammoth 4B and 143× below Qwen3.5-4B. The trade-off is roughly 2–5 pp of accuracy below the larger Datology checkpoints (0.638 vs 0.660 and 0.691 for 1B, 2B, and 4B, respectively), which a deployment may or may not be willing to absorb depending on the application. The compounding is what makes the curated 1B competitive: a small model trained on thoughtfully-curated data sits, in inference-cost terms, an order of magnitude below frontier 4B models.

FLOPs per correct already prices each model with its parameter count folded in. Two further views confirm the gap isn’t an artifact of that accounting: tokens per correct answer strips the parameters back out, isolating the share of the gap that is pure verbosity, and a dollar translation puts it in operating-cost terms.

Tokens per correct, and dollars: the same ~40× gap.

OckScore (OckBench’s per-token-intelligence framing, §2.3; [9]) places the same comparison on a hardware-free axis: tokens per correct answer.

Figure 2: OckScore: the curated models own the low-cost, low-error corner. Tokens per correct answer (x, broken to keep the cluster legible) against error rate (y); both axes run so that lower is better, placing the best models at the bottom-left. The curated checkpoints cluster there at ∼47–64 tokens per correct, while the verbose comparators fan up and to the right, with Qwen3.5-4B alone at ∼1,820, a ∼30–40× token gap at comparable error.

Datology 4B at ~47 tokens/correct compares to ~244 for Qwen3.5-2B and ~1,823 for Qwen3.5-4B (full per-model table in Table D.2), the same ~40× ratio the matched-scale view reports above. Because the dollar version of Cost-of-Pass is a constant-N affine transform of token counts, the ratio must transfer; what OckScore confirms is that the cost gap is not a hardware-pricing artifact.
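The tokens-per-correct arithmetic is simple enough to reproduce directly. The sketch below assumes tokens per correct is mean output tokens divided by mean accuracy, which is how the figures in this section line up; the function name is ours, and the inputs are the rounded values quoted in §3.1, so the last digit can differ slightly from Table D.2.

```python
def tokens_per_correct(mean_output_tokens: float, accuracy: float) -> float:
    """Mean tokens emitted per correct answer (parameter count stripped out)."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return mean_output_tokens / accuracy


# Rounded figures quoted in the text for the two 4B models.
datology_4b = tokens_per_correct(32.5, 0.691)   # about 47 tokens per correct
qwen35_4b = tokens_per_correct(1284, 0.704)     # about 1,820 tokens per correct
ratio = qwen35_4b / datology_4b                 # a bit under 40x

print(f"Datology 4B: {datology_4b:.0f} tokens/correct")
print(f"Qwen3.5-4B:  {qwen35_4b:.0f} tokens/correct")
print(f"ratio:       {ratio:.1f}x")
```

Because no parameter count appears anywhere in this calculation, the gap it shows is attributable to output length and accuracy alone.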

Translated to dollars under the §M.2 H100 roofline, the same ratios hold in operating-cost terms: a million correct answers cost about $1.34 from the curated 4B and $47 from Qwen3.5-4B (and $0.33 from the curated 1B), roughly the price of a soda versus a tank of gas. Full per-model figures and the roofline arithmetic are in §M.2; the dollar cost is linear in FLOPs/correct at fixed batch and bandwidth, so the dollar ratios equal the FLOPs ratios and the 35× gap is invariant to reasonable hardware assumptions.

Thoughtful data curation is Pareto-dominant on inference cost. At every matched scale, the curated model raises accuracy while lowering FLOPs-per-correct versus its uncurated control (1.6×–2.3× FLOPs reduction; 11.9–14.2 pp accuracy gain). At 4B, the curated model sits below every external comparator at its weight class — 2.7×–3.2× below InternVL3.5-4B and Qwen3-VL-4B at comparable accuracy, and 35× below Qwen3.5-4B within ~1 pp of accuracy.

3.2. Curation raises matched-length accuracy, more so at scale

The previous section’s cost claim rests on an assumption it hasn’t yet tested: that accuracy on the eval suite reflects accuracy at any output length. If curation’s accuracy gain over its uncurated baseline is really just shorter answers being easier to score, that gain (and the cost advantage built on it) is a metric artifact. To resolve this confound, we fit a length-controlled regression on the pairs of models trained on curated vs uncurated data. A separate worry, that brevity flatters the curated model against external models, is taken up in §3.3: at matched length the verbose Qwen models are more accurate than the curated model, not less, so its cost-per-correct edge isn’t bought with easier-to-grade short answers.

The setup holds every confound except curation fixed. Datology and Mammoth are paired at three scales (1B, 2B, and 4B). Within each pair: same base LLM, same VL training recipe, same context length, same eval pipeline. Only the pretraining data curation differs.

Curated models are correct 17.6 pp more often after controlling for length.

We fit a logistic GLM on per-model output tokens, benchmark fixed effects, and scale fixed effects, pooled across all three matched scales:

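\[
\operatorname{logit}\,\Pr(\text{correct}) = b_0 + c \cdot \texttt{is\_datology} + \operatorname{cr}(\texttt{length\_tokens},\ \mathrm{df}=4) + C(\texttt{benchmark}) + C(\texttt{scale})
\]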

on every per-example generation across the 18 grouped frontier evals (the 20-eval suite minus the two aggregate-only evals, BLINK and BrandID; §M.1), with standard errors clustered at the example level. The coefficient c is converted to an average marginal effect (AME) in percentage points: the expected change in the probability of a correct answer from switching the curated indicator on, averaged over the observed data, so the headline reads as a direct pp gap (Table 2). A detailed methodological description is in the appendix (§M.3).
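The step from a logit coefficient to an AME in percentage points can be sketched as follows. This is a minimal illustration and not the post's actual pipeline (that is described in §M.3): the function names, the `base_logits` input (each observation's fitted linear predictor with the curated indicator set to 0), and the example numbers are all hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def average_marginal_effect_pp(base_logits, coef):
    """AME of a binary indicator, in percentage points.

    For each observation, compare the predicted probability with the
    indicator switched on (eta + coef) against switched off (eta),
    then average the differences over all observations.
    """
    if not base_logits:
        raise ValueError("need at least one observation")
    diffs = [sigmoid(eta + coef) - sigmoid(eta) for eta in base_logits]
    return 100.0 * sum(diffs) / len(diffs)


# Hypothetical linear predictors and coefficient, for illustration only.
example_logits = [-1.0, 0.0, 0.5, 2.0]
example_ame = average_marginal_effect_pp(example_logits, coef=0.8)
print(f"AME: {example_ame:+.1f} pp")
```

Averaging over the observed data matters because the same coefficient moves the predicted probability less for examples whose baseline probability is already near 0 or 1, so a single logit coefficient does not map to a single pp gap.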

Table 2: Length-controlled accuracy gap for the curated indicator, pooled across the three matched scales. The linear-probability-model (LPM) cross-check refits the same specification with ordinary least squares and HC1 robust errors; the sensitivity row reruns the regression with output length measured in characters rather than tokens. Both agree with the logistic AME to within 0.4 pp. The GLM specification is in §M.3.

Reading.

At the same output length, on the same benchmark, the curated model is correct ~17.6 percentage points more often than its uncurated paired baseline, averaged across the three matched model sizes. Curation induces both brevity and accuracy: the curated models are short kings. The length-accuracy spline, the GLM’s smooth fit of accuracy against output length (Figure B.2), shows why the controlled and raw gaps nearly coincide: both models’ outputs cluster at the short, flat end of the curve, so length contributes almost nothing to the raw gap.

Curation imparts greater benefit for larger models.

Per-model-size regressions (one fit per matched scale, no scale fixed effect) reveal real heterogeneity: the curation effect is meaningfully larger at 4B (+21.2 pp) than at 2B (+15.6 pp), a ~6 pp gap that is tightly estimated at both scales, with 1B in between (+16.7 pp). This tracks §3.1’s matched-scale FLOPs-per-correct ratios, which are also largest at 4B (2.3×) and smallest at 2B (1.6×). Two views of the same scaling story.

Cost arithmetic and length-controlled accuracy track each other.

§3.1’s matched-scale cost arithmetic ($0.87 vs $1.40 per million correct answers at 2B, a 1.6× ratio) is the same claim seen through the dollar numerator. The Cost-of-Pass denominator is correct answers per generation (accuracy), so the +17.6 pp pooled AME is what makes the cost gap exist. Without the length-controlled accuracy edge, the matched-scale cost comparison would dissolve to “same accuracy at same length, same cost.” Instead, the curated model produces between 16 and 21 percentage points more correct answers per generation at every scale, at fewer tokens per generation, on the same compute budget.

The per-scale FLOPs/correct ratios (Mammoth ÷ Datology, from §3.1) and the per-scale length-controlled AMEs track each other across scales (Table 3).

Table 3: Cost and quality track each other across scales. Per-scale FLOPs-per-correct ratios (from §3.1) alongside the per-scale length-controlled AMEs. Both are largest at 4B and smallest at 2B.

Curation pays off more, in both quality and inference cost, at the larger scales, across two independent measurement axes.

Captioning quality holds, and the hallucination win is brevity.

The pooled AME folds in DetailCaps via an LLM-judge precision signal (§M.4); long-form captioning is the subset where correctness is hardest to pin down (captions are scored by judged precision and recall rather than exact match), so it warrants its own audit. On the controlled 2B pair, curation improves both DetailCaps precision (+0.018) and CAPability recall (+0.006) while cutting caption length by 59% and hallucinated mentions per caption by 19%. A length-versus-rate decomposition attributes the hallucination win to brevity rather than to grounding each claim more faithfully: per character, the curated model’s grounding is similar to Mammoth’s. The full audit, table, and decomposition are in Appendix A.

At the same output length, on the same benchmark, the curated model is correct ~17.6 pp more often than its uncurated paired baseline. The 35× cost advantage isn’t a length artifact — the length-controlled accuracy edge holds at every matched scale we tested, growing from +16.7 pp at 1B to +21.2 pp at 4B (95% CI on the pooled effect: [+13.3,+21.8]). Curation buys quality and efficiency at every matched scale, and its impact grows with model size.

3.3. Cheaper by the answer: cost-efficient against six of seven frontier-comparable models

The previous analyses showed that our curation substantially improves length-controlled accuracy relative to the baseline Mammoth data. But does the curated model’s accuracy edge survive against external frontier models trained under different recipes, with different verbosity priors? Here we widen the lens to the frontier-comparable pool of open-weight 2B and 4B external models, where everything but parameter count varies. Across that pool the curated model holds the cost-efficient Pareto front against six of seven comparators, giving up ground at fixed length only to models that cost several times more per correct answer.

Length-controlled accuracy: competitive at fixed length, decisive on cost.

We re-run the same length-controlled regression from §3.2, now pairwise against each external model, restricting to pairs whose output-length distributions overlap meaningfully (at least 10% of one model’s responses fall within the other’s [p10,p90] range). The Datology 4B vs Qwen3.5-4B pair fails that test at 2.2% overlap; the next subsection shows why that failure is itself a result.
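The overlap test described above can be sketched in a few lines. This is a minimal illustration, not the post's pipeline: the percentile helper, the function names, the toy token counts, and the rule that a pair passes if either direction clears the threshold are all assumptions made for the example.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def overlap_fraction(lengths_a, lengths_b):
    """Fraction of model A's responses whose length lies in B's [p10, p90]."""
    p10, p90 = percentile(lengths_b, 10), percentile(lengths_b, 90)
    inside = sum(1 for n in lengths_a if p10 <= n <= p90)
    return inside / len(lengths_a)


def lengths_comparable(lengths_a, lengths_b, threshold=0.10):
    """Assumed decision rule: keep the pair if either direction clears the threshold."""
    best = max(overlap_fraction(lengths_a, lengths_b),
               overlap_fraction(lengths_b, lengths_a))
    return best >= threshold


# Toy token counts, made up for illustration (not the post's data).
concise = [2, 3, 3, 4, 5, 6, 8, 12, 20, 40]
verbose = [300, 450, 500, 600, 626, 700, 900, 1200, 2000, 4000]
moderate = [3, 5, 10, 15, 25, 40, 60, 80, 120, 200]

print(lengths_comparable(concise, verbose))   # no shared length range
print(lengths_comparable(concise, moderate))  # substantial shared range
```

A pair like the toy `concise` and `verbose` lists shares no length range at all, which is the situation in which a length-controlled regression would have to extrapolate rather than compare.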

The average marginal effect (AME) is best interpreted as a controlled counterfactual, not a deployment number. A −7 pp AME against Qwen3.5-2B means that if both models were constrained to emit the same number of tokens, Qwen would be roughly 7 pp more accurate on the overlapping range. But the models are not deployed at equal length, and forcing equal length would erase the brevity that makes the curated model cheap in the first place. At their actual operating points the curated model produces more correct answers per FLOP (§3.1): the AME is the stress test, Cost-of-Pass is the bill. On that stress test the curated model wins outright against four of seven external models (both InternVL3.5 variants, InternVL3, and Perceptron) and trails three (the Qwen family) by 4–7 pp, every one of which it still beats decisively on cost, as Table 4 sets out. Figure B.3 plots the per-pair length-accuracy splines behind these AMEs.

Qwen3.5-4B has no concise operating point.

A length-controlled comparison needs both models to actually appear at the same output lengths; for Datology 4B and Qwen3.5-4B, they almost never do. The curated model’s median output is 3 tokens against Qwen3.5-4B’s 626, and only 2.2% of Qwen3.5-4B’s responses fall inside the curated model’s central [p10,p90] length range (Table D.3).

Per benchmark, the disparity is even sharper. On 3DSRBench, Qwen3.5-4B outputs are 1,276× longer than Datology’s; on AI2D, 1,222× longer; on MathVista, 868× longer; on MMBench, 716× longer. The two models don’t operate at comparable lengths in our data.

This is the verbosity claim made concrete. Qwen3.5-4B’s accuracy comes bundled with its chain-of-thought verbosity. There is no “concise Qwen3.5-4B” operating point in the data: to access Qwen3.5-4B’s quality, you have to pay the ~1,200-token-per-response inference bill. The §3.1 cost number (35× more FLOPs per correct answer than the curated 4B within ~1 pp of raw accuracy) is the direct consequence of that bundling. Asking “but what if you forced Qwen3.5-4B to be shorter?” is asking a counterfactual the data are silent on, because Qwen3.5-4B is never observed shorter in the operating mode it’s deployed in.

Cost-of-Pass: cheaper per correct against six of seven.

The accuracy gap is one half of the trade; cost is the other. Table 4 pairs the two at the same self-hosted H100 roofline from §3.1 ($2.5/hr, batch 64).

Recall that the AME is the accuracy gap at matched output length: positive means the curated model is more accurate at the same length, negative means the comparator is.

Table 4: Length-controlled accuracy and Cost-of-Pass against the frontier-comparable pool. Per comparator: the length-controlled AME (positive favors the curated model), the length-distribution overlap, and self-hosted FLOPs-per-correct at the §3.1 H100 roofline (cost ratio is comparator / Datology). The curated model is cheaper per correct than six of seven; only InternVL3-2B (smaller and shorter) is cheaper, at 10 pp lower fixed-length accuracy. The Perceptron-Isaac-2B row carries a low-overlap caveat (18%); 95% CIs for the AMEs are in Table D.4.

Against the Qwen family the curated model gives up 4–7 pp of length-controlled accuracy but costs 2.4–3.6× less per correct answer. Against the InternVL3.5 family — and, narrowly, Perceptron — it wins on both axes at once, 2–10 pp more accurate and 2.2–3.9× cheaper. The single inversion is InternVL3-2B: a smaller, shorter model that is 25% cheaper per correct but 10 pp less accurate at fixed length, the one comparator that buys its cost edge with accuracy the curated model declines to give up. (The §3.1 operating-point comparison against Qwen3.5-4B, excluded here for non-overlap, is the most lopsided of all: 35× cheaper per correct within ~1 pp of accuracy. The unconstrained per-example view of that pair is §3.4.)

One confound could undercut all of this: if the curated model’s shorter outputs came from refusing to answer more often, its length advantage would be a behavioral shift, not genuine brevity. It is not. A three-term decomposition of the mean-length gaps above attributes 98–100% of each to response-conditional brevity (shorter answers when the model does answer); the refusal-rate shift contributes at most 1.8%, often negative, and the curated model essentially never refuses on standard evals. The result survives a refusal-enriched stress test (Appendix C).
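The post does not spell out the three terms of its decomposition, so the following is only one way such a split could be set up, under stated assumptions. It assumes mean length is (1 − r)·a + r·f, with r the refusal rate, a the mean length of answered responses, and f the mean length of refusals, and it uses midpoint weights so the three terms sum exactly to the gap. All names and numbers are hypothetical.

```python
def mean_length(m):
    """Mean response length implied by refusal rate and conditional lengths."""
    return (1 - m["refusal_rate"]) * m["answer_len"] + m["refusal_rate"] * m["refusal_len"]


def length_gap_decomposition(base, curated):
    """Split mean_length(base) - mean_length(curated) into three additive terms.

    Midpoint (average-of-endpoints) weights make the split exact, because
    x1*y1 - x0*y0 == mean(x) * (y1 - y0) + mean(y) * (x1 - x0).
    """
    r0, a0, f0 = curated["refusal_rate"], curated["answer_len"], curated["refusal_len"]
    r1, a1, f1 = base["refusal_rate"], base["answer_len"], base["refusal_len"]
    r_mid, a_mid, f_mid = (r0 + r1) / 2, (a0 + a1) / 2, (f0 + f1) / 2

    brevity = (1 - r_mid) * (a1 - a0)        # shorter answers when the model answers
    rate_shift = (f_mid - a_mid) * (r1 - r0)  # refusing more or less often
    refusal_length = r_mid * (f1 - f0)        # shorter or longer refusals
    total = brevity + rate_shift + refusal_length
    return {"brevity": brevity, "rate_shift": rate_shift,
            "refusal_length": refusal_length, "total": total}


# Hypothetical inputs, chosen only to exercise the arithmetic.
base_model = {"refusal_rate": 0.010, "answer_len": 60.0, "refusal_len": 20.0}
curated_model = {"refusal_rate": 0.012, "answer_len": 30.0, "refusal_len": 15.0}

parts = length_gap_decomposition(base_model, curated_model)
brevity_share = parts["brevity"] / parts["total"]
print(parts, round(brevity_share, 4))
```

Under this setup the share attributed to response-conditional brevity is the first term divided by the total gap; a refusal-driven length advantage would instead show up in the second term.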

Against the frontier-comparable pool, brevity buys cost-efficiency without surrendering competitiveness. The curated model sits on the cost-efficient Pareto front against six of seven length-comparable external models; the lone exception trades accuracy away for its cost edge. The eighth external model (Qwen3.5-4B) admits no length-controlled comparison at all: it is never observed at a concise operating point.

3.4. Verbose tokens earn their cost on a shrinking window of capabilities

§3.1–§3.3 establish that the curated model is competitive on accuracy while emitting dramatically fewer tokens. The natural follow-up is the inverse question: when are verbose responses actually contributing to correctness, and when are they filler? We answer it with a triple dissociation across three conditions in the pool: the curated model (Datology, D) is concise and non-reasoning (~30 tokens per response); Mammoth (M), the uncurated MAmmoTH-VL baseline, is verbose (~60 tokens) but non-reasoning; and Qwen3.5 (Q) is both verbose (170 tokens at 2B, 1,284 at 4B) and reasoning-structured (chain-of-thought). We refer to the three by these initials throughout.

Our operational definition of “verbose tokens don’t contribute anything” is: on the same example, Q ≈ D on correctness. The verbose model produces no per-example accuracy advantage over the concise model. The dissociation distinguishes whether verbosity as such is the lever (M would beat D) or whether reasoning-structured verbosity is doing the work (Q would beat D and M, with M ≈ D).

We run the analysis on all 18 evaluations of the grouped frontier suite at the two scales where the dissociation is available: 2B (Datology 2B, Mammoth 2B, Qwen3.5-2B) and 4B (Datology 4B, Mammoth 4B, Qwen3.5-4B). How we binarize per-example correctness, define the pairwise deltas (Q−D, M−D, Q−M), and assign each (scale, capability) comparison a regime is set out in §M.6; the per-example contingency machinery is in Appendix D.

Generic verbosity (Mammoth) contributes nothing, anywhere.

Mammoth’s verbose-but-not-reasoning outputs never show a positive accuracy advantage over the curated model at any capability or scale: M−D is negative everywhere, from −2.6 pp at best (2B Captioning and Chart & Diagram) to −54.9 pp at worst (4B Referring & Grounding, where Mammoth largely fails to emit valid bounding boxes). This is the cleanest empirical version of the “tokens not paid for” claim: a model emitting 2–3× the curated model’s tokens, without structuring them as reasoning, recovers none of the accuracy gap and often opens new ones. Generic verbosity does not buy correctness.

Reasoning-structured verbosity helps on a window that shrinks with scale.

Qwen3.5’s chain-of-thought verbosity is a different story, but a shrinking one. At 2B it beats the curated model by ≥2 pp on four of eight capability groups (led by Math, +17.1 pp); at 4B that window collapses to one (OCR & Document, +2.9 pp), and on four groups the concise curated model instead wins by ≥2 pp, by as much as 9.6 pp on Referring & Grounding (Figure 3; full table in Table D.5). The shrink is concordant with §3.2’s length-controlled regression, where the curation effect is larger at 4B than at 2B (+15.6 pp at 2B, +21.2 pp at 4B); the capability-level attribution this section adds leaves OCR & Document as the lone domain where reasoning still pays its way (structured step-by-step transcription benefits from explicit intermediate steps).

Figure 3: Verbose reasoning loses its accuracy advantage at scale. Each capability pairs its 2B cluster with its 4B cluster, every cluster a trio of DatologyAI (D), Mammoth (M), and Qwen3.5 (Q) mean accuracy, so the scale shift reads within a capability. The label above each cluster names the regime, assigned by comparing the curated model against each verbose model at a 2-percentage-point threshold: concise wins when D beats both Q and M; reasoning helps when Q beats D while the non-reasoning M trails it (reasoning-structured verbosity earns its tokens where generic verbosity does not); and mixed otherwise. Capabilities are ordered left-to-right by how that regime moves with scale; the denoted region marks the five where the curated model gains ground from 2B to 4B. Reasoning’s winning window shrinks from four capabilities at 2B to one (OCR & Document) at 4B.
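The regime labels in Figure 3 can be read as a small decision rule. The sketch below is one reading of the caption, not the procedure from §M.6: it assumes "beats" means a lead of at least the 2 pp threshold, and that "M trails it" means M is below D by any margin. The accuracies are invented; only the deltas echo numbers quoted in this section.

```python
THRESHOLD_PP = 2.0  # the 2-percentage-point threshold named in the caption


def assign_regime(acc_d, acc_m, acc_q, threshold=THRESHOLD_PP):
    """Label one (scale, capability) cell from mean accuracies given in percent."""
    q_minus_d = acc_q - acc_d
    m_minus_d = acc_m - acc_d
    # Concise wins: D leads both verbose models by at least the threshold.
    if q_minus_d <= -threshold and m_minus_d <= -threshold:
        return "concise wins"
    # Reasoning helps: Q leads D by the threshold while M still trails D
    # (assumption: any negative M-D margin counts as trailing).
    if q_minus_d >= threshold and m_minus_d < 0:
        return "reasoning helps"
    return "mixed"


# Hypothetical accuracy levels; deltas mirror examples from the text.
print(assign_regime(acc_d=60.0, acc_m=55.0, acc_q=77.1))  # Q-D = +17.1
print(assign_regime(acc_d=70.0, acc_m=15.1, acc_q=60.4))  # Q-D = -9.6, M-D = -54.9
print(assign_regime(acc_d=50.0, acc_m=47.4, acc_q=51.5))  # Q-D inside the threshold
```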

Within-Captioning divergence: CAPability and DetailCaps point in opposite directions.

The Captioning row aggregates two evals with very different demands, and they point in opposite directions at both scales (Table 5):

Table 5: Within-Captioning divergence. CAPability (recall) rewards verbose enumeration; DetailCaps (precision) penalizes it. The group’s “Mixed” verdict averages a recall gain and a precision loss.

CAPability tests whether a caption mentions a set of ground-truth objects: a recall-style task where verbose enumeration helps, because more tokens are more chances to name the right object. DetailCaps tests whether the claims a caption makes are supported by reference captions: a precision-style task where verbose elaboration can hurt, because each additional claim is another opportunity for an unsupported assertion. The Captioning group’s net “Mixed / inconclusive” verdict at both scales averages a recall gain (+13.1 / +6.7) and a precision loss (−10.2 / −3.2). Treating “Captioning” as a single capability obscures this. A reader interested in caption recall should expect verbose reasoning to help; a reader interested in caption precision should not.

Per-example contingency: Q and D have different error sets.

Mean accuracy says which model is more often correct; it hides how the two disagree. A per-example contingency on shared examples shows that Q and D reach substantially different answers across nearly every capability: at 4B, Qwen3.5’s rescue rate (the fraction of D’s misses it gets right) exceeds 25% in every capability, and beats its own misled rate on 7 of 8. Yet mean accuracy still favors D on most of them, because when D’s accuracy floor is high, a small misled rate on the large “D-correct” base outweighs a large rescue rate on the small “D-wrong” base. The two models are genuinely complementary at the example level, but the concise model still ships more correct answers per query; the contingency view is the lens that matters for routing, ensembling, or conditional reasoning, not for the cost-per-correct framing of §1 and §4. The full rates, the reconciling identity, and a worked example are in Appendix D.
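A toy contingency makes the base-rate argument concrete. The sketch assumes the misled rate is the fraction of D's correct examples that Q gets wrong (the rescue rate is defined in the text), and it checks one identity that follows from those definitions: acc_Q − acc_D = rescue × (1 − acc_D) − misled × acc_D. Whether Appendix D states the identity in this exact form is not confirmed here, and the data are made up.

```python
def contingency_rates(d_correct, q_correct):
    """Accuracies plus Q's rescue and misled rates relative to D.

    d_correct, q_correct: equal-length lists of booleans, one per shared example.
    """
    pairs = list(zip(d_correct, q_correct))
    q_when_d_right = [q for d, q in pairs if d]
    q_when_d_wrong = [q for d, q in pairs if not d]
    acc_d = len(q_when_d_right) / len(pairs)
    acc_q = sum(q for _, q in pairs) / len(pairs)
    rescue = sum(q_when_d_wrong) / len(q_when_d_wrong) if q_when_d_wrong else 0.0
    misled = (sum(not q for q in q_when_d_right) / len(q_when_d_right)
              if q_when_d_right else 0.0)
    return acc_d, acc_q, rescue, misled


# Toy data: D is right on 8 of 10; Q rescues 1 of D's 2 misses
# and is misled on 2 of D's 8 hits.
d = [True] * 8 + [False] * 2
q = [True] * 6 + [False] * 2 + [True, False]

acc_d, acc_q, rescue, misled = contingency_rates(d, q)
identity_rhs = rescue * (1 - acc_d) - misled * acc_d
print(acc_d, acc_q, rescue, misled, identity_rhs)
```

In the toy data the rescue rate (50%) is twice the misled rate (25%), yet Q ends up 10 pp behind D, because the misled rate applies to the much larger set of examples D already gets right.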

Benchmarks that diverge from their capability group.

Five individual (scale, benchmark) cells sit >10 pp away from their capability-group mean. Two are the CAPability / DetailCaps split discussed above; the other three are PixMo Points at both scales and OCRBench at 2B:

  • PixMo Points (Referring & Grounding): Q−D = −24.6 pp at 4B and −9.6 pp at 2B, vs the group means of −9.6 pp and +0.8 pp respectively. The pointing-output task format (predicting structured <point x="..." y="..."> tags) is uniquely hostile to chain-of-thought reasoning models, which appear to talk themselves out of correct coordinates: the Q misled rate at 4B is 73.4%, the highest of any (scale, benchmark) pair. This is consistent with §3.3’s note that grounding tasks are exactly where the curated model’s concision is operationally most valuable.
  • OCRBench at 2B: Q−D=+15.7 pp vs the OCR & Document group mean of +4.5 pp. OCRBench’s free-form character-recognition style benefits from explicit reasoning more than the document-QA tasks (DocVQA, TextVQA) in the same group.

Full per-benchmark numbers (mean accuracy and conditional rates) are in the appendix (Table D.1).

Verbose tokens earn their cost on a capability window that shrinks with scale. Generic, non-reasoning verbosity (Mammoth’s 2–3× token count) helps nowhere: M−D is negative for every capability at both scales. Reasoning-structured verbosity (Qwen3.5’s chain-of-thought) helps 4 of 8 capability groups at 2B but only 1 at 4B, where the concise curated model instead wins 4 of 8. Per example, Q and the curated model have substantially different error sets (rescue rate ≥25% everywhere), yet D’s high accuracy floor keeps most of that from moving mean accuracy in Q’s favor. At scale, brevity-via-curation appears to remove the conditions under which verbose reasoning would have paid.

4. Conclusion

This work argued for inference efficiency through brevity, achieved through data curation, with our VLM pretraining curation as the case study:

What the analyses establish.

(1) Brevity saves money, by 35× per correct answer at 4B. Datology 4B produces a correct answer for 0.41 TFLOPs to Qwen3.5-4B’s 14.58, a 35× Cost-of-Pass gap within ~1 pp of accuracy (0.691 vs 0.704); within the matched-scale curated-vs-uncurated pairs, the curated model costs 2.2× / 1.6× / 2.3× fewer FLOPs per correct answer at 1B / 2B / 4B (§3.1).

(2) Quality holds at matched length, and the effect grows with scale. At the same output length on the same benchmark, the curated model is +17.55 pp more accurate than its uncurated pair (pooled across scales), and the per-scale effect rises from +15.6 pp at 2B to +21.2 pp at 4B: curation pays off more at larger scale, not less (§3.2).

(3) Generic verbosity buys nothing, and reasoning verbosity’s edge narrows with scale. The verbose non-reasoning model never beats the curated model on any of the 16 scale × capability cells, and reasoning-structured verbosity’s edge narrows from 4 of 8 capability groups at 2B to 1 of 8 at 4B (§3.4).

(4) Reasoning is a distinct route, not a cheaply-emulable one. Per example, the curated model reaches correct answers the verbose reasoning model misses, and vice versa. This suggests that curating for reasoning and curating for brevity are complementary levers, and that a model trained to be both concise and a stronger reasoner is the natural next build (§3.4).

Takeaway.

We argue for inference efficiency through brevity, achieved through data curation. The standard toolkit (quantization, distillation, MoE, speculative decoding) works on the per-token cost and treats output length as a fixed input. Output length is the bigger lever in practice, and curation pulls it without a quality tax: at the scales we measured, the same curation move yields shorter outputs and higher matched-length accuracy, the advantage reaching a 35× Cost-of-Pass gap at 4B and the accuracy effect largest at the largest model size we tested.

Per-token serving cost keeps falling through hardware and software gains, but per-correct cost is not following it down, because cheaper tokens induce more demand (the Jevons rebound) on two fronts: longer reasoning chains within a query, and more queries across users. The within-query front is already visible across the current open-weight 2B–4B VLM pool, where mean output length spans roughly two orders of magnitude and the most expensive comparators by Cost-of-Pass are systematically the most verbose (§3.1). Brevity-per-correct removes that front: holding quality per task fixed pins the within-query token count, leaving demand induction with one front instead of two. Inference efficiency in this regime is not a per-token-cost problem; it is a tokens-per-correct problem, and brevity is the lever that targets it directly. Curation is how you pull it, and the reason it works is amortization: the cost of concision is paid once, at training, and compiled into the weights rather than re-paid at every decode. That points forward, too. Production usage can reveal where a deployed model still over-spends tokens; each such pattern is a target for the next curation pass; and the loop tightens with every iteration, until amortized inference becomes continual learning.

Pascal apologized for the long letter because he had no time to make it shorter. Brevity is work, and a model trained on verbose data is in Pascal’s position at every generation, with no time to shorten any answer. There are two places to find that time: decode-time methods buy it back one query at a time, shortening each letter as it is written, while curation buys it once, before the model ships, with the saving amortized across every answer the model will ever produce. Pascal paid for brevity by the letter; curation pays for it by the corpus, and collects on every generation thereafter.

Contributions and Acknowledgements

Core Contributors: Matthew Leavitt and Sid Joshi
for making this letter longer than usual.

Technical Contributors: Haoli Yin, Rishabh Adiga, Haakon Mongstad, and Alvin Deng
the giants, whose shoulders proved eminently standable.

Leadership: David Schwab, Bogdan Gaza, Ari Morcos
for watching the horizon while the rest of us watched the loss curves.

Acknowledgements: Kaleigh Mentzer, Luke Merrick, and Pratyush Maini
for thoughtful feedback that made it shorter AND better.

M. Methodology

M.1. The model pool & eval grid

The model pool is built to do two things at once: matched-scale controlled pairs that isolate the curation effect at each size, and a set of frontier comparators that place it in context. We include 14 models:

  • Datology and Mammoth at 1B, 2B, and 4B activated parameters (six internal models), the controlled matched-scale comparison. At each scale the pair shares an identical backbone, VL recipe, and 4k context; only the pretraining data differs. Each matched pair isolates the curation effect from confounds of architecture or training procedure, and spanning three scales lets us measure how that effect varies with model size.
  • 8 frontier open VLMs at 2–4B activated parameters: Qwen3.5-2B, Qwen3.5-4B, Qwen3-VL-2B-Instruct, Qwen3-VL-4B-Instruct, InternVL3-2B, InternVL3.5-2B, InternVL3.5-4B, Perceptron-Isaac-0.2-2B-Preview.

For internal models with multiple training seeds, seeds are pooled at the per-example level before any per-model aggregate is computed: seed-average the per-example signal, then aggregate to per-model. This keeps the per-model statistic conservative, since variance across seeds shows up as wider example-level distributions rather than an inflated effective sample size.
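As a minimal sketch of that ordering (average over seeds per example first, then aggregate over examples), assuming every seed is scored on the same examples in the same order; the function name and scores are illustrative:

```python
from statistics import mean


def pool_seeds(per_seed_scores):
    """Seed-average each example, then aggregate to one per-model number.

    per_seed_scores: list over seeds of per-example score lists,
    all aligned to the same example order.
    """
    per_example = [mean(scores) for scores in zip(*per_seed_scores)]
    return per_example, mean(per_example)


# Three hypothetical seeds scored on the same four examples.
seed_scores = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
]
per_example, model_score = pool_seeds(seed_scores)
print(per_example, model_score)
```

The per-model statistic is computed over four examples here, not twelve seed-example rows, which is what keeps the effective sample size from being inflated.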

The evaluation grid is 20 frontier evals across 8 capability families: Referring & Grounding, General VQA, OCR & Document, Captioning, Spatial & 3D, Counting, Chart & Diagram, and Math (BLINK and BrandID are aggregate-only and appear only at the highest level of aggregation). Each eval has a single headline metric (recall@0.5 for COCO-like grounding tasks, accuracy for MCQ, CAPTURE-MicroOverall for DetailCaps, F1-MicroOverall for CAPability, normalized OCRBench, and so on), normalized to [0,1].

M.2. Inference cost model

The headline cost numbers in this post come from a decode-dominated FLOPs proxy,

\[\mathrm{FLOPs}_{\text{response}} \approx 2\,N\,n_{\text{out}},\]

where N is active parameters and \(n_{\text{out}}\) is the number of generated tokens. We use mean generated tokens per response as the central tendency throughout, for two reasons. Mean is the right quantity for cost forecasting, since hosting bills sum over tokens, not percentiles. And it is the central tendency that makes the inference-cost claim concrete: “this model costs $X per million correct answers”, not “this model sits in some percentile of cost-efficiency”.

Scope of the proxy.

The \(2\,N\,n_{\text{out}}\) term is the decode forward-pass cost of the weight multiplications, linear in active parameters and output length. It omits the attention-score computation, whose total cost is quadratic in sequence length, and it assumes compute scales with active parameters, setting aside the bandwidth and memory-capacity limits that dominate for larger or sparsely-activated models. For the small dense VLMs we study, with responses of tens of tokens, both omissions are negligible and the proxy is accurate; for the long-output reasoning models they are not. Attention is a non-trivial share of Qwen3.5-4B’s ~1,300-token responses, so the proxy understates their cost, and because the understatement grows with output length, every Cost-of-Pass gap we report between a concise model and a verbose one is conservative.

Two pricing lanes.

  • Provider-hosted (for frontier open models): dollar cost per output token is the average of Together and Fireworks public per-token rates as of the data-collection date, fixed per-model.
  • Self-hosted (for Datology and a head-to-head appendix on the frontier set): a memory-bandwidth roofline, \[\text{cost/token} = \frac{\$2.5/\text{h}}{3600\ \text{s/h}} \cdot \frac{2N}{B \cdot \mathrm{BW}},\] with BW = 3.35 TB/s (H100), 2 bytes/param (FP16), so that 2N is the bytes of weights read per decode step, and B the number of sequences in the batch. Decode is memory-bandwidth-bound: all sequences in a batch share a single weight-loading pass through HBM, so wall-clock time per token is approximately independent of batch within the bandwidth bound, but that wall-clock cost amortizes across the B sequences, so per-token dollar cost scales as 1/B. We use B = 64 as a log-anchored placeholder; the 1/B factor is baked into all CoP figures in this post, and cost ratios between models are batch-invariant.
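The self-hosted lane can be written as a short function. This is a sketch of the roofline as described in the bullet above (GPU price, HBM bandwidth, FP16 weights, batch size); the parameter counts passed in are round illustrative numbers, not the exact activated-parameter counts of any model in the pool.

```python
def roofline_cost_per_token(active_params, batch=64, dollars_per_hour=2.5,
                            bandwidth_bytes_per_s=3.35e12, bytes_per_param=2):
    """Dollar cost of one decoded token under a memory-bandwidth roofline."""
    # One decode step reads every active weight once from HBM.
    seconds_per_step = bytes_per_param * active_params / bandwidth_bytes_per_s
    dollars_per_second = dollars_per_hour / 3600
    # The step produces one token for each of the `batch` sequences.
    return dollars_per_second * seconds_per_step / batch


# Illustrative parameter counts (assumed, for the arithmetic only).
cost_4b = roofline_cost_per_token(4e9)
cost_2b = roofline_cost_per_token(2e9)

# Ratios between models do not depend on the batch size chosen.
ratio_b64 = cost_4b / cost_2b
ratio_b8 = roofline_cost_per_token(4e9, batch=8) / roofline_cost_per_token(2e9, batch=8)
print(cost_4b, ratio_b64, ratio_b8)
```

Changing `batch` rescales every model's cost by the same factor, which is why the choice of B = 64 affects the dollar figures but not the ratios.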

From these we derive Cost-of-Pass, the dollar cost per correct answer on a given eval:

\[\mathrm{CoP} = \frac{(\text{cost per output token}) \cdot n_{\text{out}}}{\text{accuracy}}.\]

CoP is the central cost number in the post: its units (dollars per correct answer) are exactly what the “brevity saves you money” framing needs.

Dollar Cost-of-Pass, worked.

Instantiating CoP under the self-hosted roofline ($2.5/hr, BW = 3.35 TB/s, FP16, B = 64),

\[\$/\text{M correct} = 10^{6} \cdot \frac{\$2.5/3600}{B \cdot \mathrm{BW}} \cdot \frac{2N \cdot \bar{n}_{\text{out}}}{\text{accuracy}},\]

a million correct answers cost $0.33 from Datology 1B, $0.87 from Datology 2B, $1.34 from Datology 4B, $3.16 from Qwen3.5-2B, and $47.24 from Qwen3.5-4B: the same ~35× ratio between the curated 4B and Qwen3.5-4B that the FLOPs accounting reports, plus a ~10× ratio between the curated 1B and Qwen3.5-2B at comparable accuracy. A million correct answers from Datology 4B costs roughly the price of a soda; from Qwen3.5-4B, roughly a tank of gas. The dollar cost is linear in FLOPs/correct at fixed batch and bandwidth, so the dollar ratios equal the FLOPs ratios and the headline gaps are invariant to reasonable hardware assumptions.
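The dollar figures can be checked against the FLOPs-per-correct numbers reported earlier in the post (0.41 and 14.58 TFLOPs per correct answer for the two 4B models). The sketch relies on one observation: at 2 bytes per parameter, the bytes read per token (2N) are numerically equal to the FLOPs proxy per token (2N), so FLOPs per correct can stand in for 2N·n̄_out/accuracy. The function name is illustrative, and small differences from the published dollar values come from the TFLOPs inputs being rounded.

```python
def dollars_per_million_correct(flops_per_correct, batch=64, dollars_per_hour=2.5,
                                bandwidth_bytes_per_s=3.35e12):
    """Self-hosted roofline cost of one million correct answers.

    flops_per_correct: 2 * N * mean_output_tokens / accuracy. Under FP16 this
    equals the bytes of weights read per correct answer, which is what the
    bandwidth roofline actually prices.
    """
    dollars_per_second = dollars_per_hour / 3600
    seconds_per_correct = flops_per_correct / (batch * bandwidth_bytes_per_s)
    return 1e6 * dollars_per_second * seconds_per_correct


datology_4b = dollars_per_million_correct(0.41e12)   # 0.41 TFLOPs per correct
qwen35_4b = dollars_per_million_correct(14.58e12)    # 14.58 TFLOPs per correct
print(round(datology_4b, 2), round(qwen35_4b, 2), round(qwen35_4b / datology_4b, 1))
```

Because the same constant multiplies both inputs, the dollar ratio is exactly the FLOPs ratio, 14.58 / 0.41.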

M.3. Quality measurement: what we test, what we don’t claim

The brevity-saves-money claim has two halves: brevity reduces cost (§M.2), and brevity doesn’t sacrifice the things you’d worry about losing. The second half needs a quality measurement that is fair across length distributions, so verbose models aren’t penalized for verbosity itself, and that separates the dimensions that matter: a model that is concise but omits things is failing differently from one that is concise and grounded.

We instrument two captioning evaluations from the suite: DetailCaps (precision-side: did the model assert things that are correct?) and CAPability (recall-side: did the model cover the things it should?). They’re chosen because they’re the captioning evals in our frontier set with per-example signals we can manipulate, and because length is the natural axis they vary along.

Boundary of inquiry.

All our quality signals in this section are text-based: model caption against reference captions (DetailCaps precision) or against a single ground-truth object label (CAPability recall). We are explicitly not doing image-grounded measurement here: the judge does not see the image. Where this changes how a result should be read, we caveat. See §M.4 for the long discussion of what this rules in and out for the hallucination claim specifically.

M.4. Why lexical-overlap precision is the wrong tool, and what an LLM judge does instead

We tried two precision measurements on DetailCaps and got incompatible answers. Working out which one to trust is itself the story.

Method 1: lexical-overlap object precision.

For each (caption, references) pair: extract content lemmas from the caption (lowercase alphabetic tokens ≥3 chars, drop stopwords and caption-boilerplate, light plural-to-singular fold), do the same for the union of the three references, and compute precision = |mentioned ∩ supported| / |mentioned|. This is a fast deterministic proxy for CAPTURE-style scoring that doesn’t need the CAPTURE dependency stack. It’s also the standard cheap object-precision proxy that shows up in a lot of hallucination tooling.
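A rough sketch of such a proxy follows. The stopword and boilerplate lists, the plural-folding rule, and the choice to treat lemmas as a set are all stand-ins chosen for the example; the post does not give its exact lists or rules.

```python
import re

# Tiny illustrative lists; the actual stopword and boilerplate lists are not given.
STOPWORDS = {"the", "and", "with", "for", "that", "this", "are", "was", "has"}
BOILERPLATE = {"image", "photo", "picture", "shows", "depicts"}


def content_lemmas(text):
    """Lowercase alphabetic tokens of 3+ chars, filtered and lightly singularized."""
    lemmas = set()
    for tok in re.findall(r"[a-z]+", text.lower()):
        if len(tok) < 3 or tok in STOPWORDS or tok in BOILERPLATE:
            continue
        # Light plural fold (assumed rule): drop a trailing "s" but not "ss".
        if len(tok) > 3 and tok.endswith("s") and not tok.endswith("ss"):
            tok = tok[:-1]
        lemmas.add(tok)
    return lemmas


def lexical_precision(caption, references):
    """|mentioned ∩ supported| / |mentioned| over content lemmas."""
    mentioned = content_lemmas(caption)
    supported = set().union(*(content_lemmas(r) for r in references))
    if not mentioned:
        return 0.0
    return len(mentioned & supported) / len(mentioned)


refs = ["The red car is parked on a street",
        "Trees line the street",
        "A car near a tree"]
short_caption = "A red car parked near two trees"
verbose_caption = ("Let me identify the subject. The central figure appears "
                   "to be a red car parked near two trees")

print(lexical_precision(short_caption, refs))
print(lexical_precision(verbose_caption, refs))
```

The two captions make the same factual claims, but the second adds scaffolding words that enter the denominator without any chance of matching a reference, so its score drops.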

Method 2: a frontier LLM as a claim-level judge.

For each (caption, references) pair: ask the judge (Claude Haiku 4.5, temperature 0) to decompose the caption into atomic factual claims, then label each as one of three:

  • supported: some reference confirms or paraphrases it semantically (lexical match not required).
  • unsupported: some reference contradicts it, or the candidate asserts the existence or identity of an object or attribute that no reference mentions and that should reasonably be visible. This is the hallucination signal.
  • unverifiable: the claim is about something the references genuinely do not address (subjective mood, emotion, intent, abstract interpretation, off-camera inference, photographer style). Excluded from the precision denominator entirely.

Precision = supported / (supported + unsupported). Temperature 0, deterministic across re-runs. The prompt was smoke-validated by hand against actual images for 200 calls before committing to the full batch.
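Once the judge has labeled each claim, the score itself is a one-line ratio. The sketch below covers only that aggregation step (not the judge call or the prompt), with a made-up label list; returning `None` when nothing is scorable is an assumption of the example.

```python
from collections import Counter


def judge_precision(labels):
    """supported / (supported + unsupported); unverifiable claims are excluded."""
    counts = Counter(labels)
    scorable = counts["supported"] + counts["unsupported"]
    if scorable == 0:
        return None  # assumed convention: no scorable claims, no score
    return counts["supported"] / scorable


# Hypothetical verdicts for one caption.
labels = ["supported"] * 17 + ["unsupported"] * 3 + ["unverifiable"] * 2
print(judge_precision(labels))
```

Adding more `unverifiable` labels to the list leaves the score unchanged, which is the property that separates "outside the references' scope" from "wrong".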

The headline disagreement.

Across all 14 models on the full 4,870 DetailCaps examples (68,180 judge calls in the full batch), Table M.1 reports the split:

Table M.1: Lexical and LLM-judge precision disagree on which models hallucinate. The two methods even rank Qwen3.5-4B oppositely (worst under lexical, best under the judge).

The cross-model rank-agreement number is the one that decides it: the two methods disagree on which models are more precise, not just on what “precise” numerically means. They rank Qwen3.5-4B as the worst (lexical, precision 0.313) and the best (LLM, precision 0.879) of the 14. That’s not a calibration disagreement; that’s a measurement disagreement.

Worked example: Qwen3.5-4B on a typical caption.

Qwen3.5-4B writes ~5,200-character captions with structured reasoning preambles (“The user wants a detailed description… 1. Identify the main subject: … 2. Analyze the central figure: …”). On the Monopoly-Man graffiti image we hand-inspected in the smoke review:

  • Lexical method counts every content lemma. A 5,200-char caption produces ~600+ content lemmas after stopword filtering, including reasoning scaffolding (“identify”, “subject”, “central”, “figure”, “analysis”) and meta-commentary (“appears”, “depicts”, “rendered”). Many of these aren’t in the human references, so they count as unsupported by the denominator-inflation mechanism. The proxy returns precision ≈0.30 for this caption.
  • LLM judge extracts ~20 atomic factual claims: “the artwork is graffiti on a wall” (supported), “the central figure is a Monopoly Man” (supported), “to the left is a vertical wooden beam” (unsupported, references don’t mention a beam), “the style is reminiscent of 1980s graffiti” (unverifiable, interpretive). It ignores the scaffolding because it isn’t a factual claim. Judge returns precision =0.85.

Why the mechanisms diverge.

Both methods compute supported / total, but they define total differently:

  • Lexical “total” = bag of content lemmas, which scales roughly linearly with caption length. Adding structure, hedge words, or reasoning prose inflates the denominator without contributing supportable claims.
  • LLM “total” = number of atomic factual assertions, which scales much more slowly with length. In our data, Qwen3.5-4B’s 5,229-char captions average 18 claims; InternVL3.5-4B’s 752-char captions average 13.3 claims: 7× the characters, 1.35× the claims.

Length inflates the lexical denominator faster than the supportable numerator can keep up, mechanically lowering precision for verbose models. This fully explains the +0.94 cross-model length-vs-hallucination correlation under the lexical method: it’s a length-counting artifact, not a hallucination signal.

Why we trust the LLM-judge result for cross-model claims.

Despite known LLM-judge limitations:

  1. Smoke-validated against actual images. We hand-checked 200 calls and their per-claim verdicts against the source images and confirmed the judge correctly catches misidentifications, OCR errors, and false attribute claims; correctly routes subjective material to unverifiable; rarely over-decomposes; and rarely fabricates claims not in the candidate (~1 minor instance in 200 calls).
  2. Temperature 0, deterministic, no reproducibility drift between calls or re-runs.
  3. The unverifiable bucket explicitly separates what the references don’t address from what the candidate got wrong. The lexical proxy cannot make this distinction; any candidate content not in the references is forced into the unsupported pile regardless of whether it’s a hallucination or simply outside the references’ scope.

Where we drew the boundary of inquiry.

Both measurements compare caption text to reference text. Neither sees the image. We’re claiming that, under the operational definition of ground truth used by the DetailCaps benchmark (that the three human reference captions accurately describe the image), the LLM judge gives cross-model precision scores that don’t track length. What would invalidate that claim is the references themselves diverging from the image: a reference says “yellow car” when the image contains a blue one, or omits a clearly-visible elephant. Verifying reference fidelity is exactly what an image-grounded judge would test, and we treat it as out of scope here because the benchmark-quality question (“are DetailCaps’ references a faithful representation of the image?”) and the brevity question (“does verbosity buy precision in modern VLMs?”) are conceptually separable, with an image-grounded judge answering the former.

An image-grounded study, canonical CHAIR with COCO instance masks or a VLM judge that sees the image, could give a different cross-model length-vs-precision number. We don’t pre-empt that and we don’t claim to overturn image-grounded prior literature on length and hallucination. The honest position for a reader: if you have prior reasons to doubt DetailCaps’ references as a proxy for image content, the right place to discount our result is the benchmark-design discussion, not the brevity-vs-precision finding.

M.5. Why we don’t use the existing CAPability judge for hallucination

The CAPability eval already ships with a Qwen3-30B-A3B-Instruct-2507 judge whose per-example scores are persisted in the eval output. We use them, but for recall, not precision. The judge is given a single ground-truth object label (e.g. “stain”, “arch”) and asked whether the caption mentions this target. Score =1 (yes), 0 (no), −1 (judge couldn’t decide).

That’s a coverage check. “No” means the model omitted the target, not that the model hallucinated. The judge is not testing whether the caption made any false claims, only whether it made enough claims to cover the target list.

Using this directly as a hallucination signal would be a mistake, and tempting because the scores are already computed. It would also push in the wrong direction: verbose models naturally cover more targets and would appear less hallucinatory by this measure simply because they say more things. The same measurement is informative as recall, and that’s how we use it: per-model fraction of capability targets correctly covered, category-balanced over the 9 capability dimensions, with the −1 rows excluded from the rate denominator. Together, DetailCaps LLM-judge precision and CAPability text-judge recall are the two halves of the brevity-doesn’t-cost-anything claim.
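As a minimal sketch of the recall computation described above: the function below assumes "category-balanced" means an unweighted mean of per-category coverage rates, which the text does not spell out. The function name, row layout, and category names are hypothetical.

```python
from collections import defaultdict

def category_balanced_recall(rows):
    """rows: iterable of (category, score) pairs with score in {1, 0, -1}.

    Assumption: the balanced rate is the plain mean of per-category rates.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, score in rows:
        if score == -1:
            # Judge could not decide: excluded from the rate denominator.
            continue
        totals[category] += 1
        hits[category] += int(score == 1)
    per_category = [hits[c] / totals[c] for c in totals]
    return sum(per_category) / len(per_category)

# Hypothetical rows: one undecided "object" row is dropped.
rows = [
    ("object", 1), ("object", 0), ("object", -1),
    ("color", 1), ("color", 1),
]
recall = category_balanced_recall(rows)  # object 1/2, color 2/2
```

Averaging per category rather than per row keeps a dimension with many targets from dominating the rate.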

M.6. Per-example metrics for the verbosity dissociation

The §3.4 dissociation runs on all 18 evaluations at 2B and 4B; seeds for the internal models are majority-pooled.

Binary correctness. For 17 of 18 evals we threshold the per-example score at 0.5 (binary outcomes pass through; the few continuous-in-[0,1] outcomes, DocVQA, TextVQA, and PixMo Points, become binary at this cut), and drop the CAPability judge-undecided (−1) rows. For DetailCaps, where no rule-based per-example score is available, we use frontier LLM-judge precision and call a caption correct if precision ≥0.7 (≥70% of the atomic factual claims supported by the human references, unverifiable claims excluded; §M.4). Threshold knob: at 0.5, 95–99% of captions pass and models are indistinguishable; at 0.85, only 37–66% pass; at 0.7 the pass rate is 70–93%, the regime where models meaningfully differ, and the qualitative pattern is stable from 0.6 to 0.8.

Mean accuracy and pairwise deltas (Q−D, M−D, Q−M). Per (model, benchmark, scale), the mean of the per-example correctness label across examples. The three differences isolate distinct effects:

  • Q−D: the deployment-relevant question, whether Qwen3.5 produces more correct answers per query than the curated model at fixed scale.
  • M−D: the verbosity-without-reasoning effect (M emits ~60 non-chain-of-thought tokens, D ~30). If M−D is negative everywhere, generic verbose tokens do not buy correctness.
  • Q−M: the reasoning-content effect controlling for verbosity (Q and M both emit non-trivial tokens; Q’s are reasoning-structured).

Regime classification. Each (scale, capability) comparison is labeled by the pair (Q−D,M−D) at a 2-pp threshold: Reasoning helps (Q−D>2, M−D within ±2); Reasoning helps, generic verbose hurts (Q−D>2, M−D<−2); Concise wins (both <−2); Filler (both within ±2); and Mixed / inconclusive otherwise. Lowering the 2-pp cut sharpens the named-regime assignments but is noisier; raising it pushes more comparisons to “Mixed.”
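The regime rules can be sketched as a small function. This is an illustration, not the authors' code: the function name is hypothetical, and treating a delta of exactly ±2 pp as "within ±2" is an assumption the text does not settle.

```python
def classify_regime(q_minus_d, m_minus_d, cut=2.0):
    """Label one (scale, capability) comparison from two deltas in pp."""
    q_up = q_minus_d > cut
    q_down = q_minus_d < -cut
    q_flat = abs(q_minus_d) <= cut   # assumption: boundary counts as flat
    m_down = m_minus_d < -cut
    m_flat = abs(m_minus_d) <= cut   # assumption: boundary counts as flat
    if q_up and m_flat:
        return "Reasoning helps"
    if q_up and m_down:
        return "Reasoning helps, generic verbose hurts"
    if q_down and m_down:
        return "Concise wins"
    if q_flat and m_flat:
        return "Filler"
    return "Mixed / inconclusive"
```

Passing a larger `cut` moves borderline comparisons out of the named regimes and into "Filler" or "Mixed / inconclusive".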

A. Matched-scale captioning quality

The blog’s central positive claim is on the matched-scale controlled contrast: Datology 2B vs Mammoth 2B. Same Qwen3-1.7B backbone. Same VL training recipe. Same 4k-token context. Identical setup in every dimension except pretraining data composition. This isolates the curation effect from scale, architecture, recipe, and context-length confounds.

On this pair the matched-scale captioning metrics line up (Table A.1):

Table A.1: Captioning-specific metrics on the controlled 2B pair. Precision and recall favor Datology by a small but robust margin (precision +0.018 is ~4.5× the cross-seed std; the per-seed bands don’t overlap); the large magnitudes are in brevity (59% shorter) and total hallucinated mentions (19% fewer).

At matched scale, curation produces a caption that is equal-or-better on the things we measure as quality and substantially cheaper on the things we measure as cost. That’s the strict-Pareto-improvement claim, and it survives every methodology choice in §M. The frontier 4B comparisons (Qwen3.5-4B, Qwen3-VL-4B, InternVL3.5-4B) are honest mixed pictures: verbose 4B models match or beat Datology 2B on the quality axes but at much higher per-token cost, and the main text treats them as such, not as wins.

The per-caption hallucinated-mention count decomposes into a length effect (longer outputs accumulate more hallucinations because there are more chances to err) and a per-token-rate effect (does each unit length carry more or fewer hallucinations?). For the curated–baseline contrast with $\Delta H = H_{\text{Datology}} - H_{\text{Mammoth}}$ and $H = n \cdot r$:

$$\Delta H = \underbrace{(n_{\text{Datology}} - n_{\text{Mammoth}}) \cdot r_{\text{Mammoth}}}_{\text{length effect}} + \underbrace{n_{\text{Datology}} \cdot (r_{\text{Datology}} - r_{\text{Mammoth}})}_{\text{per-token-rate effect}}.$$

On the controlled pair: length effect = −1.46, per-token-rate effect = +0.99, net ΔH = −0.47. At matched scale, the curated model produces fewer total hallucinated mentions per caption by saying less, not by saying each thing more faithfully; per character, its grounding is essentially Mammoth’s. We are therefore not claiming that curation teaches the captioning model to hallucinate less per token. The claim is narrower: the accuracy improvement persists through length control on the 18-eval suite (the +17.6 pp pooled AME), and on the long-form captioning subset the hallucination win is explained by brevity.
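The two-term split is easy to check numerically. The sketch below uses made-up lengths and rates (the post reports only the resulting effects, not the underlying n and r), and the function name is hypothetical.

```python
def decompose_delta_h(n_cur, r_cur, n_base, r_base):
    """Split H_cur - H_base (with H = n * r) into length and rate effects."""
    length_effect = (n_cur - n_base) * r_base
    rate_effect = n_cur * (r_cur - r_base)
    return length_effect, rate_effect

# Illustrative inputs only, not values from the post.
n_cur, r_cur = 40.0, 0.05     # curated: shorter, slightly higher rate
n_base, r_base = 100.0, 0.04  # baseline: longer, slightly lower rate

length_effect, rate_effect = decompose_delta_h(n_cur, r_cur, n_base, r_base)
delta_h = n_cur * r_cur - n_base * r_base

# The same additivity holds for the reported effects: -1.46 + 0.99 = -0.47.
reported_net = -1.46 + 0.99
```

With these inputs the length effect is negative and the rate effect is positive, the same sign pattern as the controlled pair.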

B. Output length: per-model distributions and accuracy splines

Figure B.1 shows the per-response output-length distribution for every tested model, split by whether the task calls for a short answer or a long one. The curated models collapse to a few tokens on concise tasks and lengthen only for captioning, while the reasoning models stay long regardless; this is the operating-length gap that the cost analysis (§3.1) turns into a Cost-of-Pass gap.

Figure B.1: Curated models stay concise when the task is concise; verbose models don’t. Per-response output-length distribution for every tested model, split by task type: concise tasks (MCQ, VQA, OCR, math, …; top half of each violin) versus verbose captioning tasks (DetailCaps and CAPability; bottom half). Models are sorted by concise-task median; the white and grey ticks mark the concise and verbose medians. DatologyAI and Mammoth collapse to a few tokens on concise tasks and lengthen only for captioning, whereas the reasoning models stay long even when a single token would do: Qwen3.5-4B never drops below a few hundred tokens, which is why it has no concise operating point (§3.3).

Figures B.2 and B.3 plot the fitted length–accuracy splines behind the matched-scale (§3.2) and frontier (§3.3) length-controlled AMEs. The black curve in each is the length response pooled across the models shown (one shared spline plus an additive per-model offset), so the models differ by the AME, not by the shape of their length response; each model’s mean output length is a dot on the curve, and its length distribution is the strip beneath.

Figure B.2: At matched scale, the accuracy gap is model identity, not length. Marginal Pr(correct) against output length (black curve, controlling for benchmark); each model’s mean output length (the 20-eval macro-mean, the same quantity the cost analysis uses) is a dot on the curve, and its full length distribution is the strip along the bottom. The black curve is pooled across both models: the GLM fits a single shared length spline plus an additive model term, so both follow the same curve shape and differ only by a model-level offset (the AME), not by how accuracy responds to length. DatologyAI averages ∼39 tokens and Mammoth ∼62 (the curated model is the shorter of the pair); both means sit on the flat part of the curve (the suite is dominated by short-answer benchmarks, with a long-form captioning tail), so the length difference contributes essentially nothing to the raw mean gap (raw +17.1 ≈ AME +17.6). The annotated AME is the model-identity coefficient, the curation effect with length held fixed, not a length artifact. The GLM is specified in §M.3.

Figure B.3: Length-accuracy splines for all eight frontier comparators. Each panel pairs the matched-scale curated model (2B or 4B) with one external model: marginal Pr(correct) against output length (black curve, controlling for benchmark), each model’s mean output length (the 20-eval macro-mean) a dot on the curve, and its full length distribution the shaded strip along the bottom. In every panel the black curve is pooled across the two models shown (one shared length spline plus an additive model offset), so they differ by the AME, not by the shape of their length response. The annotated AME (the model-identity gap at matched length) ranges from −7 pp against Qwen3.5-2B to +10 pp against InternVL3-2B. The bottom-right pair, Datology 4B vs Qwen3.5-4B, is the one that fails the length-overlap test (2.2%; Table D.3): the two models barely share an operating range, so its AME is an extrapolation and the pair is excluded from the main-text comparison (Table 4).

C. Refusal is not the mechanism: the length gap is response-conditional

The observed Cost-of-Pass advantage rests on a lower mean output length. A reading consistent with those numbers but worse for the brevity story would be that the curated model’s lower mean arises because it refuses to answer more often. Refusals tend to be terse (e.g. “I can’t help with that”, “I cannot see X from this image”), so a model that refuses an extra ten percentage points of the time would mechanically post a lower mean output length without changing its response-conditional behavior. Under that mechanism, §3.1’s cost advantage would describe a safety / behavior shift rather than inference efficiency in earnest.

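This hypothesis is testable. The mean-output-length gap $\bar{n}_a - \bar{n}_b$ between the curated model ($a$) and a comparator ($b$) decomposes additively into three terms:

$$
\begin{aligned}
\bar{n}_a - \bar{n}_b ={}& \underbrace{\big(P_a(R) - P_b(R)\big) \cdot \big(E_b[\mathrm{len} \mid R] - E_b[\mathrm{len} \mid r]\big)}_{\text{(i) refusal-rate shift}} \\
&+ \underbrace{P_a(R) \cdot \big(E_a[\mathrm{len} \mid R] - E_b[\mathrm{len} \mid R]\big)}_{\text{(ii) refusal length, given refusing}} \\
&+ \underbrace{\big(1 - P_a(R)\big) \cdot \big(E_a[\mathrm{len} \mid r] - E_b[\mathrm{len} \mid r]\big)}_{\text{(iii) response length, given responding}}
\end{aligned}
$$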

where P(R) is the probability of refusing and r denotes responding. Term (i) captures “refuses more often”; (ii) captures “refuses more tersely”; (iii) captures “responds more concisely.” Under the safety-shift reading, term (i) would dominate. Under the inference-efficiency reading, term (iii) would dominate.
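A short sketch of the three-term decomposition, with each model summarized by its refusal probability and its mean lengths given refusing and given responding. The function names and all input values are illustrative assumptions, not figures from the tables.

```python
def mean_length(p_refuse, len_refuse, len_respond):
    """Mean output length as a mixture of refusals and responses."""
    return p_refuse * len_refuse + (1 - p_refuse) * len_respond

def decompose_length_gap(p_a, ref_a, resp_a, p_b, ref_b, resp_b):
    """Return terms (i), (ii), (iii) of the gap mean_a - mean_b."""
    rate_shift = (p_a - p_b) * (ref_b - resp_b)     # (i) refuses more often
    refusal_length = p_a * (ref_a - ref_b)          # (ii) refuses more tersely
    response_length = (1 - p_a) * (resp_a - resp_b) # (iii) responds more concisely
    return rate_shift, refusal_length, response_length

# Hypothetical models: a is curated, b is a verbose comparator.
p_a, ref_a, resp_a = 0.02, 60.0, 200.0
p_b, ref_b, resp_b = 0.05, 80.0, 1000.0

terms = decompose_length_gap(p_a, ref_a, resp_a, p_b, ref_b, resp_b)
gap = mean_length(p_a, ref_a, resp_a) - mean_length(p_b, ref_b, resp_b)
shares = [t / gap for t in terms]  # fraction of the gap carried by each term
```

Because the three terms sum exactly to the gap, the shares sum to one; a share can be negative when its term pushes against the overall gap.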

Refusal classification combines a rule-based regex pass (matching twelve VLM-abstention patterns: “I cannot see”, “is not visible”, “I’m sorry, but”, etc.) with a frontier LLM-judge cleanup that distinguishes genuine abstentions from substantive negative-existence answers that happen to contain refusal-like phrasing (e.g. “there are no zebras in this image”, “the text is not legible from this angle”). The cleanup matters more than expected: on the eval-suite sample the regex pass alone flagged ~13.5k responses as refusals, while the LLM judge ultimately classified only ~5% of those as genuine abstentions and the remaining ~95% as substantive answers with hedge phrasing. Without the cleanup the decomposition would have wildly over-counted refusal rates.
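To illustrate why the regex pass over-flags, here is a minimal sketch using only the three patterns quoted above (the full list of twelve is not given in this post). The function name and example responses are hypothetical, and the LLM-judge cleanup step is not reproduced.

```python
import re

# Three of the twelve abstention patterns; the rest are not listed here.
PATTERNS = [
    r"\bI cannot see\b",
    r"\bis not visible\b",
    r"\bI[’']m sorry, but\b",
]
REFUSAL_RE = re.compile("|".join(PATTERNS), re.IGNORECASE)

def regex_flags_refusal(response):
    """First-pass flag only: True means 'send to the judge', not 'refusal'."""
    return REFUSAL_RE.search(response) is not None

genuine = "I'm sorry, but I cannot help with that."
substantive = "The license plate is not visible, but the car is a red sedan."
plain = "A red sedan parked on a street."

flags = [regex_flags_refusal(t) for t in (genuine, substantive, plain)]
```

The second response is a substantive answer that happens to contain refusal-like phrasing, which is the kind of case the judge cleanup is meant to reclassify.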

Eval-suite decomposition: 98–100% of the length gap is response-conditional.

The primary sample is the full eval suite, the per-example responses from the same 18-eval frontier suite whose mean output lengths produce §3.1’s cost numbers. Each contrast operates on the slice of examples both models actually evaluated.

Table C.1: Three-term decomposition of the mean-output-length gap on the eval suite: response-conditional brevity (term iii) carries 98–100% of every contrast. Each row pairs the curated model (a) against a comparator (b) on the examples both evaluated (n paired). $P_a(R)$ and $P_b(R)$ are each model’s refusal probability after LLM-judge cleanup of the regex pass; $\bar{n}_a$ and $\bar{n}_b$ are mean output lengths in characters; $\Delta\,\mathrm{len} = \bar{n}_a - \bar{n}_b$. Columns (i)–(iii) give the share of Δ len carried by the refusal-rate shift, refusal length given refusing, and response length given responding, respectively; shares may not sum to exactly 100 due to rounding.

The first thing Table C.1 shows is mechanical: the curated model essentially never refuses on standard evaluations. Across 81k probes at 4B, the LLM judge identifies zero genuine refusals from the curated model; across 81k at 2B, also zero; at 1B, one. Comparator refusal rates are higher but still small (0.04%–0.42% across the seven contrasts).

With refusals that rare, the decomposition is one-sided. Term (iii), response-conditional brevity, carries between 98.2% and 100% of the mean-output-length gap in every contrast. Term (i), the refusal-rate shift, contributes at most 1.8% in either direction, and in three of seven contrasts it is negative: the curated model refuses slightly less often than the comparator, so the rate effect widens the length gap rather than closing it. Term (ii), conditional refusal length, is essentially zero in every contrast, because the curated model’s refusal probability is so low that the term’s $P_a(R)$ multiplier extinguishes whatever per-refusal length difference exists.

In headline terms: Datology 4B is 199 characters shorter than Mammoth 4B per response on the eval suite, and 99.9% of that gap is response-conditional. Against Qwen3.5-4B the gap is 2,969 characters, of which 99.3% is response-conditional. On the actual eval traffic that produces §3.1’s cost numbers, the brevity advantage isn’t bought via refusal; it is bought one shorter response at a time.

Refusal-enriched bundle: response-conditional brevity holds on the harder test.

The eval suite contains few refusal-eliciting probes: the 18 evals are standard VLM tasks where any of the models studied refuses at well under 1%. A skeptical reading would be that the absence of refusals here doesn’t rule out a refusal-shift mechanism; perhaps the curated model would refuse more often on probes specifically designed to elicit refusals, and the standard evaluations are too soft a sample to surface that.

Testing that reading requires a sample enriched for refusals, which is what the 1,129-probe curated-comparison bundle provides. It was constructed during an April 2026 internal red-team exercise, with approximately 10% of probes (the refusal-calibration and termination-budget groups) explicitly designed to elicit refusal-vs-respond decisions. Refusal rates on the bundle are an order of magnitude higher than on the eval suite: the curated model refuses on 4.5% of bundle probes, Mammoth on 3.2%, Qwen3-VL-2B-Instruct on 15.2%. On a sample deliberately built to expose a refusal-rate difference, the comparison is discriminating in a way the eval suite is not (Table C.2).

Table C.2: The same three-term decomposition on the 1,129-probe curated-comparison bundle, the refusal-enriched harder test: term (iii) still carries the majority of the gap. Cell entries are each term’s contribution to the mean-output-length gap in characters, with its share of the total gap in parentheses.

Even on the enriched sample, term (i), the refusal-rate shift, contributes essentially zero (−0.2% vs Mammoth, −1.9% vs Qwen3-VL-2B). Term (ii), refusal length given refusing, carries 6–26%, non-negligible but only because both models actually refuse with measurable frequency on calibration probes, raising the $P_a(R)$ multiplier from the eval-suite vanishing point. Term (iii), response-conditional brevity, carries the majority of the gap in both contrasts (75% vs Mammoth, 92% vs Qwen3-VL). The discriminating sample reproduces the eval-suite conclusion at lower numerical intensity.

Notably, the curated–Mammoth refusal-rate gap is 1.3 pp (4.5% vs 3.2%), a difference whose 95% confidence interval includes zero at n=1,093; and against Qwen3-VL the curated model refuses less often (4.5% vs 15.2%), so the rate effect actively opens the length gap rather than closing it. The length advantage in both contrasts is response-conditional.
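As a rough plausibility check on the curated–Mammoth gap, the sketch below computes a Wald interval for a difference of two proportions. It assumes two independent samples of 1,093 probes each; the post does not state which interval it used, and because both models saw the same probes a paired method could be more appropriate. The function name is hypothetical.

```python
import math

def diff_of_proportions_ci(p1, p2, n1, n2, z=1.96):
    """Wald interval for p1 - p2, assuming independent samples."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se

# Refusal rates from the text; equal sample sizes are an assumption.
low, high = diff_of_proportions_ci(0.045, 0.032, 1093, 1093)
includes_zero = low < 0.0 < high
```

Under these assumptions the interval's half-width is about 1.6 pp, wider than the 1.3 pp gap.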

A scope note: neither sample is representative of in-the-wild deployment traffic. The bundle is hand-curated for capability surfacing; the eval suite is a public-eval distribution. A model with a different refusal calibration might refuse much more often on real user queries than on either of these distributions. What the analysis here demonstrates is the narrower, sufficient claim: that §3.1’s length advantage is not bought via refusal-rate shift on the populations the cost numbers were computed on. A future characterization on deployment traffic would address the population-rate question this section does not.

The brevity advantage in §3.1 is response-conditional, not refusal-driven. Curation-induced brevity isn’t ascribable to increased refusal; when the curated model does respond, the responses themselves are shorter.

D. Per-benchmark and per-capability result tables

This appendix collects §3.4’s supporting tables: the per-benchmark grid (Table D.1, below), the per-capability triple-dissociation table (Table D.5), and the per-example contingency rates (Table D.6). In the grid, each row is one (scale, benchmark) cell; capability groups follow the §M.1 ordering and benchmarks within a group are sorted alphabetically by display name. Numbers are mean per-example correctness (per-example score thresholded at 0.5 for 17 evals; LLM-judge precision thresholded at 0.7 for DetailCaps; seeds majority-pooled for the internal models). Per-benchmark numbers are benchmark-pooled (each example weighted equally within a benchmark) and so differ slightly from the example-pooled capability numbers in Table D.6.

Column key.

  • D / M / Q: mean accuracy for Datology, Mammoth, and Qwen3.5 at the matched scale (%, 0–100).
  • Q−D / M−D: percentage-point deltas vs Datology (positive = verbose model is more accurate on average).
  • Qres (rescue): P(Q correct ∣ D wrong); of examples where Datology fails, the fraction Qwen3.5 rescues (%).
  • Qmis (misled): P(Q wrong ∣ D correct); of examples where Datology succeeds, the fraction Qwen3.5 loses (%).
  • Regime: the §M.6 regime classification with a 2-pp threshold on Q−D and M−D.

Table D.1: Per-benchmark mean accuracy (D, M, Q), pairwise deltas (Q−D, M−D), conditional rescue and misled rates (Qres, Qmis), and regime label, at both scales.

Patterns visible at the benchmark level that the per-capability roll-up smooths over.

  • PixMo Points is the largest single source of Qmis: 73–78% at both scales, by far the highest rates in the 36-cell grid. Chain-of-thought reasoning on a structured <point x y> output format is a near-pure cost: Q discards correct pointing answers it already knows. This sits behind the Referring & Grounding group’s overall trajectory.
  • The Reasoning-helps regime collapses with scale. At 2B, eight benchmarks land in the “Reasoning helps; generic verbose hurts” or “Reasoning helps” categories; at 4B, only two do (CAPability and OCRBench). The shrinking-Q-wins story is visible per-benchmark, not just per-capability.
  • The captioning split is two benchmarks pulling in opposite directions at both scales: CAPability (recall-style) puts Q ahead by +13.1 / +6.7 pp; DetailCaps (precision-style at threshold 0.7) puts D ahead by +10.2 / +3.2 pp. The Captioning group’s averaged “Mixed” verdict is structurally this split, not a noisy null result.
  • Qres is high almost everywhere (above 25% in 32 of 36 cells; above 40% in 19 of 36), confirming the contingency-view observation from §3.4 that Q and D have substantially different error sets across nearly every (scale, benchmark) pair. The direction of the mean-accuracy comparison turns on D’s accuracy floor, not on how often Q and D reach different answers.

Pool-wide tokens per correct.

Table D.2 is the per-model tokens-per-correct ranking behind the OckScore figure (Figure 2) and the §3.1 cost claim.

Table D.2: Tokens per correct answer (mean tokens / macro accuracy) across the 1B–4B open-weight pool, per the OckBench convention. The curated models hold the low-token frontier (47–64 tokens per correct) at competitive accuracy; the most verbose comparator, Qwen3.5-4B, spends ~40× more per correct answer.

Length non-overlap: Datology 4B vs Qwen3.5-4B.

Table D.3 is the positivity check behind §3.3: the two models almost never operate at comparable output lengths, so a length-controlled comparison of the pair is not identified.

Table D.3: Datology 4B and Qwen3.5-4B have almost no length overlap. Qwen3.5-4B is never observed at a concise operating point in our data; the positivity assumption a length-controlled comparison requires does not hold for this pair.

Confidence intervals for the frontier AMEs.

95% confidence intervals for the per-comparator length-controlled AMEs in Table 4.

Table D.4: 95% confidence intervals for the frontier-pool length-controlled AMEs (companion to Table 4).

Per-capability triple dissociation (full table).

Table D.5 is the per-capability companion to Figure 3: mean accuracy and the dissociation deltas at both scales, with the regime label per cell.

Table D.5: Per-capability triple dissociation. D = Datology, M = Mammoth, Q = Qwen3.5 at the matched scale. Numbers are mean per-example correctness (per-example score thresholded at 0.5 for 17 evals; LLM-judge precision thresholded at 0.7 for DetailCaps; seeds majority-pooled). Within-capability values are benchmark-averaged (each eval weighted equally). Regime labels follow the §M.6 rules with a 2-pp threshold.

Per-example contingency: rescue and misled rates.

Per (benchmark, scale, example) each model’s binary correctness sits in one cell of $(D, M, Q) \in \{0,1\}^3$; the marginal is mean accuracy, but the conditional rates expose structure the marginal hides. For the (D, Q) contingency we report the rescue rate $P(Q\ \text{correct} \mid D\ \text{wrong})$ (of D’s misses, the fraction Q gets right) and the misled rate $P(Q\ \text{wrong} \mid D\ \text{correct})$ (of D’s hits, the fraction Q loses). The mean-accuracy gap follows from them by the identity

$$Q_{\text{acc}} - D_{\text{acc}} = \underbrace{Q_{\text{rescue}} \times (1 - D_{\text{acc}})}_{\text{lift on D's misses}} - \underbrace{Q_{\text{misled}} \times D_{\text{acc}}}_{\text{loss on D's hits}}.$$

Table D.6: Per-capability rescue and misled rates at 4B. Accuracies and conditional rates are example-pooled within capability (each example weighted equally), so they differ slightly from Table D.5’s benchmark-pooled accuracies; the qualitative ordering is unchanged.

On 7 of 8 capability groups Qwen3.5-4B’s rescue rate exceeds its misled rate, yet on 5 of those mean accuracy still says Q ≤ D: when $D_{\text{acc}}$ is high the “D-correct” base is large and the “D-wrong” base is small, so even a small misled rate outweighs a large rescue rate. Counting at 4B is the worked case: $D_{\text{acc}} = 0.939$, $Q_{\text{rescue}} = 53.3\%$, $Q_{\text{misled}} = 7.8\%$, giving a lift of 53.3% × 0.061 = +3.3 pp and a loss of 7.8% × 0.939 = −7.3 pp, net −4.1 pp (computed before rounding the two components), exactly the mean-accuracy gap. The same rates with $D_{\text{acc}} = 0.65$ would instead yield a +13.6 pp gain; only the population mix changed.
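The worked case can be reproduced in a few lines. This is a sketch of the identity above using the rates quoted in the text; the function name is hypothetical.

```python
def accuracy_gap(d_acc, q_rescue, q_misled):
    """Return (lift, loss, net) for Q_acc - D_acc, all as fractions."""
    lift = q_rescue * (1 - d_acc)  # gain on examples D gets wrong
    loss = q_misled * d_acc        # loss on examples D gets right
    return lift, loss, lift - loss

# Counting at 4B, rates from the text.
lift, loss, net = accuracy_gap(d_acc=0.939, q_rescue=0.533, q_misled=0.078)

# Same conditional rates, lower hypothetical D accuracy.
_, _, net_low_floor = accuracy_gap(d_acc=0.65, q_rescue=0.533, q_misled=0.078)

net_pp = round(net * 100, 1)
net_low_floor_pp = round(net_low_floor * 100, 1)
```

Holding the rescue and misled rates fixed and changing only `d_acc` flips the sign of the gap, which is the population-mix point made above.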

Notes

  1. Thanks, Jevon
  2. That the most cost-exposed players, the closed-weight labs paying their own inference bills, are also the most token-efficient is itself a tell that brevity matters most to whoever hosts the model.
  3. Bayesian amortized inference has a precise technical meaning (training a network to approximate an intractable posterior), and we are not doing exactly that; the analogy is to the direction of the work, not the formalism.
  4. Qwen3.5-4B is a reasoning-style model whose default inference setting emits long chain-of-thought on the math, chart, and counting benchmarks in the suite. Disabling reasoning mode or capping output tokens would lower this mean at a likely non-trivial accuracy cost on the reasoning-heavy subset; the value reported here is the model’s default-configuration behavior, which is what the inference bill reflects on a deployed pipeline.

References


[1]Singhal, Prasann, Goyal, Tanya, Xu, Jiacheng, Durrett, Greg. "A Long Way to Go: Investigating Length Correlations in RLHF." *Conference on Language Modeling (COLM)* (2024) Link
[2]Dubois, Yann, Galambosi, Balázs, others. "Length-Controlled AlpacaEval." *arXiv preprint arXiv:2404.04475* (2024) Link
[3]Chen, Lichang, others. "Length-Controlled Reward Modeling and the Verbosity Bias in Preference Data." *arXiv preprint arXiv:2409.17407* (2024) Link
[4]Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V., Norouzi, Mohammad, others. "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." *arXiv preprint arXiv:1609.08144* (2016) Link
[5]Murray, Kenton, Chiang, David. "Correcting Length Bias in Neural Machine Translation." *Proceedings of the Third Conference on Machine Translation (WMT)* (2018) Link
[6]Holtzman, Ari, Buys, Jan, Du, Li, Forbes, Maxwell, Choi, Yejin. "The Curious Case of Neural Text Degeneration." *International Conference on Learning Representations (ICLR)* (2020) Link
[7]Welleck, Sean, Kulikov, Ilia, Roller, Stephen, Dinan, Emily, Cho, Kyunghyun, Weston, Jason. "Neural Text Generation with Unlikelihood Training." *International Conference on Learning Representations (ICLR)* (2020) Link
[8]Park, Ryan, Rafailov, Rafael, Ermon, Stefano, Finn, Chelsea. "Disentangling Length from Quality in Direct Preference Optimization." *arXiv preprint arXiv:2403.19159* (2024) Link
[9]Du, Zheng, Kang, Hao, Han, Song, Krishna, Tushar, Zhu, Ligeng. "OckBench." *arXiv preprint arXiv:2511.05722* (2025) Link
[10]Kaiser, Daniel, Frigessi, Arnoldo, Ramezani-Kebrya, Ali, Ricaud, Benjamin. "Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs." *arXiv preprint arXiv:2602.09805* (2026) Link
[11]Kaiser, Daniel, Frigessi, Arnoldo, Ramezani-Kebrya, Ali, Ricaud, Benjamin. "CogniLoad." *arXiv preprint arXiv:2509.18458* (2025) Link
[12]Gundlach, Hans, Lynch, Jayson, Mertens, Matthias, Thompson, Neil. "The Price of Progress: Price Performance and the Future of AI." *arXiv preprint arXiv:2511.23455* (2025) Link
[13]Erol, Mehmet Hamza, others. "Cost-of-Pass: An Economic Framework for Evaluating Language Models." *arXiv preprint arXiv:2504.13359* (2025) Link
[14]Lee, Seongyun, others. "On the Length-Hallucination Trade-off in Detailed Image Captioning." *International Conference on Learning Representations (ICLR)* (2025) Link
[15]Udandarao, Vishaal, others. "CREPE." *arXiv preprint arXiv:2506.08227* (2025) Link
[16]Jung, Mingi, others. "Visual Attention Never Fades: Selective Progressive Attention Recalibration for Detailed Image Captioning." *arXiv preprint arXiv:2502.01419* (2025) Link
[17]Chen, Dongping, others. "MLLM." *arXiv preprint arXiv:2402.04788* (2024) Link
[18]Gershman, Samuel J., Goodman, Noah D.. "Amortized Inference in Probabilistic Reasoning." *Proceedings of the Annual Meeting of the Cognitive Science Society* (2014)
[19]Sutton, Richard S.. "The Bitter Lesson." (2019) Link
[20]Hoffmann, Jordan, Borgeaud, Sebastian, Mensch, Arthur, others. "Training Compute-Optimal Large Language Models." *arXiv preprint arXiv:2203.15556* (2022) Link
[21]DatologyAI Team. "20/20 Vision Language Models: A Prescription for Better VLMs." *arXiv preprint arXiv:2605.11405* (2026) Link
[22]Guo, Jarvis, others. "MAmmoTH-VL." *arXiv preprint arXiv:2412.05237* (2024) Link
[23]Epoch AI. "LLM." (2025) Link
[24]Epoch AI. "LLM." (2025) Link
[25]Brown, Noam. "Report capability against a compute budget, not as a single number." (2026) Link
[26]Pichai, Sundar. "Google I/O 2026 keynote: token-volume growth." (2026) Link
[27]CNBC. "OpenAI." (2026) Link
[28]Altman, Sam. "The world will be capacity-constrained for some time." (2026) Link
[29]Menlo Ventures. "2025: The State of Generative AI." (2025) Link
[30]TechCrunch. "Uber caps employee AI." (2026) Link
[31]The Decoder. "OpenAI." (2026) Link
[32]AOL. "OpenAI." (2026) Link
[33]Nous Research. "Measuring Thinking Efficiency in Reasoning Models: The Missing Benchmark." (2025) Link
[34]OpenAI. "Reasoning effort parameter." (2026)
