
Introducing Luxical Embeddings

Accelerating Data Curation Through A New Technique For Fast, Accurate, “Lexical-Dense” Text Embeddings

Written by

DatologyAI

NOTE: The ArXiv version of this post can be found here.

Executive Summary

Today we release Luxical, a software library we have developed at DatologyAI for creating tiny, ultra-fast “lexical-dense” text embeddings. We also release Luxical-One, a powerful Luxical model designed to turbocharge English-language text curation at scale. These new technologies, which we are sharing with the community under the permissive Apache 2.0 license, enable order-of-magnitude accelerations to many common data curation workloads, including clustering, classification, and semantic deduplication. In this blog post we discuss our motivation for inventing Luxical, demonstrate the positive impact it has on our state-of-the-art data curation platform through two illustrative experiments, and walk through the design decisions and implementation details that make Luxical so effective.

Introduction

Luxical marries the efficiency of lexical (word-based) text processing to the flexibility and end-to-end trainability of dense neural networks to deliver a “best of both worlds” combination of speed and representational accuracy well-suited to many practical workloads. Luxical models can process millions of tokens per second on a laptop, smoothly power through arbitrarily long documents, and produce rich enough representations to keep up with popular transformer-based language models on numerous mission-critical data curation tasks.

We built Luxical specifically to help us organize and process web-scale datasets, meeting the needs of demanding customers like Arcee in curating trillions of high-quality pretraining tokens from tens of trillions of source tokens. By contrast, much recent embedding research has focused on claiming the top spot on jack-of-all-trades benchmarks like the MTEB. While much progress has been made on improving MTEB scores, leaderboard-topping models are often a bad fit for web-scale data processing pipelines, as they typically focus on fine-grained relevance rankings, short-query understanding, and high-score-at-all-costs scaling up of model size – all valuable qualities for certain text embedding applications, but ones that tend to be less impactful for web-scale data organization tasks.

Our mission at DatologyAI is to curate the world’s best foundation-scale training data. As such, we’re less interested in maximizing MTEB performance and more interested in a different set of needs: achieving high throughput, accurately capturing symmetrical (document-to-document) relationships, nailing coarse-grained organization (e.g. separating teaching materials from videogame reviews), and gracefully accepting very long documents from the long tail of webcrawl data. Since Luxical is designed to deliver what we need in practice at DatologyAI, we believe this new software tool will be of broad interest to practitioners working with web-scale data.

Luxical In Action: Data Curation Impact

Every LLM training run bears a hefty price tag in compute alone, even before considerations like experimental iteration, development ergonomics, and infrastructural unreliability act as cost multipliers. Luckily, data curation can have a big impact on model quality. Much of web data is redundant, out-of-distribution with regard to end tasks, or harmful, and learning from it wastes compute [1, 2]. Conversely, some web data is high quality, and properly leveraging this data can serve as a powerful compute multiplier. By curating the right tokens, teams can multiply their investments to train much stronger models.

Unfortunately, web-scale data curation is not a straightforward task, often due to the sheer scale of the problem (even simple problems like image duplicate detection become engineering odysseys at petabyte scale [3]). Luxical is a prime example of the creativity and careful system design needed to scale curation workloads to orders of magnitude greater efficiency. In two illustrative test-cases below, we will show how Luxical’s unique design delivers massive speedups and orders-of-magnitude cost reductions across multiple curation workflows without degrading curation quality. Not only does this breakthrough make our platform faster and more cost efficient for our customers, it also leads to big wins in internal research velocity, accelerating the rate at which we make our technology better.

Test Case 1: Matching Similar Documents

In data curation, it is often important to group documents by semantic similarity (during semantic deduplication [4], for example). To test how well various text embedding strategies capture document-to-document semantic similarity, we prepared a text matching task with known correct answers by splitting 50,000 documents from the FineWeb dataset into 100,000 document-halves. Since text chunks from the same document are generally much more semantically similar to one another than they are to other random text chunks sampled from FineWeb, we expect a well-trained embedding model to embed the majority of document halves within the top 1% or so of nearest vectors to their match's embedding vector. To empirically measure how well an embedding model lives up to this expectation, we simply embed all our document halves and count how often same-source chunks are scored in the top x% of similarity to their matches.
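
To make this measurement concrete, here is a minimal sketch of the matching evaluation, assuming a generic embed function that maps a list of texts to L2-normalized vectors (the function and harness shown are illustrative, not the exact code we ran):

import numpy as np

def match_error_rate(first_halves, second_halves, embed, top_frac=0.01):
    # Embed both halves of every document; rows are assumed unit-normalized.
    a = embed(first_halves)   # (n, d)
    b = embed(second_halves)  # (n, d)
    sims = a @ b.T            # (n, n) cosine similarity matrix
    # For each first half, count candidates scoring above the true match.
    match_sims = np.diag(sims)
    ranks = (sims > match_sims[:, None]).sum(axis=1)  # 0 means a top-1 hit
    cutoff = int(top_frac * len(first_halves))
    # Error: fraction of matches falling outside the top `top_frac` window.
    return float((ranks > cutoff).mean())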

In addition to measuring the efficacy of the embedding geometry of each method, we also measure throughput. For the throughput test, we tasked each embedding model with embedding 100,000 complete FineWeb documents on one or both of two different chips: an Apple M4 Max CPU (on a Datologist's laptop) and an NVIDIA A10G GPU (on a g5 AWS EC2 instance).


As we see from the plot above, even with GPU acceleration the Qwen model lags behind Luxical-One by a throughput factor of nearly 100x. While the 60x-smaller MiniLM transformer model runs substantially faster than the Qwen model, its throughput still lags far behind Luxical-One, especially when we consider an apples-to-apples comparison on CPU.

While this extra throughput is impressive, it is important to ensure that the embeddings produced by the model are still fit-for-purpose. We thus turn our attention back to the document-half-matching task at hand.

From this quality evaluation, we see that on all but the top-1 retrieval task, Luxical-One delivers an equal or lower error rate than the much slower MiniLM-L6-v2 model. Moving from the left of the plot to the right, we further see that at coarse-grained windows Luxical-One's error rates approach those of SOTA-on-MTEB models like Qwen3. In data curation retrieval workloads, we often take more than just a few documents to feed our hungry LLMs, so achieving good relevance rankings in the top 0.1% or 1% is where much of our attention goes. Seeing that Luxical performs these coarse-grained rankings competitively with LLM-based embedding models is thus quite a welcome result for us.

Sidenote: Astute readers may have noticed that the 0.01B active parameters of Luxical-One have an asterisk on the quality plot above. This is there because the architectural differences of Luxical models make them an apples-to-oranges comparison against transformer models. Over 90% of Luxical-One's active parameters take the form of a single linear layer, which runs far more efficiently per-parameter than the attention operations in the transformer models.

Test Case 2: Classifier-Based Filtering

Though the document-half-matching task gives us a deep view into the alignment between text semantics and embedding geometry on the kind of webcrawl data we curate in practice, it falls short of measuring the downstream impact of application to actual curation workflows. To give a complete view, we will focus our second test case on a popular and straightforward curation task: text classification for data filtering. Several popular open data curation efforts, including FineWeb-Edu and DataComp-LM (DCLM) [5, 2], have shown that supervised text classifiers can effectively filter out smaller high-quality subsets of documents from larger source corpora, leading to dramatic gains in downstream model performance on certain knowledge and reasoning tasks.

The FineWeb-Edu filter consists of a classification head trained on top of a transformer-based text embedding model, while DCLM employs a lexical FastText classifier. Luxical, being both a lexical method like FastText and an embedding modeling approach like the one underpinning the FineWeb-Edu scorer, is well-suited to applications in this space. To compare Luxical-One against both the FineWeb-Edu and DCLM scorers for curating web data, we train a small Multi-Layer Perceptron (MLP) classifier to map Luxical-One embeddings to a set of quality annotations constructed to fall somewhere between those used by the two baseline methods. We emphasize that the goal in this experiment is not to train a state-of-the-art text scorer, but to compare the performance characteristics of existing scorers to a Luxical-One-based scorer trained using similar data.
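
As a rough illustration of this setup, the sketch below trains a small MLP head on precomputed Luxical-One embeddings with scikit-learn; the file names, labels, and hyperparameters are placeholders rather than our production configuration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholders: (n, 192) precomputed Luxical-One embeddings plus quality labels.
X = np.load("luxical_one_embeddings.npy")
y = np.load("quality_labels.npy")
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)

# The embedding model stays frozen, so only this tiny head needs training.
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))

# At filtering time, rank documents by predicted quality and keep the top slice.
quality_scores = clf.predict_proba(X_val)[:, -1]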

As before, we begin by looking at the throughput. When taking into account the regex substitutions typically required when running FastText on webcrawl data, FastText classification throughput actually falls slightly short of the throughput achieved by feeding Luxical-One embeddings into a tiny MLP. Trailing far behind these two lexical methods is the FineWeb-Edu scorer, clocking in at less than a tenth the throughput despite GPU acceleration.

Is the slower throughput of the transformer model worth it? Is Luxical-One with a tiny MLP as effective as an end-to-end trained FastText classifier? To answer these questions we filter a 600B-token random subset of FineWeb into 60B-token subsets by selecting the top-scoring documents according to each of the quality classifiers. Training a 3B parameter dense transformer on each of these curated datasets, we see that small fast classifiers like FastText and Luxical deliver downstream model quality just as high as the quality delivered by the larger, slower FineWeb-Edu scorer.

In the chart above, we see that across an array of several filter-sensitive benchmarks, the specifics of the quality filters studied have a muted impact relative to the large impact of using a filter at all (despite both algorithm and training data varying filter-to-filter). This outcome aligns with recent research highlighting that quality filtering is as much a task of filtering out a small subset of low-quality data as it is of pulling in a small subset of high-quality data [6]. These results suggest that in practice the best choice of text classification pipeline is the one which allows for the fastest runtime and researcher iteration speed.

How Luxical Works Under The Hood

Architecture

Although lexical methods of natural language processing (those derived from the analysis and processing of word statistics) receive less hype these days than transformer neural networks, lexical methods still rightfully underpin many important real-world applications due to their incredible efficiency and reliability. Inspired by the successes that the FastText library has delivered to numerous projects in the data curation space, Luxical also adopts a bag-of-ngrams featurization as the first step of processing each text document. For greater modularity and precision, Luxical integrates with the tokenizers package to support a variety of tokenizers instead of relying on a fast-but-strict whitespace-based tokenization strategy like FastText. And instead of using the hashing trick [7] to approximate ngram counts, Luxical calls for a user-provided set of ngrams up front, allowing users to mine, prune, and customize the featurization offline before training the model.
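
For intuition, the sketch below counts token n-grams against a fixed, user-provided vocabulary using a Hugging Face tokenizer; the helper and the ngram_to_index mapping are illustrative stand-ins for Luxical's actual featurization code:

from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def bag_of_ngrams(text, ngram_to_index, max_n=5):
    # Tokenize once, then slide windows of width 1..max_n over the token ids.
    ids = tokenizer.encode(text, add_special_tokens=False)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(ids) - n + 1):
            gram = tuple(ids[i:i + n])
            # Only n-grams present in the user-provided vocabulary become features.
            if gram in ngram_to_index:
                counts[ngram_to_index[gram]] += 1
    return counts  # sparse {feature_index: count} representation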

From a bag-of-ngrams feature vector, how do we move to a dense embedding vector? As numerous other works (FastText, Static Embeddings, and Model2Vec [8, 9, 10], and likely more) have shown, a weighted pooling of learned vocabulary embeddings is a powerful trick. In mathematical terms, this is just a linear projection, though when implementing this operation we take pains to exploit the sparsity of this projection to make this matrix multiplication computationally efficient (for Luxical, we wrote a custom CPU kernel in numba to make this operation go brrr). If you look closely at the diagram above, you’ll see Luxical also pre-scales the bag-of-ngrams representation using log-scaled inverse document frequency, i.e. using the Term-Frequency Inverse-Document-Frequency (TF-IDF) trick to account for the skew of overall term frequencies in natural language. In practice we found that this trick keeps the scale of learned parameters well-regulated, helping smooth and accelerate training.
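
In spirit, the sparse-to-dense step looks like the following sketch, with scipy standing in for the custom numba kernel described above:

import scipy.sparse as sp

def sparse_to_dense(counts_batch, idf, embedding_table):
    # counts_batch: (batch, vocab) CSR matrix of n-gram counts
    # idf: (vocab,) log-scaled inverse document frequencies
    # embedding_table: (vocab, dim) learned n-gram embedding vectors
    weighted = sp.csr_matrix(counts_batch.multiply(idf))  # TF-IDF scaling, stays sparse
    # Sparse-dense matmul: only the nonzero n-grams touch the embedding table.
    return weighted @ embedding_table  # (batch, dim) dense pooled embeddings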

Unlike prior works such as Model2Vec and Static Embeddings, Luxical supports additional ReLU neural network layers to increase expressive power beyond this initial sparse-to-dense projection. For example, Luxical-One uses additional layers to project up to a higher dimensionality, apply a few projections with nonlinearities, and then project back down to obtain small output vectors. This design balances expressive power, speed, and compact output vectors (which are friendly to storage and memory). In practice, modestly-large matrix multiplications are so efficient on modern hardware (even CPUs) that the additional layers induce only a small decrease in overall throughput once we account for the heavy lifting involved in tokenizing the input text, so the additional flexibility and output compression afforded by Luxical can be almost cost-free in terms of throughput.
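
As a rough sketch of that shape (the layer widths here are illustrative, not Luxical-One's exact configuration):

import torch.nn as nn

# Up-project the pooled n-gram embedding, mix through ReLU layers,
# then compress down to a small, storage-friendly output vector.
dense_head = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 192),
)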

Training

Luxical follows an end-to-end contrastive knowledge distillation training process that maximizes the amount of fine-grained useful information provided in the target labels and eliminates the need for labeled training data in your domain of interest – a large well-performing teacher model is all you need!

Inspired by the tremendous advances in LLMs over the past few years, we designed Luxical to leverage the benefits of large transformer models as much as possible while retaining lightning-fast speed. To this end, rather than running end-to-end training on standard text embedding datasets like the Static Embeddings project, or post-processing the result of forward-passing individual terms through a pretrained large transformer embedding model like the Model2Vec project, Luxical performs end-to-end learning via knowledge distillation from a teacher model. This design also makes training simple to set up: just sample documents from the domain you want the model to embed well, embed these using a large teacher model, and then leave your laptop running for a few hours to distill the information in the teacher embeddings into your new Luxical model. No labeled data needed; you can just recycle an LLM training corpus!

For maximum flexibility, Luxical implements embedding knowledge distillation through a distribution-matching KL-divergence loss that penalizes the divergence of the student's item-to-item similarity matrix from the one induced by the teacher's embeddings (formally, these similarity matrices are Gram matrices). While other distillation works like Stella/Jasper from Dun Zhang [11] and LEAF from MongoDB research [12] have found success in directly aligning student and teacher vectors through an L2 loss, the divergence loss allows for different dimensionalities between student and teacher and enables fine-grained control over temperature scaling. In training Luxical-One, we used a smaller student embedding dimension for more compressible embeddings and selected a high-temperature KL-divergence loss to encourage the model to favor coarse-grained relevance calibration over fine-grained ranking.
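
A minimal PyTorch sketch of this Gram-matrix distillation loss, assuming batches of L2-normalized student and teacher embeddings (the temperature handling is illustrative):

import torch.nn.functional as F

def gram_kl_loss(student, teacher, temperature=1.0):
    # Student (batch, d_s) and teacher (batch, d_t) may differ in dimension;
    # only their (batch, batch) similarity matrices are compared.
    s_gram = student @ student.T / temperature
    t_gram = teacher @ teacher.T / temperature
    # Each row becomes a distribution over the batch; higher temperatures
    # flatten the targets, favoring coarse calibration over fine ranking.
    return F.kl_div(
        F.log_softmax(s_gram, dim=-1),
        F.softmax(t_gram, dim=-1),
        reduction="batchmean",
    )

In a training loop, student would be the Luxical model's output on a batch of documents and teacher the precomputed embeddings of the same batch from the teacher model.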

Luxical-One: Letting Our First Luxical Model Loose

To make it easy for the community to dive in and benefit from the advances of Luxical modeling, we have released a powerful English-language Luxical model, Luxical-One, under the Apache 2.0 license. It's available on Huggingface, and getting started with Luxical-One is as easy as:


# Setup: `pip install luxical`

from transformers import AutoModel

# Luxical-One ships as custom model code on the Hugging Face Hub,
# so loading it through AutoModel requires trust_remote_code=True.
luxical_one = AutoModel.from_pretrained("datologyai/luxical-one", trust_remote_code=True)

# The model takes a list of texts and returns embeddings as a numpy ndarray.
example_text = "Luxical-One is fast and integrates easily with Huggingface."
embedding_ndarray = luxical_one([example_text]).embeddings

At a glance, this model:

  • Adopts the BERT uncased tokenizer
  • Leverages a vocabulary of 2 million of the most common 5-grams in FineWeb
  • Outputs compact 192-dimensional embedding vectors
  • Was trained for 3 epochs over a sample of 50 million FineWeb documents
  • Utilized the snowflake-arctic-embed-m-v2.0 model as the teacher model

Since Luxical-One delivers high-quality, high-throughput embeddings of English web documents on just laptop compute, it takes only minutes to try it on your next text analysis project. We encourage you to give it a spin yourself and share your experience!
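
For instance, the embeddings from the snippet above drop straight into a similarity check (this assumes the .embeddings ndarray interface shown earlier, and normalizes defensively in case the vectors are not pre-normalized):

import numpy as np

texts = [
    "How to bake sourdough bread at home.",
    "A beginner's guide to baking bread in your kitchen.",
    "Quarterly earnings report for a semiconductor company.",
]
emb = luxical_one(texts).embeddings  # (3, 192) ndarray

# Cosine similarities: the two baking documents should score far above the third.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(emb @ emb.T)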

The Future Of Luxical

At DatologyAI we have already shifted multiple curation workflows over to Luxical embeddings, and we are actively using it to accelerate and improve our ongoing research efforts. Our curation technology already delivers state-of-the-art results across language and image-language curation, and it is getting more powerful every day through advances like Luxical. If you are interested in harnessing the power of our data curation to improve your models, get in touch! We are also actively hiring for a number of technical roles, so if building the future of AI through data-focused research and web-scale data processing pipelines excites you, check out our careers page!

References


[1]Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S. Morcos. "Beyond neural scaling laws: beating power law scaling via data pruning." (2023) Link
[2]Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, Vaishaal Shankar. "DataComp-LM: In search of the next generation of training sets for language models." (2025) Link
[3]Justin Gage. "Datology's distributed pipelines for handling PBs of image data." (2025) Link
[4]Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, Ari S. Morcos. "SemDeDup: Data-efficient learning at web-scale through semantic deduplication." arXiv preprint arXiv:2303.09540 (2023) Link
[5]Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." (2024) Link
[6]Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin. "The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining." (2025) Link
[7]Wikipedia. "Feature hashing." Wikipedia, The Free Encyclopedia (2025) Link
[8]Tom Aarsen. "Train 400x faster Static Embedding Models with Sentence Transformers." (2025) Link
[9]Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov. "Bag of Tricks for Efficient Text Classification." arXiv preprint (2016)
[10]Stephan Tulkens, Thomas van Dongen. "Model2Vec: Fast State-of-the-Art Static Embeddings." (2024) Link
[11]Dun Zhang, Jiacheng Li, Ziyang Zeng, Fulong Wang. "Jasper and Stella: Distillation of SOTA Embedding Models." arXiv preprint (2024) Link
[12]Robin Vujanic, Thomas Rueckstiess. "LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations." arXiv preprint (2025) Link
