Introduction to Trimming ✂

Trimming is a simple method that requires no retraining and runs on an ordinary CPU.

direct_quote · stated · engineering-technology · May 28, 2026

Trimming produces a lighter model than the original while maintaining its performance.

A multilingual model's vocabulary may be larger than a given use case needs, since the tokens of unnecessary languages can be removed.

A model's vocabulary size may be suboptimal if it is not a multiple of 8 or 64, as these multiples are preferred for optimizing GPU usage.

The authors reveal 5,526 models resulting from the application of the trimming technique.

A French version of the blog post is available.

Trimming can be seen as a subset of pruning.

The goal of trimming is to modify/remove model weights to ultimately make it lighter.

Trimming focuses exclusively on the parts of the architecture related to vocabulary.

For trimming, tokens are removed from the model's original vocabulary, and the tokenizer must also be updated.

Trimming modifies the final (output) embedding layer, which produces the probability distribution over the model's vocabulary, and also the input embedding layer if the embeddings are tied.
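
To make the mechanics concrete, here is a minimal toy sketch of the idea, not the authors' implementation. It uses plain Python lists in place of real weight tensors, and the function name `trim_embeddings` and the tiny vocabulary are hypothetical. It keeps only the embedding rows of the retained tokens and rebuilds a contiguous token-to-id mapping, which is the part the tokenizer update has to mirror.

```python
# Toy illustration only: real models store embeddings as tensors, not lists.
def trim_embeddings(embeddings, vocab, keep_tokens):
    """Keep only the rows for `keep_tokens` and rebuild a contiguous vocabulary.

    embeddings: list of rows, one row per old token id
    vocab: dict mapping token string -> old token id
    keep_tokens: iterable of token strings to retain
    """
    # Sort kept tokens by their old id so their relative order is preserved.
    kept = sorted(set(keep_tokens), key=lambda tok: vocab[tok])
    # New ids are contiguous: 0, 1, 2, ...
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    # Copy the matching rows in the new id order.
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_embeddings, new_vocab


# Hypothetical 5-token vocabulary with 2-dimensional embeddings.
vocab = {"<pad>": 0, "hello": 1, "bonjour": 2, "world": 3, "monde": 4}
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

new_embeddings, new_vocab = trim_embeddings(
    embeddings, vocab, ["<pad>", "bonjour", "monde"]
)
print(new_vocab)
print(new_embeddings)
```

When embeddings are tied, the same trimmed matrix serves as both the input and the output layer, so a single slicing step covers both.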

Optimizing vocabulary size to a multiple of 8 or 64 can speed up model training by 25% according to Karpathy's observations.

Since 2023-2024, models generally use a multiple of 8 or 64 for vocabulary by default, but older models may still benefit from modification.

7 of the 16 models tested in this work had a vocabulary size that was not a multiple of 8 or 64.

Reducing vocabulary size reduces model size in terms of both number of parameters and memory size.

The GPT2-small model by RADFORD, WU et al. (2019) has 124,439,808 unique total parameters.

The embedding layer (wte.weight) of GPT2-small has 38,597,376 parameters and a size of [50257, 768].
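
As a quick check of this figure, the parameter count of an embedding matrix is its number of rows times its number of columns. The sketch below recomputes it for the stated shape [50257, 768]. The 4 bytes per parameter (float32 storage) is an assumption used only to illustrate the memory side; the source does not specify a storage format.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a [vocab_size, hidden_size] embedding matrix."""
    return vocab_size * hidden_size


BYTES_PER_PARAM = 4  # assumption: float32 storage

wte_params = embedding_params(50257, 768)
wte_mib = wte_params * BYTES_PER_PARAM / 2**20

print(f"{wte_params:,} parameters")
print(f"{wte_mib:.1f} MiB at float32")
```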

The wte.weight layer of GPT2 represents 28.17% of the total model size.

GPT2's vocabulary size of 50,257 is not a multiple of 64, making it a candidate for trimming.
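
The divisibility check is easy to reproduce. The helpers below are a hypothetical sketch (the names `round_down_to_multiple` and `round_up_to_multiple` are not from the source) showing how one might test a vocabulary size and find the nearest multiples of 64 on either side.

```python
def round_down_to_multiple(n: int, base: int) -> int:
    """Largest multiple of `base` that is <= n."""
    return (n // base) * base


def round_up_to_multiple(n: int, base: int) -> int:
    """Smallest multiple of `base` that is >= n."""
    return -(-n // base) * base


gpt2_vocab = 50257

# Non-zero remainders mean the size is not aligned to 8 or 64.
print(gpt2_vocab % 8, gpt2_vocab % 64)

# Nearest aligned sizes below and above the current vocabulary size.
print(round_down_to_multiple(gpt2_vocab, 64), round_up_to_multiple(gpt2_vocab, 64))
```

Trimming can only shrink the vocabulary, so only targets at or below the rounded-down value are reachable this way; a target such as 32,768 (512 × 64) is one possible choice among many.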

Reducing GPT2's vocabulary from 50,257 to 32,768 tokens (512 × 64) reduces parameters by 13,431,552.

This reduction in GPT2's vocabulary results in a 10.79% reduction in total parameters, from 124,439,808 to 111,008,256.
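
The arithmetic behind these GPT2 figures can be reproduced in a few lines. The sketch assumes the embedding matrix is counted once (tied input and output embeddings, consistent with the "unique" parameter count quoted above) and that each removed token drops exactly one row of width 768.

```python
HIDDEN_SIZE = 768
OLD_VOCAB = 50257
NEW_VOCAB = 512 * 64          # 32,768 tokens
TOTAL_PARAMS = 124_439_808    # unique parameters of GPT2-small, as stated

# Each removed token deletes one row of HIDDEN_SIZE weights.
removed = (OLD_VOCAB - NEW_VOCAB) * HIDDEN_SIZE
new_total = TOTAL_PARAMS - removed
reduction_pct = 100 * removed / TOTAL_PARAMS

print(f"removed:   {removed:,}")
print(f"new total: {new_total:,}")
print(f"reduction: {reduction_pct:.2f}%")
```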

The article analyzed 16 models covering different architectures and modalities, including text encoders, text encoder-decoders, text decoders, text embedding models, visual embedding models, and text/visual encoder-decoders (VLM).

Geotrend's `smaller-transformers` library is for trimming mBERT, the multilingual version of BERT by DEVLIN et al. (2018).

Trimming is particularly interesting for multilingualism.

David DALE (2021) demonstrated how to trim an mT5 model to retain only English and Russian in a Medium article.

Aditya SRIVASTAVA developed the `hf-trim` library (2022), which claims to support mT5 and mBART but has practical limitations, including inability to control the desired final vocabulary size.

`lm-vocab-trimmer` (2023) by USHIO, ZHOU and CAMACHO-COLLADOS handles mT5, mBART, and XLM-RoBERTa, and is probably the most advanced library on the subject.

`lm-vocab-trimmer` has weaknesses, such as its `target_vocab_size` argument not giving expected results for multiples of 64, and inability to reduce to 'n' languages.

All models supported by `lm-vocab-trimmer` rely on a sentencepiece tokenizer by KUDO and RICHARDSON (2018).

Antoine LOUIS (2024) proposes trimming already fine-tuned embedding models like mE5, BGE, or GTE via a Hugging Face Space.

Antoine LOUIS's trimming method does not allow selecting the desired number of tokens in the final vocabulary and only supports 6 languages.

Existing trimming tools primarily focus on models based on sentencepiece tokenizers, encoders, or encoder-decoders, and make it difficult or impossible to manage the desired vocabulary size.

This work aims to handle models based on other tokenizers (e.g., BPE), other architectures, other modalities, and to allow choosing the size of the new vocabulary.

The models were tested on a diversity of languages with independent evaluators whose language does not necessarily use the Latin alphabet (Korean, Tamil, Arabic among others).

The `smaller-transformers` approach is limited to mBERT and its vocabularies are of different sizes depending on the language, never a multiple of 64.

The work on trimming was a collaboration between various Hugging Face Fellows to evaluate the approach on languages other than English.

Loïck BOURDOIS, Tom AARSEN, Bram VANROY, Christopher AKIKI, Woojun JUNG, Manuel ROMERO, and Prithiv SAKTHI were the Hugging Face Fellows involved in the trimming work.

The company AlphaEdge allowed Loïck BOURDOIS to carry out the trimming work during professional time.

The blog article's apparent length is greatly overestimated because of its numerous results tables, references, and example texts.

ModernCamemBERT has a more optimized number of tokens per word (1.42), compared with 1.58 for the trimmed version.

The distilled model distilCamemBERT almost matches the original mmBERT model's performance.

A model trimmed to 67,476 tokens, which has the same 67.5M parameters as distilCamemBERT, performs on par with the distilled model.

The distilled version of CamemBERT ran for 18 days on a Titan RTX GPU.

The trimmed version was obtained in 8 min 56s on an Intel Core Ultra 7 255H CPU.

The trimmed model handles a context size 16 times longer than the distilled model.

Trimming can be more worthwhile than distillation, and is recommended when the desired parameter reduction is equivalent.

For English, the trimmed mmBERT small with 54.8M parameters is 18.2% smaller than DistilBERT.

The trimmed mmBERT small handles 8,192 tokens compared to 512 for DistilBERT.

DistilBERT required approximately 90 hours of computation on 8 V100 16GB GPUs.

The trimmed mmBERT small took 9 min 01s on an Intel Core Ultra 7 255H CPU for the entire process.

Trimming delivers performance comparable to distillation while being vastly less costly, running in minutes on a CPU versus days on a GPU.
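
To give a rough sense of scale, the durations quoted above can be turned into wall-clock ratios. This is only an illustrative comparison under a strong simplification: it divides elapsed times measured on different hardware (GPU versus CPU) and ignores that DistilBERT used 8 GPUs in parallel, so it should not be read as a cost or energy estimate.

```python
from datetime import timedelta

# Durations as stated in the claims above.
distil_camembert = timedelta(days=18)                  # Titan RTX GPU
trimmed_french = timedelta(minutes=8, seconds=56)      # Intel Core Ultra 7 255H CPU
distilbert = timedelta(hours=90)                       # 8 x V100 16GB GPUs
trimmed_mmbert_small = timedelta(minutes=9, seconds=1) # Intel Core Ultra 7 255H CPU

# Dividing two timedeltas yields a plain float ratio.
french_ratio = distil_camembert / trimmed_french
english_ratio = distilbert / trimmed_mmbert_small

print(f"French:  distillation took about {french_ratio:.0f}x longer (wall clock)")
print(f"English: distillation took about {english_ratio:.0f}x longer (wall clock)")
```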

The execution time to obtain a monolingual model from a multilingual one using trimming ranged from 9 to 22 minutes for the entire process.

It is advisable to mine tokens on the smallest model of a family and save the distribution of frequent tokens in a cache to generate larger models more efficiently.
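
One way this caching advice might look in practice is sketched below. Everything here is hypothetical: the function names, the JSON cache file, and the whitespace tokenizer standing in for a real one. It also assumes that all models in the family share the same tokenizer, which is what would make a cached frequency table reusable across sizes.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_tokens(texts, tokenize):
    """Count how often each token occurs across a corpus."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def save_counts(counts, path):
    """Persist the frequency table so it can be reused for other models."""
    Path(path).write_text(json.dumps(counts), encoding="utf-8")


def load_counts(path):
    """Reload a cached frequency table."""
    return Counter(json.loads(Path(path).read_text(encoding="utf-8")))


def most_frequent_tokens(counts, target_size):
    """Select the `target_size` most frequent tokens to keep."""
    return [tok for tok, _ in counts.most_common(target_size)]


# Stand-in corpus and tokenizer (str.split), for illustration only.
texts = ["le chat dort", "le chat mange", "le chien dort"]
cache_path = Path(tempfile.mkdtemp()) / "token_counts.json"

# Mine once (e.g. with the smallest model's tokenizer), then cache.
save_counts(count_tokens(texts, str.split), cache_path)

# Later runs reload the cache instead of re-tokenizing the corpus.
cached = load_counts(cache_path)
print(most_frequent_tokens(cached, 3))
```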

514 models (124 different languages) trimmed from mmBERT are available in a collection.

The mBART model has a vocabulary size that is not a multiple of 64, making trimming particularly well-suited for it.

For encoder-decoders, trimming allows achieving results more or less equivalent to the original model.

Training trimmed encoder-decoder models longer yields a significant gain over the original model due to their smaller size.

392 models (98 different languages) trimmed from mT5 are available in a collection.

104 models (52 different languages) trimmed from mBART are available in a collection.

The e5-NL models were built using the trimming technique.
