Every atomic assertion extracted from the underlying record, ranked by evidence strength.
Trimming is a simple method that requires no retraining and runs on an ordinary CPU.
Trimming produces a lighter model than the original while maintaining its performance.
A model's vocabulary size may be ill-suited if the model is multilingual, as languages unnecessary for a specific use case can be removed.
A model's vocabulary size may be ill-suited if it is not a multiple of 8 or 64, as these multiples are preferred for optimizing GPU usage.
The authors release 5,526 models resulting from the application of the trimming technique.
A French version of the blog post is available.
Trimming can be seen as a subset of pruning.
The goal of trimming is to modify/remove model weights to ultimately make it lighter.
Trimming focuses exclusively on the parts of the architecture related to vocabulary.
For trimming, tokens are removed from the model's original vocabulary, and the tokenizer must also be updated.
Trimming modifies the final layer that produces the probability distribution over the model's vocabulary, and also the input embedding layer if embeddings are tied.
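As an illustration of the two assertions above, here is a minimal sketch in pure Python of what removing tokens implies for a tied embedding matrix: the rows of the retained tokens are kept, and an old-to-new id mapping is produced so the tokenizer can be updated accordingly. The function name `trim_embeddings`, the toy matrix, and the kept ids are all hypothetical; this is not the authors' implementation.

```python
def trim_embeddings(embeddings, keep_ids):
    """Keep only the rows listed in keep_ids.

    Returns the trimmed matrix and a mapping from old token ids to new ones,
    which is what an updated tokenizer would need.
    """
    kept = sorted(set(keep_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    trimmed = [list(embeddings[old]) for old in kept]
    return trimmed, old_to_new


# Toy vocabulary of 6 tokens with 2-dimensional embeddings (made-up values).
embeddings = [[float(i), float(i) * 10.0] for i in range(6)]
keep_ids = [0, 2, 5]  # hypothetical tokens judged useful for the target use case

trimmed, old_to_new = trim_embeddings(embeddings, keep_ids)

# Assumption: embeddings are tied, so the output projection reuses the same matrix.
input_embeddings = trimmed
output_projection = trimmed

print(len(trimmed), old_to_new)
```

When embeddings are not tied, the same row selection would presumably have to be applied to the input and output matrices separately.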
Optimizing vocabulary size to a multiple of 8 or 64 can speed up model training by 25% according to Karpathy's observations.
Since 2023-2024, models generally use a vocabulary size that is a multiple of 8 or 64 by default, but older models may still benefit from modification.
7 of the 16 models tested in this work had a vocabulary size that was not a multiple of 8 or 64.
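The multiple-of-8-or-64 criterion can be checked with a few lines of Python. The helper names `is_aligned` and `largest_aligned_at_most` are hypothetical and only illustrate the arithmetic; how a target size is actually chosen in this work is not specified here.

```python
def is_aligned(vocab_size, multiple=64):
    """True if vocab_size is an exact multiple of `multiple`."""
    return vocab_size % multiple == 0


def largest_aligned_at_most(vocab_size, multiple=64):
    """Largest multiple of `multiple` that does not exceed vocab_size."""
    return (vocab_size // multiple) * multiple


# GPT2's vocabulary size is used as the example value.
print(is_aligned(50257, 8), is_aligned(50257, 64))
print(largest_aligned_at_most(50257))
print(is_aligned(32768))
```

Since trimming only removes tokens, any aligned size at or below the original vocabulary size is a possible target.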
Reducing vocabulary size reduces model size in terms of both number of parameters and memory size.
The GPT2-small model by RADFORD, WU et al. (2019) has 124,439,808 unique total parameters.
The embedding layer (wte.weight) of GPT2-small has 38,597,376 parameters and a size of [50257, 768].
The wte.weight layer of GPT2 represents 28.17% of the total model size.
GPT2's vocabulary size of 50,257 is not a multiple of 64, making it a candidate for trimming.
Reducing GPT2's vocabulary from 50,257 to 32,768 tokens (512 × 64) reduces parameters by 13,431,552.
This reduction in GPT2's vocabulary results in a 10.79% reduction in total parameters, from 124,439,808 to 111,008,256.
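The GPT2 figures above can be reproduced with a short calculation. This sketch assumes, as the stated parameter counts imply, that only the [vocabulary, 768] embedding matrix changes and that it is counted once; the variable names are illustrative.

```python
total_params = 124_439_808   # unique parameters of GPT2-small
hidden_size = 768            # embedding dimension of wte.weight
old_vocab = 50_257
new_vocab = 512 * 64         # 32,768 tokens

# Each removed token removes one row of hidden_size parameters.
removed = (old_vocab - new_vocab) * hidden_size
remaining = total_params - removed
reduction_pct = 100 * removed / total_params

print(removed, remaining, round(reduction_pct, 2))
```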
The article analyzed 16 models covering different architectures and modalities, including text encoders, text encoder-decoders, text decoders, text embedding models, visual embedding models, and text/visual encoder-decoders (VLM).
Geotrend's `smaller-transformers` library is for trimming mBERT, the multilingual version of BERT by DEVLIN et al. (2018).
Trimming is particularly interesting for multilingualism.
David DALE (2021) demonstrated how to trim an mT5 model to retain only English and Russian in a Medium article.
Aditya SRIVASTAVA developed the `hf-trim` library (2022), which claims to support mT5 and mBART but has practical limitations, including inability to control the desired final vocabulary size.
`lm-vocab-trimmer` (2023) by USHIO, ZHOU and CAMACHO-COLLADOS handles mT5, mBART, and XLM-RoBERTa, and is probably the most advanced library on the subject.
`lm-vocab-trimmer` has weaknesses, such as its `target_vocab_size` argument not giving the expected results for multiples of 64, and its inability to reduce a model to 'n' languages.
All models supported by `lm-vocab-trimmer` rely on a sentencepiece tokenizer by KUDO and RICHARDSON (2018).
Antoine LOUIS (2024) proposes trimming already fine-tuned embedding models like mE5, BGE, or GTE via a Hugging Face Space.
Antoine LOUIS's trimming method does not allow selecting the desired number of tokens in the final vocabulary and only supports 6 languages.
Existing trimming tools primarily focus on models based on sentencepiece tokenizers, encoders, or encoder-decoders, and make it difficult or impossible to manage the desired vocabulary size.
This work aims to handle models based on other tokenizers (e.g., BPE), other architectures, other modalities, and to allow choosing the size of the new vocabulary.
The models were tested on a diversity of languages with independent evaluators whose language does not necessarily use the Latin alphabet (Korean, Tamil, Arabic among others).
The `smaller-transformers` approach is limited to mBERT and its vocabularies are of different sizes depending on the language, never a multiple of 64.
The work on trimming was a collaboration between various Hugging Face Fellows to evaluate the approach on languages other than English.
Loïck BOURDOIS, Tom AARSEN, Bram VANROY, Christopher AKIKI, Woojun JUNG, Manuel ROMERO, and Prithiv SAKTHI were the Hugging Face Fellows involved in the trimming work.
The company AlphaEdge allowed Loïck BOURDOIS to carry out the trimming work during professional time.
The blog article's estimated length is greatly inflated by its numerous tables of results, references, and example texts.
ModernCamemBERT has an optimized number of tokens per word (1.42) compared to 1.58 for the trimmed version.
The distilled model distilCamemBERT almost matches the original mmBERT model's performance.
A model trimmed to 67,476 tokens, having the same 67.5M parameters as distilCamemBERT, performs equivalently to distillation.
Distilling CamemBERT took 18 days on a Titan RTX GPU.
The trimmed version was obtained in 8 min 56s on an Intel Core Ultra 7 255H CPU.
The trimmed model handles a context size 16 times longer than the distilled model.
Trimming can be more worthwhile than distillation, and is recommended when the desired parameter reduction is equivalent.
For English, the trimmed mmBERT small with 54.8M parameters is 18.2% smaller than DistilBERT.
The trimmed mmBERT small handles 8,192 tokens compared to 512 for DistilBERT.
DistilBERT required approximately 90 hours of computation on 8 V100 16GB GPUs.
The trimmed mmBERT small took 9 min 01s on an Intel Core Ultra 7 255H CPU for the entire process.
Trimming delivers performance comparable to distillation while being vastly less costly, running in minutes on a CPU versus days on a GPU.
The execution time to obtain a monolingual model from a multilingual one using trimming ranged from 9 to 22 minutes for the entire process.
It is advisable to mine tokens on the smallest model of a family and save the distribution of frequent tokens in a cache to generate larger models more efficiently.
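The caching advice above can be sketched as follows: count token ids once, save the counts to disk, and reload them to pick the most frequent tokens for each model size. The sketch assumes that all models of a family share the same tokenizer, so ids counted with the smallest model remain valid for the larger ones. The toy tokenizer, file name, and helper names are hypothetical and do not reflect the authors' tooling.

```python
import json
import os
import tempfile
from collections import Counter


def count_token_ids(corpus, tokenize):
    """Count how often each token id appears in the corpus."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    return counts


def save_counts(counts, path):
    """Write counts as JSON (JSON object keys must be strings)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({str(k): v for k, v in counts.items()}, f)


def load_counts(path):
    """Read the cached counts back, restoring integer token ids."""
    with open(path, encoding="utf-8") as f:
        return Counter({int(k): v for k, v in json.load(f).items()})


def most_frequent_ids(counts, target_size):
    """Sorted ids of the target_size most frequent tokens."""
    return sorted(tok for tok, _ in counts.most_common(target_size))


# Hypothetical whitespace tokenizer standing in for the family's real tokenizer.
toy_vocab = {"le": 0, "chat": 1, "dort": 2, "the": 3, "cat": 4}


def toy_tokenize(text):
    return [toy_vocab[w] for w in text.split() if w in toy_vocab]


corpus = ["le chat dort", "le chat", "le cat"]
counts = count_token_ids(corpus, toy_tokenize)

cache_path = os.path.join(tempfile.mkdtemp(), "token_counts.json")
save_counts(counts, cache_path)

# Later, for a larger model of the same family: reuse the cache instead of re-mining.
cached = load_counts(cache_path)
keep = most_frequent_ids(cached, 2)
print(keep)
```

The mining step is the only part that reads the corpus, so caching its result means each additional model size only needs the row selection.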
514 models (124 different languages) trimmed from mmBERT are available in a collection.
The mBART model has a vocabulary size that is not a multiple of 64, making trimming particularly well-suited for it.
For encoder-decoders, trimming allows achieving results more or less equivalent to the original model.
Training trimmed encoder-decoder models longer yields a significant gain over the original model due to their smaller size.
392 models (98 different languages) trimmed from mT5 are available in a collection.
104 models (52 different languages) trimmed from mBART are available in a collection.
The e5-NL models were built using the trimming technique.