{
  "kind": "story",
  "slug": "introduction-to-trimming-5340190",
  "id": 1780209819145340190,
  "record_id": 1780205135326704760,
  "headline": "Introduction to Trimming \u2702",
  "summary": "",
  "source": "huggingface-nlp-blog",
  "source_url": "https://huggingface.co/blog/lbourdois/introduction-to-trimming",
  "home_domain": "engineering-technology",
  "claim_type": null,
  "sentiment": "neutral",
  "significance": "medium",
  "claim_count": 114,
  "reading_time_minutes": 10,
  "published_date": "2026-05-28",
  "created_on": "2026-05-31T06:43:38.796448+00:00",
  "claims": [
    {
      "id": 1780209819658946914,
      "text": "Trimming is a simple method that requires no retraining and runs on a simple CPU.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819682712771,
      "text": "Trimming produces a lighter model than the original while maintaining its performance.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819730016321,
      "text": "A multilingual model's vocabulary may be larger than a specific use case needs, since tokens for unnecessary languages can be removed.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819738063209,
      "text": "A model's vocabulary size may be suboptimal if it is not a multiple of 8 or 64, as these multiples are preferred for optimizing GPU usage.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819686985524,
      "text": "The authors release 5,526 models produced by applying the trimming technique.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819646634378,
      "text": "A French version of the blog post is available.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819690697696,
      "text": "Trimming can be seen as a subset of pruning.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819694176357,
      "text": "The goal of trimming is to modify/remove model weights to ultimately make it lighter.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819698540457,
      "text": "Trimming focuses exclusively on the parts of the architecture related to vocabulary.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819702734621,
      "text": "For trimming, tokens are removed from the model's original vocabulary, and the tokenizer must also be updated.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819718335553,
      "text": "Trimming modifies the final embedding layer managing the probability distribution of the model's vocabulary, and the input layer if embeddings are tied.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819746552221,
      "text": "Optimizing vocabulary size to a multiple of 8 or 64 can speed up model training by 25% according to Karpathy's observations.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819750782661,
      "text": "Since 2023-2024, models generally use a multiple of 8 or 64 for vocabulary by default, but older models may still benefit from modification.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819754411422,
      "text": "7 of the 16 models tested in this work had a vocabulary size that was not a multiple of 8 or 64.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819762144366,
      "text": "Reducing vocabulary size reduces model size in terms of both number of parameters and memory size.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819766840316,
      "text": "The GPT2-small model by RADFORD, WU et al. (2019) has 124,439,808 unique total parameters.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819778986080,
      "text": "The embedding layer (wte.weight) of GPT2-small has 38,597,376 parameters and a size of [50257, 768].",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819786001294,
      "text": "The wte.weight layer of GPT2 represents 28.17% of the total model size.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819794905670,
      "text": "GPT2's vocabulary size of 50,257 is not a multiple of 64, making it a candidate for trimming.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819798319828,
      "text": "Reducing GPT2's vocabulary from 50,257 to 32,768 tokens (512 \u00d7 64) reduces parameters by 13,431,552.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819802792623,
      "text": "This reduction in GPT2's vocabulary results in a 10.79% reduction in total parameters, from 124,439,808 to 111,008,256.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819814342812,
      "text": "The article analyzed 16 models covering different architectures and modalities, including text encoders, text encoder-decoders, text decoders, text embedding models, visual embedding models, and text/visual encoder-decoders (VLM).",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819822494167,
      "text": "Geotrend's `smaller-transformers` library is for trimming mBERT, the multilingual version of BERT by DEVLIN et al. (2018).",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819530477035,
      "text": "Trimming is particularly interesting for multilingualism.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819838467069,
      "text": "David DALE (2021) demonstrated how to trim an mT5 model to retain only English and Russian in a Medium article.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819840460825,
      "text": "Aditya SRIVASTAVA developed the `hf-trim` library (2022), which claims to support mT5 and mBART but has practical limitations, including the inability to control the final vocabulary size.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819854822914,
      "text": "`lm-vocab-trimmer` (2023) by USHIO, ZHOU and CAMACHO-COLLADOS handles mT5, mBART, and XLM-RoBERTa, and is probably the most advanced library on the subject.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819858246463,
      "text": "`lm-vocab-trimmer` has weaknesses: its `target_vocab_size` argument does not give the expected results for multiples of 64, and it cannot reduce a model to 'n' languages.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819862811502,
      "text": "All models supported by `lm-vocab-trimmer` rely on a sentencepiece tokenizer by KUDO and RICHARDSON (2018).",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819874970983,
      "text": "Antoine LOUIS (2024) proposes trimming already fine-tuned embedding models like mE5, BGE, or GTE via a Hugging Face Space.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819880268282,
      "text": "Antoine LOUIS's trimming method does not allow selecting the desired number of tokens in the final vocabulary and only supports 6 languages.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819886911858,
      "text": "Existing trimming tools primarily focus on models based on sentencepiece tokenizers, encoders, or encoder-decoders, and make it difficult or impossible to manage the desired vocabulary size.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819902986850,
      "text": "This work aims to handle models based on other tokenizers (e.g., BPE), other architectures, other modalities, and to allow choosing the size of the new vocabulary.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819915560325,
      "text": "The models were tested on a diverse set of languages by independent evaluators, some of whose languages do not use the Latin alphabet (Korean, Tamil, and Arabic, among others).",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819826836465,
      "text": "The `smaller-transformers` approach is limited to mBERT, and the resulting vocabularies vary in size by language and are never a multiple of 64.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819582434099,
      "text": "The work on trimming was a collaboration between various Hugging Face Fellows to evaluate the approach on languages other than English.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819599748083,
      "text": "Lo\u00efck BOURDOIS, Tom AARSEN, Bram VANROY, Christopher AKIKI, Woojun JUNG, Manuel ROMERO, and Prithiv SAKTHI were the Hugging Face Fellows involved in the trimming work.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819614462255,
      "text": "The company AlphaEdge allowed Lo\u00efck BOURDOIS to carry out the trimming work during professional time.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819642907560,
      "text": "The blog article's estimated reading time is greatly overstated because it contains numerous result tables, references, and text examples.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820251984391,
      "text": "ModernCamemBERT has an optimized number of tokens per word (1.42) compared to 1.58 for the trimmed version.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820262083326,
      "text": "The distilled model distilCamemBERT almost matches the original mmBERT model's performance.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820267385407,
      "text": "A model trimmed to 67,476 tokens, having the same 67.5M parameters as distilCamemBERT, performs equivalently to distillation.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820274212841,
      "text": "The distilled version of CamemBERT ran for 18 days on a Titan RTX GPU.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820278945290,
      "text": "The trimmed version was obtained in 8 min 56s on an Intel Core Ultra 7 255H CPU.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820287193893,
      "text": "The trimmed model handles a context size 16 times longer than the distilled model.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820290873645,
      "text": "Trimming can be more worthwhile than distillation, and is recommended when the desired parameter reduction is equivalent.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820302576153,
      "text": "For English, the trimmed mmBERT small with 54.8M parameters is 18.2% smaller than DistilBERT.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820307457994,
      "text": "The trimmed mmBERT small handles 8,192 tokens compared to 512 for DistilBERT.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820314965709,
      "text": "DistilBERT required approximately 90 hours of computation on 8 V100 16GB GPUs.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820323704664,
      "text": "The trimmed mmBERT small took 9 min 01s on an Intel Core Ultra 7 255H CPU for the entire process.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820326980683,
      "text": "Trimming delivers performance comparable to distillation while being vastly less costly, running in minutes on a CPU versus days on a GPU.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820330353653,
      "text": "The execution time to obtain a monolingual model from a multilingual one using trimming ranged from 9 to 22 minutes for the entire process.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820338360173,
      "text": "It is advisable to mine tokens on the smallest model of a family and save the distribution of frequent tokens in a cache to generate larger models more efficiently.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820343324247,
      "text": "514 models (124 different languages) trimmed from mmBERT are available in a collection.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820366125583,
      "text": "The mBART model has a vocabulary size that is not a multiple of 64, making trimming particularly well-suited for it.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820407546067,
      "text": "For encoder-decoders, trimming achieves results roughly equivalent to those of the original model.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820415126897,
      "text": "Training trimmed encoder-decoder models longer yields a significant gain over the original model due to their smaller size.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820419697519,
      "text": "392 models (98 different languages) trimmed from mT5 are available in a collection.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209820429905853,
      "text": "104 models (52 different languages) trimmed from mBART are available in a collection.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    },
    {
      "id": 1780209819935138242,
      "text": "The e5-NL models were built using the trimming technique.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-28"
    }
  ],
  "tags": [
    {
      "id": 17730927179500411,
      "slug": "alpha-organization",
      "name": "Alpha",
      "type": "organization"
    },
    {
      "id": 17730927206563715,
      "slug": "geo-organization",
      "name": "GEO",
      "type": "organization"
    },
    {
      "id": 17733518056319805,
      "slug": "github-organization",
      "name": "GitHub",
      "type": "organization"
    },
    {
      "id": 17733517869622281,
      "slug": "google-organization",
      "name": "Google",
      "type": "organization"
    },
    {
      "id": 17730927799169010,
      "slug": "hugging-face-organization",
      "name": "Hugging Face",
      "type": "organization"
    },
    {
      "id": 17726459355176948,
      "slug": "hugging-face-hub-organization",
      "name": "Hugging Face Hub",
      "type": "organization"
    },
    {
      "id": 17723038993600058,
      "slug": "ibm-organization",
      "name": "IBM",
      "type": "organization"
    },
    {
      "id": 17733541291350092,
      "slug": "ing-organization",
      "name": "ING",
      "type": "organization"
    },
    {
      "id": 17723038993791919,
      "slug": "medium-organization",
      "name": "Medium",
      "type": "organization"
    },
    {
      "id": 17724261268437304,
      "slug": "qwen-organization",
      "name": "Qwen",
      "type": "organization"
    },
    {
      "id": 17723038993713656,
      "slug": "andrej-karpathy-person",
      "name": "Andrej Karpathy",
      "type": "person"
    },
    {
      "id": 17733572036984785,
      "slug": "a-srivastava-person",
      "name": "A. Srivastava",
      "type": "person"
    },
    {
      "id": 17802086243266456,
      "slug": "bram-vanroy-person",
      "name": "Bram Vanroy",
      "type": "person"
    },
    {
      "id": 17802086679239350,
      "slug": "christopher-akiki-person",
      "name": "Christopher Akiki",
      "type": "person"
    },
    {
      "id": 17731272620291828,
      "slug": "david-dali-person",
      "name": "David Dali",
      "type": "person"
    },
    {
      "id": 17733572950783552,
      "slug": "kim-et-al-person",
      "name": "Kim et al.",
      "type": "person"
    },
    {
      "id": 17733546771746238,
      "slug": "lin-et-al-person",
      "name": "Lin et al.",
      "type": "person"
    },
    {
      "id": 17802086059300390,
      "slug": "lo-ck-bourdois-person",
      "name": "Lo\u00efck Bourdois",
      "type": "person"
    },
    {
      "id": 17724187601100123,
      "slug": "manuel-romo-person",
      "name": "Manuel Romo",
      "type": "person"
    },
    {
      "id": 17798441097940386,
      "slug": "tom-aarsen-person",
      "name": "Tom Aarsen",
      "type": "person"
    },
    {
      "id": 17733547445896752,
      "slug": "wang-et-al-person",
      "name": "Wang, et al.",
      "type": "person"
    },
    {
      "id": 17779356841416366,
      "slug": "zhang-et-al-person",
      "name": "Zhang et al.",
      "type": "person"
    },
    {
      "id": 17723038993834764,
      "slug": "artificial-intelligence-topic",
      "name": "Artificial Intelligence",
      "type": "topic"
    },
    {
      "id": 17730981183425062,
      "slug": "distillation-topic",
      "name": "Distillation",
      "type": "topic"
    },
    {
      "id": 17791452102628180,
      "slug": "inference-optimization-topic",
      "name": "Inference Optimization",
      "type": "topic"
    },
    {
      "id": 17730948119041167,
      "slug": "multimodal-ai-topic",
      "name": "Multimodal AI",
      "type": "topic"
    },
    {
      "id": 17791452102923593,
      "slug": "quantization-topic",
      "name": "Quantization",
      "type": "topic"
    }
  ]
}