{
  "kind": "story",
  "slug": "shipping-a-trillion-parameters-with-a-hub-bucket-delta-weigh-4193809",
  "id": 1780210520924193809,
  "record_id": 1780206028780226189,
  "headline": "Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL",
  "summary": "",
  "source": "huggingface-rl-blog",
  "source_url": "https://huggingface.co/blog/delta-weight-sync",
  "home_domain": "engineering-technology",
  "claim_type": null,
  "sentiment": "positive",
  "significance": "high",
  "claim_count": 117,
  "reading_time_minutes": 9,
  "published_date": "2026-05-27",
  "created_on": "2026-05-31T06:55:20.650008+00:00",
  "claims": [
    {
      "id": 1780210521254896034,
      "text": "For a frontier 1T model checkpoint, shipping the whole model is on the order of a terabyte per step.",
      "evidence_type": "direct_quote",
      "confidence": "measured",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521222011709,
      "text": "Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521286693849,
      "text": "A TRL PR landed that encodes just the changed elements as a sparse safetensors file.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521310943046,
      "text": "A fully disaggregated training run was carried out with the trainer on one box, vLLM in a Hugging Face Space, and the Wordle environment in another Space.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521306795579,
      "text": "On Qwen3-0.6B, the per-step payload drops from 1.2 GB to 20 to 35 MB using delta weight sync.",
      "evidence_type": "direct_quote",
      "confidence": "measured",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521318678556,
      "text": "The disaggregated training setup required no shared cluster, no RDMA, and no VPN.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521322364901,
      "text": "Async RL training has become significantly cheaper due to delta weight sync.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521430863698,
      "text": "The PULSE paper (Mihai & Belilovsky, 2026) formalizes the argument for bf16 weight sparsity.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521482012693,
      "text": "The Python interface for Hugging Face Buckets uses `batch_bucket_files` and `download_bucket_files` functions.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521437981671,
      "text": "The bf16 visibility threshold is |w|/256.",
      "evidence_type": "direct_quote",
      "confidence": "measured",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521258112135,
      "text": "Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical.",
      "evidence_type": "direct_quote",
      "confidence": "measured",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521274317338,
      "text": "Even in the worst case, at least 98% of bf16 weights are bit-identical between consecutive RL optimizer steps.",
      "evidence_type": "direct_quote",
      "confidence": "measured",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521278840348,
      "text": "The actual delta (changed elements) between consecutive RL optimizer steps is tiny.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521290064545,
      "text": "The sparse safetensors file is uploaded to a Hugging Face Bucket.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521298856099,
      "text": "vLLM is instructed to fetch the sparse safetensors file from the Hugging Face Bucket.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521246062536,
      "text": "For a 7B model in bf16, shipping the whole model is 14 GB per step.",
      "evidence_type": "direct_quote",
      "confidence": "measured",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521398674501,
      "text": "Most of the weights have not actually changed between two adjacent RL steps.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521406016856,
      "text": "Sending only the changed parts reduces the bandwidth bill by roughly two orders of magnitude.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521410751559,
      "text": "Routing tiny diffs through a shared object store eliminates the need for the trainer and the inference cluster to be in the same data center.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521314178259,
      "text": "Weights flowed through a single Hub bucket in the disaggregated training setup.",
      "evidence_type": "direct_quote",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521474001095,
      "text": "The system observes which bytes flipped to determine the change mask, rather than predicting it analytically.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521478580806,
      "text": "A Hugging Face Bucket is a repo type on the Hub designed for high-frequency object storage without commit ceremony or PR workflow.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521486004461,
      "text": "Hugging Face Buckets are backed by Xet, the Hub's content-defined chunking storage layer.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521494503774,
      "text": "Xet deduplicates uploaded files against everything already in the bucket by slicing them into content-defined chunks.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521498577657,
      "text": "Even if full anchors were uploaded every step, Xet would only transfer the changed chunks.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521502069114,
      "text": "The Hugging Face Bucket approach is an open-source equivalent of the 'shared S3 bucket' used by Fireworks and Cursor.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521506355417,
      "text": "The Hub's storage layer (Xet) knows about content hashing, and existing HF tokens have permissions for buckets.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521518088075,
      "text": "The bucket-based system composes natively with other Hugging Face stack components like Spaces and datasets.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521522198417,
      "text": "The full architecture involves a Trainer, an HF Bucket, a vLLM rollout server, and an Environment.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521530993954,
      "text": "The Trainer runs the optimizer and emits sparse deltas; it can be located anywhere.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521534297749,
      "text": "The HF Bucket acts as the single shared substrate with `anchors/` for full snapshots and `deltas/` for sparse patches.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521537027446,
      "text": "The vLLM rollout server pulls from the bucket, applies deltas, and serves rollouts; it need not be co-located with the trainer.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521542110327,
      "text": "The Environment hangs off the rollout server via HTTP or function calls.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521558719500,
      "text": "The trainer and rollout server never talk to each other directly about weights, exchanging only a tiny POST containing repo_id and filename.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521570068314,
      "text": "The actual byte transfer happens between each side and the bucket, in parallel, without a shared network fabric.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521574392979,
      "text": "The rollout server can be in another region, cloud, or behind NAT inside a Hugging Face Space.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521586155509,
      "text": "N inference replicas can pull the same delta from the same bucket, and Xet deduplicates bytes across them.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521594400288,
      "text": "The trainer never needs to know how many inference replicas exist, where they are, or whether one has crashed.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521598578718,
      "text": "Safetensors is chosen as the on-disk and on-wire format for delta weight sync.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521602447683,
      "text": "Safetensors is the canonical checkpoint format on the Hub and can be read by any reasonable framework.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521610294219,
      "text": "Safetensors headers carry arbitrary string metadata, used to hide the protocol.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521614011861,
      "text": "Anchors are normal checkpoints with full bf16 weights, written every N (default 10) syncs.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521622415166,
      "text": "Deltas store two entries for each changed parameter: a flat int32 tensor of element indices and a bf16 tensor of values.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521634051803,
      "text": "A delta is a file that can be opened with `safe_open(...)` in Python and inspected.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521638309278,
      "text": "Delta metadata is self-describing, allowing the receiver to branch based on `sparse=True/False`.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521646910198,
      "text": "Delta files allow zero-copy via mmap on the inference side.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521650175161,
      "text": "Each new inference replica needs to grab the most recent anchor and then replay the deltas since.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521658577463,
      "text": "The trainer uses a `BF16ChangeDetector` with pre-step and post-step optimizer hooks to identify flipped bf16 elements.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521666140998,
      "text": "The `BF16ChangeDetector` works by snapshotting bf16 weights before an optimizer step, then diffing after.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521674983410,
      "text": "Predicting the change mask from Adam's m and v statistics resulted in a recall of around 30%.",
      "evidence_type": "paraphrase",
      "confidence": "measured",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521682415651,
      "text": "The analytical bf16 threshold was not tight enough for accurate prediction of the change mask.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521686035039,
      "text": "The system pays the cost of one bf16 CPU snapshot of the model on the trainer side for change detection.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521691176474,
      "text": "The `new_sync_weight` flow involves uploading while inference runs, pausing vLLM, signaling/updating weights, and resuming.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521693958488,
      "text": "Inference was paused for 1.1 seconds during a weight sync, while the total sync time was 9.4 seconds.",
      "evidence_type": "paraphrase",
      "confidence": "measured",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521706654655,
      "text": "The remainder of the 9.4-second sync time was spent uploading, which occurred in the background while the rollout server generated tokens.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521710510010,
      "text": "With NCCL, the full sync time would be paid as pause time.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521714813893,
      "text": "vLLM has a `WeightTransferEngine` abstraction for receiving weight updates.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521722175942,
      "text": "A `DeltaWeightTransferEngine` is implemented whose `receive_weights` method downloads delta files, applies patches, and hands full tensors to vLLM.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521726023843,
      "text": "The `DeltaWeightTransferEngine` registers via vLLM's `--worker-extension-cls` flag, requiring no vLLM fork.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    },
    {
      "id": 1780210521730301933,
      "text": "vLLM has an in-flight effort (`vllm-project/vllm#40096`) to land native sparse weight transfer.",
      "evidence_type": "paraphrase",
      "confidence": "stated",
      "home_domain": "engineering-technology",
      "published_date": "2026-05-27"
    }
  ],
  "tags": [
    {
      "id": 17730933552304436,
      "slug": "cursor-ai-organization",
      "name": "Cursor AI",
      "type": "organization"
    },
    {
      "id": 17733519606213111,
      "slug": "fireworks-organization",
      "name": "Fireworks",
      "type": "organization"
    },
    {
      "id": 17726459355176948,
      "slug": "hugging-face-hub-organization",
      "name": "Hugging Face Hub",
      "type": "organization"
    },
    {
      "id": 17733541291350092,
      "slug": "ing-organization",
      "name": "ING",
      "type": "organization"
    },
    {
      "id": 17802103782911013,
      "slug": "amine-dirhoussi-person",
      "name": "Amine Dirhoussi",
      "type": "person"
    },
    {
      "id": 17791452103823983,
      "slug": "ai-infrastructure-topic",
      "name": "AI Infrastructure",
      "type": "topic"
    },
    {
      "id": 17723038993834764,
      "slug": "artificial-intelligence-topic",
      "name": "Artificial Intelligence",
      "type": "topic"
    },
    {
      "id": 17791452103543441,
      "slug": "gpu-clusters-topic",
      "name": "GPU Clusters",
      "type": "topic"
    },
    {
      "id": 17791452102628180,
      "slug": "inference-optimization-topic",
      "name": "Inference Optimization",
      "type": "topic"
    },
    {
      "id": 17782518580601405,
      "slug": "machine-learning-research-topic",
      "name": "Machine Learning Research",
      "type": "topic"
    }
  ]
}