
Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

huggingface-rl-blog · engineering-technology · May 27, 2026
Claims: 117
Domain: engineering-technology
Reading time: 9 min
Record: Shipping a Trillion Parameters With a Hub Bucket: Delta Weig

Claims from this story

Every atomic assertion extracted from the underlying record, ranked by evidence strength.

For a frontier 1T model checkpoint, shipping the whole model is on the order of a terabyte per step.

direct_quote · measured · engineering-technology · May 27, 2026

Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine.

direct_quote · stated · engineering-technology · May 27, 2026

A TRL PR landed that encodes only the changed elements as a sparse safetensors file.

direct_quote · stated · engineering-technology · May 27, 2026

A full disaggregated training was run where the trainer was on one box, vLLM lived in a Hugging Face Space, and the Wordle environment lived in another Space.

direct_quote · stated · engineering-technology · May 27, 2026

On Qwen3-0.6B, delta weight sync cuts the per-step payload from 1.2 GB to 20–35 MB.

direct_quote · measured · engineering-technology · May 27, 2026
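
The size of that drop can be sanity-checked with a back-of-envelope estimate. The sketch below is illustrative only: it assumes a nominal 600 million parameters for Qwen3-0.6B, 2 bytes per bf16 weight, and 6 bytes per changed element (a 4-byte int32 index plus a 2-byte bf16 value), and it ignores safetensors header overhead. The function names are hypothetical.

```python
BF16_BYTES = 2   # size of one bf16 weight
INDEX_BYTES = 4  # size of one int32 element index


def full_payload_bytes(n_params: int) -> int:
    """Bytes needed to ship every weight in bf16."""
    return n_params * BF16_BYTES


def delta_payload_bytes(n_params: int, changed_per_thousand: int) -> int:
    """Bytes needed to ship only changed elements as (index, value) pairs."""
    n_changed = n_params * changed_per_thousand // 1000
    return n_changed * (INDEX_BYTES + BF16_BYTES)


n = 600_000_000  # assumed nominal parameter count
print(full_payload_bytes(n) / 1e9, "GB full")              # 1.2 GB full
print(delta_payload_bytes(n, 10) / 1e6, "MB at 1% changed")   # 36.0 MB
print(delta_payload_bytes(n, 5) / 1e6, "MB at 0.5% changed")  # 18.0 MB
```

Under these assumptions, a changed fraction between roughly 0.5% and 1% lands in the same range as the measured 20–35 MB.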

The disaggregated training setup required no shared cluster, no RDMA, and no VPN.

direct_quote · stated · engineering-technology · May 27, 2026

Async RL training has become significantly cheaper due to delta weight sync.

direct_quote · stated · engineering-technology · May 27, 2026

The PULSE paper (Mihai & Belilovsky, 2026) formalizes the argument for bf16 weight sparsity.

direct_quote · stated · engineering-technology · May 27, 2026

The Python interface for Hugging Face Buckets uses `batch_bucket_files` and `download_bucket_files` functions.

direct_quote · stated · engineering-technology · May 27, 2026

The bf16 visibility threshold is |w|/256.

direct_quote · measured · engineering-technology · May 27, 2026
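
To see why small updates vanish in bf16, the sketch below emulates bf16 rounding in pure Python by keeping the top 16 bits of a float32 with round-to-nearest-even. It is an illustration rather than the detector used in the PR; `bf16_bits` is a hypothetical helper that does not handle NaN or infinity. The |w|/256 figure is an approximate cutoff, since the exact threshold depends on the sign of the update and where w sits between powers of two.

```python
import struct


def bf16_bits(x: float) -> int:
    """Round x to bf16 (nearest-even) and return its 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)                   # nearest-even rounding bias
    return ((bits + bias) >> 16) & 0xFFFF


w = 1.0
small_update = w / 512  # below |w|/256: rounds back to the same bf16 value
large_update = w / 64   # above |w|/256: lands on a different bf16 value

print(bf16_bits(w + small_update) == bf16_bits(w))  # True
print(bf16_bits(w + large_update) == bf16_bits(w))  # False
```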

Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical.

direct_quote · measured · engineering-technology · May 27, 2026

Even in the worst case, at least 98% of bf16 weights are bit-identical between consecutive RL optimizer steps.

direct_quote · measured · engineering-technology · May 27, 2026

The actual delta (changed elements) between consecutive RL optimizer steps is tiny.

direct_quote · stated · engineering-technology · May 27, 2026

The sparse safetensors file is uploaded to a Hugging Face Bucket.

direct_quote · stated · engineering-technology · May 27, 2026

vLLM is instructed to fetch the sparse safetensors file from the Hugging Face Bucket.

direct_quote · stated · engineering-technology · May 27, 2026

For a 7B model in bf16, shipping the whole model is 14 GB per step.

direct_quote · measured · engineering-technology · May 27, 2026

Most of the weights have not actually changed between two adjacent RL steps.

direct_quote · stated · engineering-technology · May 27, 2026

Sending only the changed parts reduces the bandwidth bill by roughly two orders of magnitude.

direct_quote · stated · engineering-technology · May 27, 2026

Routing tiny diffs through a shared object store eliminates the need for the trainer and the inference cluster to be in the same data center.

direct_quote · stated · engineering-technology · May 27, 2026

Weights flowed through a single Hub bucket in the disaggregated training setup.

direct_quote · stated · engineering-technology · May 27, 2026

The system observes which bytes flipped to determine the change mask, rather than predicting it analytically.

paraphrase · stated · engineering-technology · May 27, 2026

A Hugging Face Bucket is a repo type on the Hub designed for high-frequency object storage without commit ceremony or PR workflow.

paraphrase · stated · engineering-technology · May 27, 2026

Hugging Face Buckets are backed by Xet, the Hub's content-defined chunking storage layer.

paraphrase · stated · engineering-technology · May 27, 2026

Xet deduplicates uploaded files against everything already in the bucket by slicing them into content-defined chunks.

paraphrase · stated · engineering-technology · May 27, 2026

Even if full anchors were uploaded every step, Xet would only transfer the changed chunks.

paraphrase · stated · engineering-technology · May 27, 2026

The Hugging Face Bucket approach is an open-source equivalent of the 'shared S3 bucket' used by Fireworks and Cursor.

paraphrase · stated · engineering-technology · May 27, 2026

The Hub's storage layer (Xet) knows about content hashing, and existing HF tokens have permissions for buckets.

paraphrase · stated · engineering-technology · May 27, 2026

The bucket-based system composes natively with other Hugging Face stack components like Spaces and datasets.

paraphrase · stated · engineering-technology · May 27, 2026

The full architecture involves a Trainer, an HF Bucket, a vLLM rollout server, and an Environment.

paraphrase · stated · engineering-technology · May 27, 2026

The Trainer runs the optimizer and emits sparse deltas; it can be located anywhere.

paraphrase · stated · engineering-technology · May 27, 2026

The HF Bucket acts as the single shared substrate with `anchors/` for full snapshots and `deltas/` for sparse patches.

paraphrase · stated · engineering-technology · May 27, 2026

The vLLM rollout server pulls from the bucket, applies deltas, and serves rollouts; it need not be co-located with the trainer.

paraphrase · stated · engineering-technology · May 27, 2026

The Environment hangs off the rollout server via HTTP or function calls.

paraphrase · stated · engineering-technology · May 27, 2026

The trainer and rollout server never talk to each other directly about weights, exchanging only a tiny POST containing repo_id and filename.

paraphrase · stated · engineering-technology · May 27, 2026
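
As a rough illustration of how small that control message is, the sketch below builds a JSON body with the two fields named in the claim. The serialization format, the helper name, and the example repo and file names are all assumptions; the source does not specify the actual wire format or endpoint.

```python
import json


def build_sync_notification(repo_id: str, filename: str) -> bytes:
    """Serialize the two fields the trainer sends to the rollout server."""
    return json.dumps({"repo_id": repo_id, "filename": filename}).encode("utf-8")


# Hypothetical bucket id and delta path, for illustration only.
body = build_sync_notification("my-org/rl-weights", "deltas/step_000042.safetensors")
print(len(body), "bytes")
```

The message stays well under a kilobyte regardless of model size, because the weights themselves travel through the bucket.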

The actual byte transfer happens between each side and the bucket, in parallel, without a shared network fabric.

paraphrase · stated · engineering-technology · May 27, 2026

The rollout server can be in another region, cloud, or behind NAT inside a Hugging Face Space.

paraphrase · stated · engineering-technology · May 27, 2026

N inference replicas can pull the same delta from the same bucket, and Xet deduplicates bytes across them.

paraphrase · stated · engineering-technology · May 27, 2026

The trainer never needs to know how many inference replicas exist, where they are, or whether one has crashed.

paraphrase · stated · engineering-technology · May 27, 2026

Safetensors is chosen as the on-disk and on-wire format for delta weight sync.

paraphrase · stated · engineering-technology · May 27, 2026

Safetensors is the canonical checkpoint format on the Hub, and any reasonable framework can read it.

paraphrase · stated · engineering-technology · May 27, 2026

Safetensors headers carry arbitrary string metadata, which is where the protocol information is stored.

paraphrase · stated · engineering-technology · May 27, 2026

Anchors are normal checkpoints with full bf16 weights, written every N (default 10) syncs.

paraphrase · stated · engineering-technology · May 27, 2026

Deltas store two entries for each changed parameter: a flat int32 tensor of element indices and a bf16 tensor of values.

paraphrase · stated · engineering-technology · May 27, 2026
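
The index-plus-value layout can be sketched without any tensor library. In the minimal example below, a flattened parameter is a plain list of 16-bit integers standing in for bf16 bit patterns, and the `.indices` and `.values` key suffixes are hypothetical; the real entry names in the TRL delta files are not given in the source.

```python
from typing import Dict, List


def encode_delta(name: str, old: List[int], new: List[int]) -> Dict[str, List[int]]:
    """Keep only positions whose bit pattern changed, as two parallel lists."""
    changed = [i for i, (a, b) in enumerate(zip(old, new)) if a != b]
    return {
        f"{name}.indices": changed,                   # flat element indices
        f"{name}.values": [new[i] for i in changed],  # new values at those indices
    }


def apply_delta(name: str, base: List[int], delta: Dict[str, List[int]]) -> List[int]:
    """Patch a copy of the base weights with the sparse delta."""
    patched = list(base)
    for i, v in zip(delta[f"{name}.indices"], delta[f"{name}.values"]):
        patched[i] = v
    return patched


old = [0x3F80, 0x4000, 0xBF80, 0x0000]
new = [0x3F80, 0x4001, 0xBF80, 0x3C00]
delta = encode_delta("layer.weight", old, new)
print(delta["layer.weight.indices"])  # [1, 3]
```

Applying the delta to the old weights reproduces the new weights exactly, which is what lets the receiver rebuild full tensors from a sparse file.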

A delta is a file that can be opened with `safe_open(...)` in Python and inspected.

paraphrase · stated · engineering-technology · May 27, 2026

Delta metadata is self-describing, allowing the receiver to branch based on `sparse=True/False`.

paraphrase · stated · engineering-technology · May 27, 2026

Delta files allow zero-copy via mmap on the inference side.

paraphrase · stated · engineering-technology · May 27, 2026

Each new inference replica needs to grab the most recent anchor and then replay the deltas since.

paraphrase · stated · engineering-technology · May 27, 2026
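
The bootstrap rule for a new replica can be sketched as a small planning function. This is an illustration under assumptions: it takes step numbers rather than real bucket paths, supposes that anchor steps have no separate delta, and `plan_bootstrap` is a hypothetical name rather than part of TRL or vLLM.

```python
from typing import List, Tuple


def plan_bootstrap(anchor_steps: List[int], delta_steps: List[int]) -> Tuple[int, List[int]]:
    """Pick the newest anchor, then the deltas to replay on top of it, in order."""
    if not anchor_steps:
        raise ValueError("no anchor available to bootstrap from")
    latest_anchor = max(anchor_steps)
    replay = sorted(s for s in delta_steps if s > latest_anchor)
    return latest_anchor, replay


# Anchors every 10 syncs (the stated default); deltas on the steps in between.
anchors = [0, 10, 20]
deltas = [s for s in range(1, 24) if s % 10 != 0]
print(plan_bootstrap(anchors, deltas))  # (20, [21, 22, 23])
```

With anchors every 10 syncs, a new replica in this model replays at most 9 deltas before it is current.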

The trainer uses a `BF16ChangeDetector` with pre-step and post-step optimizer hooks to identify flipped bf16 elements.

paraphrase · stated · engineering-technology · May 27, 2026

The `BF16ChangeDetector` works by snapshotting bf16 weights before an optimizer step, then diffing after.

paraphrase · stated · engineering-technology · May 27, 2026
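
The snapshot-then-diff idea can be shown with a simplified stand-in. The class below is not the `BF16ChangeDetector` from the PR: it is a hypothetical pure-Python sketch in which parameters are lists of 16-bit integers representing bf16 bit patterns, and `pre_step` and `post_step` are called by hand instead of being registered as optimizer hooks.

```python
from typing import Dict, List


class SnapshotChangeDetector:
    """Hypothetical stand-in: copy weights before a step, diff them after."""

    def __init__(self) -> None:
        self._snapshot: Dict[str, List[int]] = {}

    def pre_step(self, params: Dict[str, List[int]]) -> None:
        # One full copy of the weights is the memory cost of this approach.
        self._snapshot = {name: list(values) for name, values in params.items()}

    def post_step(self, params: Dict[str, List[int]]) -> Dict[str, List[int]]:
        """Return, per parameter, the indices whose bit pattern changed."""
        changed: Dict[str, List[int]] = {}
        for name, values in params.items():
            before = self._snapshot[name]
            idx = [i for i, (a, b) in enumerate(zip(before, values)) if a != b]
            if idx:
                changed[name] = idx
        return changed


params = {"w": [0x3F80, 0x4000, 0xBF80], "b": [0x0000, 0x0000]}
detector = SnapshotChangeDetector()
detector.pre_step(params)
params["w"][1] = 0x4001  # simulate an optimizer step flipping one element
print(detector.post_step(params))  # {'w': [1]}
```

Observing the flipped elements directly gives exact results by construction, at the price of holding a second copy of the weights.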

Predicting the change mask from Adam's m and v statistics resulted in a recall of around 30%.

paraphrase · measured · engineering-technology · May 27, 2026

The analytical bf16 threshold was not tight enough for accurate prediction of the change mask.

paraphrase · stated · engineering-technology · May 27, 2026

The system pays the cost of one bf16 CPU snapshot of the model on the trainer side for change detection.

paraphrase · stated · engineering-technology · May 27, 2026

The `new_sync_weight` flow involves uploading while inference runs, pausing vLLM, signaling/updating weights, and resuming.

paraphrase · stated · engineering-technology · May 27, 2026
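
The ordering of that flow, and why the pause stays short, can be sketched with a fake server. Everything here is hypothetical: `FakeRolloutServer`, `sync_weights`, and the `upload` callable are illustrative stand-ins, not the TRL or vLLM interfaces. The point is only that the slow upload finishes before the server is paused.

```python
from typing import Callable, List


class FakeRolloutServer:
    """Hypothetical stand-in that records what happens to it."""

    def __init__(self) -> None:
        self.paused = False
        self.events: List[str] = []

    def pause(self) -> None:
        self.paused = True
        self.events.append("pause")

    def update_weights(self, filename: str) -> None:
        self.events.append(f"update:{filename}")

    def resume(self) -> None:
        self.paused = False
        self.events.append("resume")


def sync_weights(server: FakeRolloutServer, upload: Callable[[str], None], filename: str) -> None:
    upload(filename)  # slow part: the server is not paused, so rollouts continue
    server.pause()    # short part: only the weight swap happens while paused
    try:
        server.update_weights(filename)
    finally:
        server.resume()  # resume even if the update fails


server = FakeRolloutServer()
paused_during_upload: List[bool] = []
sync_weights(server, lambda f: paused_during_upload.append(server.paused), "deltas/step_1.safetensors")
print(paused_during_upload, server.events)
```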

Inference was paused for 1.1 seconds during a weight sync, while the total sync time was 9.4 seconds.

paraphrase · measured · engineering-technology · May 27, 2026

The rest of the 9.4-second sync, about 8.3 seconds, was spent uploading, which ran in the background while the rollout server kept generating tokens.

paraphrase · stated · engineering-technology · May 27, 2026

With NCCL, the full sync time would be paid as pause time.

paraphrase · stated · engineering-technology · May 27, 2026

vLLM has a `WeightTransferEngine` abstraction for custom weight transfer.

paraphrase · stated · engineering-technology · May 27, 2026

A `DeltaWeightTransferEngine` is implemented whose `receive_weights` method downloads delta files, applies patches, and hands full tensors to vLLM.

paraphrase · stated · engineering-technology · May 27, 2026

The `DeltaWeightTransferEngine` registers via vLLM's `--worker-extension-cls` flag, requiring no vLLM fork.

paraphrase · stated · engineering-technology · May 27, 2026

vLLM has an in-flight effort (`vllm-project/vllm#40096`) to land native sparse weight transfer.

paraphrase · stated · engineering-technology · May 27, 2026