Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

For a frontier 1T model checkpoint, shipping the whole model is on the order of a terabyte per step.

direct_quotemeasuredengineering-technologyMay 27, 2026

Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine.

A TRL PR has landed that encodes only the changed elements as a sparse safetensors file.
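
As a rough illustration of what encoding only the changed elements can look like, the sketch below compares two snapshots of a tensor's raw bf16 bit patterns and keeps just the positions that differ. The function names `encode_delta` and `apply_delta` and the index-plus-value layout are assumptions made for illustration; the layout of the sparse safetensors file written by the TRL PR is not described here.

```python
from array import array

def encode_delta(old_bits, new_bits):
    """Return (indices, values) for every element whose 16-bit pattern changed."""
    if len(old_bits) != len(new_bits):
        raise ValueError("snapshots must have the same length")
    indices = array("I")  # positions of changed elements (hypothetical layout)
    values = array("H")   # new bf16 bit patterns at those positions
    for i, (old, new) in enumerate(zip(old_bits, new_bits)):
        if old != new:
            indices.append(i)
            values.append(new)
    return indices, values

def apply_delta(bits, indices, values):
    """Receiver side: overwrite only the changed positions, in place."""
    for i, v in zip(indices, values):
        bits[i] = v
    return bits

# Two snapshots of a toy 8-element tensor, stored as raw bf16 bit patterns.
step_n = array("H", [0x3F80, 0x4000, 0xBF80, 0x3E00, 0x0000, 0x3F00, 0x4040, 0xC000])
step_n1 = array("H", step_n)
step_n1[3] = 0x3E01  # only two elements changed between the snapshots
step_n1[6] = 0x4041

indices, values = encode_delta(step_n, step_n1)
replica = apply_delta(array("H", step_n), indices, values)
```

After `apply_delta`, the receiver's copy matches the trainer's new snapshot even though only two index/value pairs were transferred.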

A fully disaggregated training run was carried out with the trainer on one box, vLLM in a Hugging Face Space, and the Wordle environment in another Space.

On Qwen3-0.6B, delta weight sync cuts the per-step payload from 1.2 GB to between 20 and 35 MB.

The disaggregated training setup required no shared cluster, no RDMA, and no VPN.

Async RL training has become significantly cheaper due to delta weight sync.

The PULSE paper (Mihai & Belilovsky, 2026) formalizes the argument for bf16 weight sparsity.

The Python interface for Hugging Face Buckets uses `batch_bucket_files` and `download_bucket_files` functions.

The bf16 visibility threshold is |w|/256.
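
The threshold follows from bf16 keeping 8 significant bits (1 implicit plus 7 stored): neighbouring values near w are spaced about |w|/128 apart, so an update has to exceed half that spacing to round to a different stored value. The sketch below checks this for w = 1.0 using a hand-written round-to-nearest-even conversion. The helper names are hypothetical, and it assumes the weight is held exactly in bf16; a trainer that accumulates updates in a higher-precision master copy would behave differently over several steps.

```python
import struct

def to_bf16_bits(x):
    """Round a finite float to bf16 (round-to-nearest-even), return the 16-bit pattern.

    NaN and infinity handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # ties go to the even mantissa
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def update_is_visible(w, delta):
    """True if adding delta to w changes the stored bf16 bit pattern."""
    return to_bf16_bits(w) != to_bf16_bits(w + delta)

w = 1.0
below = update_is_visible(w, abs(w) / 512)  # under |w|/256: rounds back to w
above = update_is_visible(w, abs(w) / 128)  # over |w|/256: lands on the next bf16 value
```

Here `below` is `False` and `above` is `True`. For weights that are not exact powers of two the relative threshold is somewhat smaller than 1/256, so |w|/256 is best read as a rule of thumb.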

Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical.
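
One way to measure this figure is to compare the raw 16-bit patterns of two snapshots element by element. The sketch below does that on toy data; `bit_identical_fraction` is a hypothetical helper, and the inputs are made up to land on 99%, not taken from a real run.

```python
def bit_identical_fraction(old_bits, new_bits):
    """Fraction of elements whose bf16 bit pattern is unchanged between snapshots."""
    if len(old_bits) != len(new_bits):
        raise ValueError("snapshots must have the same length")
    if not old_bits:
        return 1.0
    same = sum(1 for old, new in zip(old_bits, new_bits) if old == new)
    return same / len(old_bits)

# Toy data: 200 weights, 2 of which changed their stored bits.
old = [0x3F80] * 200
new = list(old)
new[17] = 0x3F81
new[150] = 0x3F7F

fraction = bit_identical_fraction(old, new)  # 198 of 200 unchanged
```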

Even in the worst case, at least 98% of bf16 weights are bit-identical between consecutive RL optimizer steps.

The actual delta (changed elements) between consecutive RL optimizer steps is tiny.

The sparse safetensors file is uploaded to a Hugging Face Bucket.

vLLM is instructed to fetch the sparse safetensors file from the Hugging Face Bucket.
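
The publish-then-fetch flow can be sketched without any real services. In the example below a temporary local directory stands in for the Hugging Face Bucket, a JSON file stands in for the sparse safetensors file, and a plain list stands in for the inference engine's weights. No Hub or vLLM API is called, and every function name and the file naming scheme are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def publish_delta(store: Path, step: int, indices, values) -> Path:
    """Trainer side: write one delta file per optimizer step into the shared store."""
    path = store / f"delta_step_{step:06d}.json"  # hypothetical naming scheme
    path.write_text(json.dumps({"step": step, "indices": indices, "values": values}))
    return path

def fetch_and_apply(store: Path, step: int, weights: list) -> list:
    """Inference side: read the delta for `step` and patch the local weight copy."""
    payload = json.loads((store / f"delta_step_{step:06d}.json").read_text())
    for i, v in zip(payload["indices"], payload["values"]):
        weights[i] = v
    return weights

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)  # stand-in for the shared bucket
    publish_delta(store, step=1, indices=[2, 5], values=[16257, 16448])
    engine_weights = [16256] * 8  # inference engine's copy before the sync
    engine_weights = fetch_and_apply(store, step=1, weights=engine_weights)
```

The only thing the two sides share is the store location and the step number, which is why the trainer and the inference engine do not need a direct network path to each other.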

For a 7B model in bf16, shipping the whole model is 14 GB per step.
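
The figure is simply the parameter count times two bytes per bf16 value. The sketch below applies the same arithmetic to the three model sizes mentioned in this piece, using nominal parameter counts (0.6B, 7B, 1T) as an assumption in place of exact ones.

```python
BYTES_PER_BF16 = 2

def full_checkpoint_bytes(num_params: int) -> int:
    """Bytes needed to ship every weight once in bf16."""
    return num_params * BYTES_PER_BF16

# Nominal parameter counts; exact counts for real checkpoints differ slightly.
sizes_gb = {
    "Qwen3-0.6B": full_checkpoint_bytes(600_000_000) / 1e9,
    "7B": full_checkpoint_bytes(7_000_000_000) / 1e9,
    "1T": full_checkpoint_bytes(1_000_000_000_000) / 1e9,
}
```

This gives about 1.2 GB, 14 GB, and 2000 GB respectively, the last being consistent with "on the order of a terabyte" for a 1T model.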

Most of the weights have not actually changed between two adjacent RL steps.

Sending only the changed parts reduces the bandwidth bill by roughly two orders of magnitude.
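
As a quick check against the Qwen3-0.6B numbers quoted in this piece (1.2 GB full payload, 20 to 35 MB delta payload), the reduction factor can be worked out directly. The variable names are illustrative only.

```python
import math

full_mb = 1200.0                          # 1.2 GB full payload
delta_mb_low, delta_mb_high = 20.0, 35.0  # reported delta payload range

best_case = full_mb / delta_mb_low        # smallest delta
worst_case = full_mb / delta_mb_high      # largest delta
orders_of_magnitude = (math.log10(worst_case), math.log10(best_case))
```

For this small model the reduction works out to between roughly 34x and 60x, or about 1.5 to 1.8 orders of magnitude.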

Routing tiny diffs through a shared object store eliminates the need for the trainer and the inference cluster to be in the same data center.

Weights flowed through a single Hub bucket in the disaggregated training setup.

The system observes which bytes flipped to determine the change mask, rather than predicting it analytically.