Every atomic assertion extracted from the underlying record, ranked by evidence strength.
For a frontier 1T model checkpoint, shipping the whole model is on the order of a terabyte per step.
Async RL has a dirty secret: every step, the trainer has to ship the whole model to the inference engine.
A TRL PR landed that encodes only the changed elements as a sparse safetensors file.
A full disaggregated training was run where the trainer was on one box, vLLM lived in a Hugging Face Space, and the Wordle environment lived in another Space.
On Qwen3-0.6B, delta weight sync cuts the per-step payload from 1.2 GB to between 20 and 35 MB.
The disaggregated training setup required no shared cluster, no RDMA, and no VPN.
Async RL training has become significantly cheaper due to delta weight sync.
The PULSE paper (Mihai & Belilovsky, 2026) formalizes the argument for bf16 weight sparsity.
The Python interface for Hugging Face Buckets uses `batch_bucket_files` and `download_bucket_files` functions.
The bf16 visibility threshold is roughly |w|/256; an update well below it is rounded away and leaves the stored bf16 value unchanged.
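The threshold can be checked by hand. The sketch below emulates bf16 rounding in pure Python by keeping the top 16 bits of a float32 with round-to-nearest-even; `to_bf16_bits` is a helper written for this illustration, not a library function. For a weight of 1.0, an update of 1/512 is lost while an update of 1/128 flips a bit.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Return the 16-bit bf16 pattern of a finite float (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)                   # ties go to the even result
    return ((bits + bias) >> 16) & 0xFFFF

w = 1.0
small = w / 512   # below |w|/256: rounded away
large = w / 128   # one full bf16 step at w = 1.0

unchanged = to_bf16_bits(w + small) == to_bf16_bits(w)
changed = to_bf16_bits(w + large) != to_bf16_bits(w)
print(unchanged, changed)
```

The exact cutoff depends on where w sits between two powers of two, so |w|/256 is best read as an approximate bound rather than a sharp line.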
Between two consecutive RL optimizer steps, roughly 99% of bf16 weights are bit-identical.
Even in the worst case, at least 98% of bf16 weights are bit-identical between consecutive RL optimizer steps.
The actual delta (changed elements) between consecutive RL optimizer steps is tiny.
The sparse safetensors file is uploaded to a Hugging Face Bucket.
vLLM is instructed to fetch the sparse safetensors file from the Hugging Face Bucket.
For a 7B model in bf16, shipping the whole model is 14 GB per step.
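The full-sync sizes follow from two bytes per bf16 parameter. A minimal arithmetic sketch, assuming nominal parameter counts (0.6B, 7B, 1T) and decimal gigabytes:

```python
BYTES_PER_BF16 = 2

def full_sync_gb(num_params: float) -> float:
    """Size of a full bf16 weight payload in decimal gigabytes."""
    return num_params * BYTES_PER_BF16 / 1e9

for name, n in [("0.6B", 0.6e9), ("7B", 7e9), ("1T", 1e12)]:
    print(f"{name}: {full_sync_gb(n):,.1f} GB per full sync")
```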
Most of the weights have not actually changed between two adjacent RL steps.
Sending only the changed parts reduces the bandwidth bill by roughly two orders of magnitude.
Routing tiny diffs through a shared object store eliminates the need for trainer and inference cluster to be in the same data center.
Weights flowed through a single Hub bucket in the disaggregated training setup.
The system observes which bytes flipped to determine the change mask, rather than predicting it analytically.
A Hugging Face Bucket is a repo type on the Hub designed for high-frequency object storage without commit ceremony or PR workflow.
Hugging Face Buckets are backed by Xet, the Hub's content-defined chunking storage layer.
Xet deduplicates uploaded files against everything already in the bucket by slicing them into content-defined chunks.
Even if full anchors were uploaded every step, Xet would only transfer the changed chunks.
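To see why content-defined chunking only re-sends what changed, here is a toy chunker. It is an illustration of the general technique, not Xet's algorithm: the window size, the mask, and the use of CRC32 and SHA-256 are all assumptions chosen for brevity. A chunk boundary is placed wherever a hash of the preceding few bytes matches a pattern, so boundaries depend on content rather than on file offsets.

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes hashed to decide each boundary (illustrative value)
MASK = 0xFF   # boundary when the low 8 bits of the hash are zero

def chunk(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a hash of the preceding WINDOW bytes."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        if zlib.crc32(data[i - WINDOW + 1 : i + 1]) & MASK == 0:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def digests(data: bytes) -> set[str]:
    """Content hashes of all chunks; a store would key chunks by these."""
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = bytearray(original)
edited[10_000] ^= 0xFF          # flip one byte in the middle
edited = bytes(edited)

stored = digests(original)
to_upload = digests(edited) - stored   # chunks the store has not seen yet
print(f"{len(to_upload)} of {len(chunk(edited))} chunks need uploading")
```

A single localized edit can only disturb the chunks whose bytes or boundary windows overlap it, so everything else is already in the store.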
The Hugging Face Bucket approach is an open-source equivalent of the 'shared S3 bucket' used by Fireworks and Cursor.
The Hub's storage layer (Xet) knows about content hashing, and existing HF tokens have permissions for buckets.
The bucket-based system composes natively with other Hugging Face stack components like Spaces and datasets.
The full architecture involves a Trainer, an HF Bucket, a vLLM rollout server, and an Environment.
The Trainer runs the optimizer and emits sparse deltas, located wherever desired.
The HF Bucket acts as the single shared substrate with `anchors/` for full snapshots and `deltas/` for sparse patches.
The vLLM rollout server pulls from the bucket, applies deltas, and serves rollouts, not necessarily co-located with the trainer.
The Environment hangs off the rollout server via HTTP or function calls.
The trainer and rollout server never talk to each other directly about weights, exchanging only a tiny POST containing repo_id and filename.
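A sketch of how small that control message is. The endpoint URL, repo name, and file path below are hypothetical placeholders; only the two field names, `repo_id` and `filename`, come from the description above. The request is built but not sent.

```python
import json
import urllib.request

def build_sync_request(server_url: str, repo_id: str, filename: str) -> urllib.request.Request:
    """Build (but do not send) the POST that tells the rollout server what to fetch."""
    body = json.dumps({"repo_id": repo_id, "filename": filename}).encode("utf-8")
    return urllib.request.Request(
        server_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_sync_request(
    "http://localhost:8000/update_weights",      # hypothetical endpoint
    "my-org/rl-weights",                         # hypothetical bucket id
    "deltas/step_000042.safetensors",            # hypothetical file name
)
print(len(req.data), "bytes of control traffic")
```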
The actual byte transfer happens between each side and the bucket, in parallel, without a shared network fabric.
The rollout server can be in another region, cloud, or behind NAT inside a Hugging Face Space.
N inference replicas can pull the same delta from the same bucket, and Xet deduplicates bytes across them.
The trainer never needs to know the number or location of inference replicas, or if one crashed.
Safetensors is chosen as the on-disk and on-wire format for delta weight sync.
Safetensors is the canonical checkpoint format on the Hub and can be read by any reasonable framework.
Safetensors headers carry arbitrary string metadata, which is where the delta protocol's fields are stored.
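The header is plain JSON, so the metadata can be shown with the standard library alone. The sketch below writes a tiny file in the safetensors layout (an 8-byte little-endian header length, a JSON header with a `__metadata__` map, then raw tensor bytes) and reads the header back. The tensor names and the metadata keys `sparse` and `step` are hypothetical; note that metadata values have to be strings, so a boolean flag is stored as text.

```python
import json
import os
import struct
import tempfile

def write_safetensors(path, tensors, metadata):
    """Write name -> (dtype, shape, raw_bytes) plus string-to-string metadata."""
    header = {"__metadata__": metadata}
    offset = 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    header_bytes += b" " * (-len(header_bytes) % 8)    # pad header to an 8-byte multiple
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte little-endian length
        f.write(header_bytes)
        for _, _, raw in tensors.values():
            f.write(raw)

def read_header(path):
    """Read only the JSON header; tensor bytes are left untouched."""
    with open(path, "rb") as f:
        (length,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(length))

path = os.path.join(tempfile.mkdtemp(), "delta.safetensors")
write_safetensors(
    path,
    {
        "layer.weight.indices": ("I32", [2], struct.pack("<2i", 3, 7)),
        "layer.weight.values": ("BF16", [2], struct.pack("<2H", 0x3F80, 0x4000)),
    },
    {"sparse": "true", "step": "42"},   # hypothetical keys; values must be strings
)
header = read_header(path)
print(header["__metadata__"])
```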
Anchors are normal checkpoints with full bf16 weights, written every N (default 10) syncs.
Deltas store two entries for each changed parameter: a flat int32 tensor of element indices and a bf16 tensor of values.
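A minimal sketch of the receiving side of this layout, with bf16 values represented as raw 16-bit patterns in plain Python lists. `apply_delta` and `delta_bytes` are names invented for this illustration. The size estimate assumes 4 bytes per int32 index and 2 bytes per bf16 value and ignores header overhead.

```python
def apply_delta(flat_weights: list[int], indices: list[int], values: list[int]) -> None:
    """Overwrite flat_weights[i] in place for each (index, value) pair."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    for i, v in zip(indices, values):
        flat_weights[i] = v

def delta_bytes(num_changed: int) -> int:
    """Payload size for one parameter's delta: int32 index + bf16 value per element."""
    return num_changed * (4 + 2)

weights = [0x3F80] * 8              # eight bf16 values of 1.0, as bit patterns
apply_delta(weights, indices=[2, 5], values=[0x4000, 0xBF80])  # 2.0 and -1.0
print([hex(w) for w in weights])

# About 1% of a 0.6B-parameter model changing per step is ~6 million elements.
print(delta_bytes(6_000_000) / 1e6, "MB")
```

At that sparsity the estimate lands in the tens of megabytes, the same order as the payloads reported for Qwen3-0.6B.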
A delta is a file that can be opened with `safe_open(...)` in Python and inspected.
Delta metadata is self-describing, allowing the receiver to branch based on `sparse=True/False`.
Delta files allow zero-copy via mmap on the inference side.
Each new inference replica needs to grab the most recent anchor and then replay the deltas since.
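The bootstrap logic for a fresh replica can be sketched as a pure function over a bucket listing. The `anchors/` and `deltas/` prefixes come from the layout described here; the `step_NNNNNN.safetensors` naming scheme and the function name `plan_bootstrap` are assumptions made for this example.

```python
import re

STEP_RE = re.compile(r"^(anchors|deltas)/step_(\d+)\.safetensors$")

def plan_bootstrap(filenames):
    """Return (latest_anchor, deltas_to_replay_in_order) for a new replica."""
    anchors, deltas = [], []
    for name in filenames:
        match = STEP_RE.match(name)
        if match is None:
            continue  # ignore unrelated files
        kind, step = match.group(1), int(match.group(2))
        (anchors if kind == "anchors" else deltas).append((step, name))
    if not anchors:
        raise ValueError("no anchor found; cannot bootstrap")
    anchor_step, anchor_name = max(anchors)
    replay = [name for step, name in sorted(deltas) if step > anchor_step]
    return anchor_name, replay

listing = [
    "anchors/step_000010.safetensors",
    "deltas/step_000019.safetensors",
    "anchors/step_000020.safetensors",
    "deltas/step_000022.safetensors",
    "deltas/step_000021.safetensors",
]
anchor, replay = plan_bootstrap(listing)
print(anchor, replay)
```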
The trainer uses a `BF16ChangeDetector` with pre-step and post-step optimizer hooks to identify flipped bf16 elements.
The `BF16ChangeDetector` works by snapshotting bf16 weights before an optimizer step, then diffing after.
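The snapshot-then-diff idea can be sketched without any framework. The class below is a hypothetical stand-in, not TRL's `BF16ChangeDetector`: it works on a flat list of bf16 bit patterns and exposes `pre_step` and `post_step` methods that would be called around the optimizer step.

```python
class SnapshotChangeDetector:
    """Hypothetical sketch: record weights before a step, diff them after."""

    def __init__(self):
        self._snapshot = None

    def pre_step(self, weights):
        # Copy the current bit patterns so later in-place updates do not alias.
        self._snapshot = list(weights)

    def post_step(self, weights):
        if self._snapshot is None:
            raise RuntimeError("pre_step must be called before post_step")
        if len(weights) != len(self._snapshot):
            raise ValueError("weight count changed between hooks")
        indices = [i for i, (old, new) in enumerate(zip(self._snapshot, weights))
                   if old != new]
        values = [weights[i] for i in indices]
        self._snapshot = None
        return indices, values

weights = [0x3F80] * 6            # six bf16 values of 1.0
detector = SnapshotChangeDetector()
detector.pre_step(weights)
weights[1] = 0x3F81               # pretend the optimizer nudged two elements
weights[4] = 0x3F7F
indices, values = detector.post_step(weights)
print(indices, [hex(v) for v in values])
```

Comparing bit patterns rather than float values means the mask is exact by construction: an element is reported if and only if its stored bytes changed.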
Predicting the change mask from Adam's m and v statistics resulted in a recall of around 30%.
The analytical bf16 threshold was not tight enough for accurate prediction of the change mask.
The system pays the cost of one bf16 CPU snapshot of the model on the trainer side for change detection.
The `new_sync_weight` flow involves uploading while inference runs, pausing vLLM, signaling/updating weights, and resuming.
Inference was paused for 1.1 seconds during a weight sync, while the total sync time was 9.4 seconds.
The rest of the 9.4-second sync was spent uploading, which happened in the background while the rollout server kept generating tokens.
With NCCL, the full sync time would be paid as pause time.
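The difference between total sync time and pause time can be sketched with a background upload. Everything here is a stand-in: `upload_delta` and `apply_update` are hypothetical functions that only sleep, and the durations are arbitrary. The point is the ordering: generation continues during the upload, and only the final fetch-and-patch step is paid as pause time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def upload_delta():
    time.sleep(0.20)   # stand-in for the slow network upload

def apply_update():
    time.sleep(0.02)   # stand-in for the receiver fetching and patching weights

def sync_weights():
    """Return (total_seconds, paused_seconds, decode_steps_during_upload)."""
    steps_during_upload = 0
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        upload = pool.submit(upload_delta)
        while not upload.done():        # rollout server keeps generating
            time.sleep(0.01)            # stand-in for one decode step
            steps_during_upload += 1
    pause_start = time.perf_counter()   # pause generation
    apply_update()                      # signal the server and update weights
    paused = time.perf_counter() - pause_start   # resume generation
    total = time.perf_counter() - t0
    return total, paused, steps_during_upload

total, paused, steps = sync_weights()
print(f"total={total:.2f}s paused={paused:.2f}s decode_steps_during_upload={steps}")
```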
vLLM has a `WeightTransferEngine` abstraction for receiving weight updates.
A `DeltaWeightTransferEngine` is implemented whose `receive_weights` method downloads delta files, applies patches, and hands full tensors to vLLM.
The `DeltaWeightTransferEngine` registers via vLLM's `--worker-extension-cls` flag, requiring no vLLM fork.
vLLM has an in-flight effort (`vllm-project/vllm#40096`) to land native sparse weight transfer.