
Building an RL Theorem-Proving Workflow on Modal

modal-labs-blog · engineering-technology · Apr 29, 2026
Claims: 99
Domain: engineering-technology
Reading time: 7 min
Record: Building an RL Theorem-Proving Workflow on Modal

Claims from this story

Every atomic assertion extracted from the underlying record, ranked by evidence strength.

The `lean_server_image` uses `projectnumina/kimina-lean-server:2.0.0`.

direct_quote · stated · engineering-technology · Apr 29, 2026

The `gpu_image` uses `debian_slim` with Python 3.11 and `vllm`, `torch`, `transformers`, `datasets`.

direct_quote · stated · engineering-technology · Apr 29, 2026

The `orchestrator_image` uses `debian_slim` with Python 3.11 and `requests`, `tqdm`, `numpy`.

direct_quote · stated · engineering-technology · Apr 29, 2026
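The three image claims above could correspond to definitions along the following lines. This is a sketch rather than the authors' code: the variable names and package lists come from the claims, while building the Lean server image with `modal.Image.from_registry` and installing the Python packages with `pip_install` are assumptions about how the images were declared.

```python
import modal

# Lean verification server, pulled from a registry image (assumed construction).
lean_server_image = modal.Image.from_registry(
    "projectnumina/kimina-lean-server:2.0.0"
)

# GPU generation environment: Python 3.11 plus the inference stack.
gpu_image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "vllm", "torch", "transformers", "datasets"
)

# Lightweight coordination environment.
orchestrator_image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "requests", "tqdm", "numpy"
)
```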

Using Modal, a sparse-reward RL workflow was run across three different runtimes without rebuilding the setup.

paraphrase · stated · engineering-technology · Apr 29, 2026

Running the entire workflow on Modal allowed AE Studio to focus on the experiment instead of infrastructure.

paraphrase · stated · engineering-technology · Apr 29, 2026

Early results showed ES matched or outperformed GRPO in verified proofs per iteration in several runs.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal reduced wasted GPU time by approximately 3.7x compared to less elastic platforms.

paraphrase · stated · engineering-technology · Apr 29, 2026

Reduced complexity on Modal translated to completing a successful training run in less than two days from project kickoff.

paraphrase · stated · engineering-technology · Apr 29, 2026

AE Studio wanted to test Evolution Strategies (ES) as an alternative to GRPO.

paraphrase · stated · engineering-technology · Apr 29, 2026

ES takes an approach inspired by natural selection, creating a "population" of slightly different model versions.

paraphrase · stated · engineering-technology · Apr 29, 2026

ES tests all versions in the population and then steers the original model toward the best-scoring versions.

paraphrase · stated · engineering-technology · Apr 29, 2026

Recent research has shown ES can outperform GRPO in some settings.

paraphrase · stated · engineering-technology · Apr 29, 2026

AE Studio aimed to replicate ES's performance for theorem-proving as a first step to accelerating AI-enabled science.

paraphrase · stated · engineering-technology · Apr 29, 2026

For a language model to prove a theorem, it needs to generate 'code' in a specialized language like Lean.

paraphrase · stated · engineering-technology · Apr 29, 2026

The Lean compiler can verify if a generated proof is correct.

paraphrase · stated · engineering-technology · Apr 29, 2026
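As a small illustration of what it means for the compiler to verify a proof, here are two Lean 4 statements with proofs that type-check. Changing either statement to something false (for example `2 + 2 = 5`) would make compilation fail, which is the pass/fail signal used as a reward. The theorems are illustrative only and are not taken from the experiment.

```lean
-- Lean checks this proof by evaluating both sides of the equation.
theorem two_add_two : 2 + 2 = 4 := rfl

-- A more general statement, proved by appealing to a core library lemma.
theorem add_zero_example (n : Nat) : n + 0 = n := Nat.add_zero n
```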

Code generation by the LLM is GPU/inference heavy.

paraphrase · stated · engineering-technology · Apr 29, 2026

Proof verification by the Lean compiler runs on the CPU.

paraphrase · stated · engineering-technology · Apr 29, 2026

The workload required three different execution environments: GPU for generation, CPU for verification, and a lightweight process for coordination.

paraphrase · stated · engineering-technology · Apr 29, 2026

A vLLM instance running on GPUs is used for generating proof attempts.

paraphrase · stated · engineering-technology · Apr 29, 2026

Each proof is sent to a Lean verifier that runs on CPUs and needs to be isolated.

paraphrase · stated · engineering-technology · Apr 29, 2026

A lightweight process supervises the training loop, sending batches, collecting results, and tracking progress.

paraphrase · stated · engineering-technology · Apr 29, 2026

Setting up this system from scratch would involve managing multiple server environments, a job scheduling system, storage for model checkpoints, and a robust verification service.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal's per-function images allow each step (GPU generation, Lean verification, orchestration) to declare its own environment.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal's `.map()` feature enables fanning out many independent evaluations per ES iteration and streaming results.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal Sandboxes provide isolated, short-lived Lean servers for each verification batch, preventing failures from affecting the whole run.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal Volumes store the original base model weights, allowing GPU workers to load them without repeated downloads from Hugging Face.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal Secrets inject credentials into remote functions without leaking them into local shell state.

paraphrase · stated · engineering-technology · Apr 29, 2026

GPU sandboxes take time to warm up and remote debugging is difficult; both are inherent challenges in bursty multi-GPU experiments.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal features help reduce cold starts and make remote iteration easier, improving workflow manageability.

paraphrase · stated · engineering-technology · Apr 29, 2026

The goal was to train a model to be good at math theorem proving using Lean.

paraphrase · stated · engineering-technology · Apr 29, 2026

Lean provides a verifiable reward for the training loop based on proof correctness.

paraphrase · stated · engineering-technology · Apr 29, 2026

The training loop was held fixed; only the update rule was swapped in order to compare GRPO and ES performance.

paraphrase · stated · engineering-technology · Apr 29, 2026

GRPO applies a gradient update based on relative performance within groups of proof attempts.

paraphrase · stated · engineering-technology · Apr 29, 2026
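The "relative performance within groups" idea in the GRPO claim above can be sketched as a per-group normalization of rewards. This is a minimal illustration of the commonly described group-relative advantage, not the experiment's implementation; the function name, the `eps` constant, and the 0/1 reward encoding are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each attempt relative to the other attempts in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every attempt gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of proof attempts for a single theorem:
# 1.0 means the Lean check passed, 0.0 means it failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# With a sparse reward, a group in which every attempt fails carries no signal.
no_signal = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second call shows why sparse rewards are awkward for this kind of update: when no attempt in a group verifies, every advantage is zero.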

ES evaluates a population of perturbed models, scores each by proof success rate, and updates the base model using a weighted combination of perturbations based on rewards.

paraphrase · stated · engineering-technology · Apr 29, 2026
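The ES update described above can be sketched on a toy parameter vector. This is a simplified illustration under stated assumptions: plain Python lists stand in for model weights, `reward_fn` stands in for the proof success rate, and the mean-reward baseline, `sigma`, and `lr` are choices made for the sketch rather than values from the source.

```python
import random


def perturbation(seed, dim, sigma):
    """Noise vector that is fully determined by its seed."""
    rng = random.Random(seed)
    return [sigma * rng.gauss(0.0, 1.0) for _ in range(dim)]


def es_step(theta, seeds, reward_fn, sigma=0.1, lr=0.5):
    """One ES iteration: score perturbed copies, then combine their noise by reward."""
    dim = len(theta)
    rewards = []
    for seed in seeds:
        eps = perturbation(seed, dim, sigma)
        candidate = [t + e for t, e in zip(theta, eps)]
        rewards.append(reward_fn(candidate))

    # Weight each perturbation by how much better it scored than average.
    baseline = sum(rewards) / len(rewards)
    new_theta = list(theta)
    for seed, reward in zip(seeds, rewards):
        eps = perturbation(seed, dim, sigma)  # regenerated from the seed, not stored
        weight = lr * (reward - baseline) / len(seeds)
        new_theta = [t + weight * e for t, e in zip(new_theta, eps)]
    return new_theta, rewards


# Toy reward: prefer a larger first coordinate.
theta, rewards = es_step([0.0, 0.0, 0.0], seeds=[1, 2, 3, 4], reward_fn=lambda c: c[0])
```

Note that the noise is regenerated from the seed in the update loop instead of being kept in memory, which mirrors the seed-based design described in the later claims.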

Each compute role was given its own image on Modal (GPU, orchestrator, Lean server).

paraphrase · stated · engineering-technology · Apr 29, 2026

ES was the easiest part of the workload to distribute.

paraphrase · stated · engineering-technology · Apr 29, 2026

Each perturbation evaluation for ES required the current checkpoint, theorem batch, perturbation seed, and generation parameters.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal's `.map()` was used for parallel GPU fan-out in ES.

paraphrase · stated · engineering-technology · Apr 29, 2026
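The fan-out pattern can be tried locally with the standard library before moving it to remote workers. The sketch below uses `ThreadPoolExecutor.map` as a stand-in for Modal's `.map()`; the `EvalTask` fields follow the four inputs listed in the claims, but the class, the function names, and the placeholder scoring are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTask:
    checkpoint: tuple      # stand-in for the current checkpoint description
    theorem_batch: tuple   # theorems to attempt
    seed: int              # perturbation seed
    temperature: float     # one example of a generation parameter


def evaluate_perturbation(task):
    """Hypothetical worker: a real one would generate proofs and verify them."""
    # Placeholder reward so the sketch runs locally; deterministic per seed.
    reward = random.Random(task.seed).random()
    return task.seed, reward


def fan_out(tasks, max_workers=4):
    """Evaluate independent tasks in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_perturbation, tasks))


tasks = [
    EvalTask(checkpoint=(), theorem_batch=("t1", "t2", "t3"), seed=s, temperature=1.0)
    for s in range(8)
]
results = fan_out(tasks)
```

Because each task carries everything a worker needs, the tasks are independent of one another, which is what makes this part of the workload easy to distribute.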

With ES, weight perturbations are fully determined by their seeds, avoiding expensive weight transfers between GPUs.

paraphrase · stated · engineering-technology · Apr 29, 2026

Each worker reconstructs the current model from base weights and applies its perturbation.

paraphrase · stated · engineering-technology · Apr 29, 2026

Verification was the part of the system where isolation mattered most due to potential hangs, crashes, or resource consumption.

paraphrase · stated · engineering-technology · Apr 29, 2026

A Modal Sandbox was used for each verification batch: it started a Lean server, the proofs were sent to it, the results were collected, and the sandbox was shut down.

paraphrase · stated · engineering-technology · Apr 29, 2026
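The value of isolating each verification batch can be shown with a local analogue in which every batch runs in its own short-lived process with a timeout. This is not the Modal Sandbox API; `verify_batch_isolated` and the three sample "batches" are hypothetical stand-ins for a healthy, a crashing, and a hanging verifier.

```python
import subprocess
import sys


def verify_batch_isolated(checker_source, timeout_s=5.0):
    """Run one batch in a separate process so a failure cannot take down the caller."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", checker_source],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs past the deadline
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": ""}
    if proc.returncode != 0:
        return {"status": "crashed", "output": proc.stderr.strip()}
    return {"status": "ok", "output": proc.stdout.strip()}


batches = [
    "print('verified')",            # healthy verifier
    "raise SystemExit(3)",          # verifier that crashes
    "import time; time.sleep(30)",  # verifier that hangs
]
results = [verify_batch_isolated(src, timeout_s=3.0) for src in batches]
```

Each failure is reported as a status for that batch only, so the training loop can record a zero reward or retry without restarting the run.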

Proofs were verified in parallel using a fan-out pattern with `verify_batch_in_sandbox.map()`.

paraphrase · stated · engineering-technology · Apr 29, 2026

One iteration created 3,840 proof attempts, split into batches of 64.

paraphrase · stated · engineering-technology · Apr 29, 2026
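The batch arithmetic in the claim above works out to 3,840 / 64 = 60 verification batches per iteration. A minimal sketch of the split, with a hypothetical helper name and placeholder attempt labels:

```python
def split_into_batches(items, batch_size):
    """Chunk a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


proof_attempts = [f"attempt-{i}" for i in range(3840)]
batches = split_into_batches(proof_attempts, 64)
print(len(batches))  # 3840 / 64 = 60
```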

Modal's sandbox model allows scaling verification much further, potentially running each proof attempt in its own sandbox.

paraphrase · stated · engineering-technology · Apr 29, 2026

The entire model state in ES can be described by the base model plus a list of (seed, reward) pairs.

paraphrase · stated · engineering-technology · Apr 29, 2026

Each perturbation is generated from a deterministic random seed, eliminating the need to store noise vectors.

paraphrase · stated · engineering-technology · Apr 29, 2026

The orchestrator maintained and passed a running list of seed/reward entries to each worker.

paraphrase · stated · engineering-technology · Apr 29, 2026

For checkpointing, each iteration added a history entry of about 200 bytes.

paraphrase · stated · engineering-technology · Apr 29, 2026
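To put the figure of about 200 bytes per iteration in perspective, the back-of-the-envelope calculation below estimates how the history grows. The iteration counts are arbitrary examples and the helper name is hypothetical; only the 200-byte figure comes from the source.

```python
BYTES_PER_ENTRY = 200  # approximate per-iteration figure from the write-up


def history_size_bytes(iterations, bytes_per_entry=BYTES_PER_ENTRY):
    """Estimated size of the seed/reward history after a number of iterations."""
    return iterations * bytes_per_entry


for n in (10, 100, 1000):
    print(f"{n} iterations: about {history_size_bytes(n) / 1000:.0f} kB")
```

Even after a thousand iterations the estimate stays around 200 kB, which is tiny next to a set of model weights.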

On the GPU side, workers loaded the original base model from a Modal Volume.

paraphrase · stated · engineering-technology · Apr 29, 2026

Workers replayed the full history to reconstruct current weights before applying their own perturbation.

paraphrase · stated · engineering-technology · Apr 29, 2026

The replay process regenerates noise on GPU using deterministic seeds and applies weighted updates.

paraphrase · stated · engineering-technology · Apr 29, 2026
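The replay idea can be sketched as follows: starting from the base weights, each past iteration's noise is regenerated from its seeds and applied with a reward-dependent weight. This is a toy version under stated assumptions. Lists of floats stand in for model tensors, the history is grouped as one list of (seed, reward) pairs per iteration, and the mean-reward baseline and `lr` value are choices made for the sketch.

```python
import random


def noise(seed, dim):
    """Regenerate a noise vector from its seed; nothing else needs to be stored."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def replay(base_weights, history, lr=0.01):
    """Rebuild the current weights from base weights plus the seed/reward history."""
    weights = list(base_weights)
    for iteration in history:  # one list of (seed, reward) pairs per iteration
        baseline = sum(reward for _, reward in iteration) / len(iteration)
        for seed, reward in iteration:
            eps = noise(seed, len(weights))
            scale = lr * (reward - baseline) / len(iteration)
            weights = [w + scale * e for w, e in zip(weights, eps)]
    return weights


base = [0.0, 1.0, -1.0]
history = [[(1, 1.0), (2, 0.0)], [(3, 0.5), (4, 0.25)]]
current = replay(base, history)
```

Because the replay is deterministic, any two workers given the same base weights and the same history arrive at identical current weights without exchanging tensors.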

The base model lives in a Modal Volume, preventing re-downloading from Hugging Face for each worker.

paraphrase · stated · engineering-technology · Apr 29, 2026

The full model state traveled as a plain Python list, small enough to pass as a function argument to every remote call.

paraphrase · stated · engineering-technology · Apr 29, 2026

Modal provided an ideal balance of speed, simplicity, and cost for this experiment.

paraphrase · stated · engineering-technology · Apr 29, 2026

The Modal implementation required approximately 250 lines of platform setup code.

paraphrase · stated · engineering-technology · Apr 29, 2026

Similar experiments on other platforms typically require about 600 lines of setup code.

paraphrase · stated · engineering-technology · Apr 29, 2026

The time to complete a successful training run was 60% less than that seen on alternative platforms.

paraphrase · stated · engineering-technology · Apr 29, 2026

Faster iteration speed is expected for tweaking and optimizing the training pipeline on Modal.

paraphrase · stated · engineering-technology · Apr 29, 2026

Runtime efficiency also improved significantly with Modal.

paraphrase · stated · engineering-technology · Apr 29, 2026