Every atomic assertion extracted from the underlying record, ranked by evidence strength.
The `lean_server_image` uses `projectnumina/kimina-lean-server:2.0.0`.
The `gpu_image` uses `debian_slim` with Python 3.11 and `vllm`, `torch`, `transformers`, `datasets`.
The `orchestrator_image` uses `debian_slim` with Python 3.11 and `requests`, `tqdm`, `numpy`.
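The three image claims above could be declared roughly as follows. This is a configuration sketch, not the record's actual code: the variable names and package lists come from the claims, but pulling the Lean image with `modal.Image.from_registry` is an assumption about how it was loaded, and no version pins are implied.

```python
import modal

# Assumption: the prebuilt Lean server image is pulled from a container registry.
lean_server_image = modal.Image.from_registry(
    "projectnumina/kimina-lean-server:2.0.0"
)

# GPU generation environment: Debian slim, Python 3.11, inference stack.
gpu_image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "vllm", "torch", "transformers", "datasets"
)

# Lightweight coordination environment: no GPU libraries needed.
orchestrator_image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "requests", "tqdm", "numpy"
)
```

Each image would then be attached to the function or sandbox that plays that role, so the orchestrator never has to install the GPU stack.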
Using Modal, a sparse-reward RL workflow was run across three different runtimes without rebuilding the setup.
Running the entire workflow on Modal allowed AE Studio to focus on the experiment instead of infrastructure.
Early results showed ES matched or outperformed GRPO in verified proofs per iteration in several runs.
Modal reduced wasted GPU time by approximately 3.7x compared to less elastic platforms.
Reduced complexity on Modal translated to completing a successful training run in less than two days from project kickoff.
AE Studio wanted to test Evolution Strategies (ES) as an alternative to GRPO.
ES takes an approach inspired by natural selection, creating a "population" of slightly different model versions.
ES tests all versions in the population and then steers the original model toward the best-scoring versions.
Recent research has shown ES can outperform GRPO in some settings.
AE Studio aimed to replicate ES's performance for theorem-proving as a first step to accelerating AI-enabled science.
For a language model to prove a theorem, it needs to generate 'code' in a specialized language like Lean.
The Lean compiler can verify if a generated proof is correct.
Code generation by the LLM is GPU/inference heavy.
Proof verification by the Lean compiler runs on the CPU.
The workload required three different execution environments: GPU for generation, CPU for verification, and a lightweight process for coordination.
A vLLM instance running on GPUs is used for generating proof attempts.
Each proof is sent to a Lean verifier that runs on CPUs and needs to be isolated.
A lightweight process supervises the training loop, sending batches, collecting results, and tracking progress.
Setting up this system from scratch would involve managing multiple server environments, a job scheduling system, storage for model checkpoints, and a robust verification service.
Modal's per-function images allow each step (GPU generation, Lean verification, orchestration) to declare its own environment.
Modal's `.map()` feature enables fanning out many independent evaluations per ES iteration and streaming results.
Modal Sandboxes provide isolated, short-lived Lean servers for each verification batch, preventing failures from affecting the whole run.
Modal Volumes store the original base model weights, allowing GPU workers to load them without repeated downloads from Hugging Face.
Modal Secrets inject credentials into remote functions without leaking them into local shell state.
GPU sandboxes take time to warm up and remote debugging is difficult; both problems are inherent to bursty multi-GPU experiments.
Modal features help reduce cold starts and make remote iteration easier, improving workflow manageability.
The goal was to train a model to be good at math theorem proving using Lean.
Lean provides a verifiable reward for the training loop based on proof correctness.
The training loop was fixed, and only the update rule was changed to compare GRPO and ES performance.
GRPO applies a gradient update based on relative performance within groups of proof attempts.
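As an illustration of "relative performance within groups", the sketch below computes group-normalized advantages in the usual GRPO style. The function name, the `eps` constant, and the 0/1 reward encoding are assumptions for illustration; the record does not give the exact formula used.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each attempt relative to the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every attempt in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of proof attempts for the same theorem (1.0 = Lean accepted the proof).
mixed_group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(mixed_group)

# A group where every attempt failed yields no learning signal at all.
all_failed = group_relative_advantages([0.0, 0.0, 0.0])
```

The all-failed case shows why sparse rewards are hard for this update rule: when no attempt in a group verifies, every advantage is zero.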
ES evaluates a population of perturbed models, scores each by proof success rate, and updates the base model using a weighted combination of perturbations based on rewards.
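A minimal sketch of one ES iteration, assuming a flat list of weights, Gaussian noise, and a mean-reward baseline. The names `perturbation` and `es_step` and the `sigma` and `lr` values are hypothetical, and the toy reward stands in for the proof success rate.

```python
import random

def perturbation(seed, dim):
    """Noise vector fully determined by its seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def es_step(weights, seeds, reward_fn, sigma=0.1, lr=0.05):
    """Evaluate one perturbed model per seed, then move toward the better ones."""
    dim = len(weights)
    rewards = []
    for seed in seeds:
        noise = perturbation(seed, dim)
        candidate = [w + sigma * n for w, n in zip(weights, noise)]
        rewards.append(reward_fn(candidate))
    baseline = sum(rewards) / len(rewards)
    new_weights = list(weights)
    for seed, reward in zip(seeds, rewards):
        # Regenerate the same noise from the seed instead of storing it.
        noise = perturbation(seed, dim)
        scale = lr * (reward - baseline) / (len(seeds) * sigma)
        new_weights = [w + scale * n for w, n in zip(new_weights, noise)]
    return new_weights, rewards

# Toy stand-in for "proof success rate": closeness to a fixed target vector.
target = [1.0, -2.0, 0.5]

def toy_reward(w):
    return -sum((wi - ti) ** 2 for wi, ti in zip(w, target))

weights = [0.0, 0.0, 0.0]
initial_reward = toy_reward(weights)
for iteration in range(50):
    seeds = [iteration * 16 + k for k in range(16)]
    weights, _ = es_step(weights, seeds, toy_reward)
final_reward = toy_reward(weights)
```

No gradients are computed here. Only scalar rewards flow back from the evaluations, which is what makes each population member an independent job.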
Each compute role was given its own image on Modal (GPU, orchestrator, Lean server).
ES was the easiest part of the workload to distribute.
Each perturbation evaluation for ES required the current checkpoint, theorem batch, perturbation seed, and generation parameters.
Modal's `.map()` was used for parallel GPU fan-out in ES.
With ES, weight perturbations are fully determined by their seeds, avoiding expensive weight transfers between GPUs.
Each worker reconstructs the current model from base weights and applies its perturbation.
Verification was the part of the system where isolation mattered most due to potential hangs, crashes, or resource consumption.
Each verification batch ran in its own Modal Sandbox, which started a Lean server, received the proofs, returned the results, and shut down.
Proofs were verified in parallel using a fan-out pattern with `verify_batch_in_sandbox.map()`.
One iteration created 3,840 proof attempts, split into batches of 64.
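The arithmetic behind this claim is that 3,840 attempts in batches of 64 gives 60 batches. The helper below is a hypothetical sketch of the split and does not show how the batches were dispatched.

```python
def split_into_batches(items, batch_size):
    """Chunk a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Placeholder proof attempts; real entries would be Lean source strings.
proof_attempts = [f"attempt-{i}" for i in range(3840)]
batches = split_into_batches(proof_attempts, 64)
print(len(batches))  # 60
```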
Modal's sandbox model allows scaling verification much further, potentially running each proof attempt in its own sandbox.
The entire model state in ES can be described by the base model plus a list of (seed, reward) pairs.
Each perturbation is generated from a deterministic random seed, eliminating the need to store noise vectors.
The orchestrator maintained and passed a running list of seed/reward entries to each worker.
Each history entry for checkpointing was about 200 bytes per iteration.
On the GPU side, workers loaded the original base model from a Modal Volume.
Workers replayed the full history to reconstruct current weights before applying their own perturbation.
The replay process regenerates noise on GPU using deterministic seeds and applies weighted updates.
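The replay idea can be sketched with the standard library alone. This version runs on CPU with plain lists rather than on GPU, and the nested history layout (one list of `(seed, reward)` pairs per iteration), the mean-reward baseline, and the `lr` and `sigma` values are all assumptions for illustration.

```python
import random

def noise_from_seed(seed, dim):
    """Regenerate a perturbation from its seed; nothing else needs storing."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def replay(base_weights, history, lr=0.05, sigma=0.1):
    """Rebuild current weights from base weights plus (seed, reward) history."""
    weights = list(base_weights)
    dim = len(weights)
    for iteration in history:
        baseline = sum(reward for _, reward in iteration) / len(iteration)
        for seed, reward in iteration:
            scale = lr * (reward - baseline) / (len(iteration) * sigma)
            noise = noise_from_seed(seed, dim)
            weights = [w + scale * n for w, n in zip(weights, noise)]
    return weights

base = [0.0, 0.0, 0.0, 0.0]
# Two past iterations, each with two (seed, reward) entries.
history = [[(101, 0.25), (102, 0.75)], [(201, 0.5), (202, 0.0)]]

# Two independent workers arrive at identical weights without exchanging tensors.
worker_a = replay(base, history)
worker_b = replay(base, history)
```

Because every worker derives the same weights from the same small history, the only large artifact that has to be shared is the base model itself.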
The base model lives in a Modal Volume, preventing re-downloading from Hugging Face for each worker.
The full model state traveled as a plain Python list, small enough to pass as a function argument to every remote call.
Modal provided an ideal balance of speed, simplicity, and cost for this experiment.
The Modal implementation required approximately 250 lines of platform setup code.
Similar experiments on other platforms typically require about 600 lines of setup code.
The under-two-day time from project kickoff to a successful training run is 60% less than is typical on alternative platforms.
Faster iteration speed is expected for tweaking and optimizing the training pipeline on Modal.
Runtime efficiency also improved significantly with Modal.