This post introduces Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) to address reward signal reliability challenges in large language model training. The approach uses objective, rule-based feedback and group-aware optimization to improve model performance, particularly in tasks with verifiable outputs like mathematical reasoning. Applied to the GSM8K dataset, the method significantly enhanced the accuracy of a Qwen2.5-0.5B model, demonstrating a 3.7x improvement.
Every atomic assertion extracted from the underlying record, ranked by evidence strength.
Combining Reinforcement Learning with Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) creates a framework where automated rewards guide learning while group-relative optimization helps drive balanced performance.
Reinforcement Learning with Verifiable Rewards (RLVR) addresses reward hacking through rule-based feedback defined by the model tuner.
Performance peaked at 8-shot context (41% accuracy).
The Group Relative Policy Optimization (GRPO)-trained model demonstrated a 3.7x improvement in chain-of-thought mathematical reasoning compared to the base model.
The Reinforcement Learning with Verifiable Rewards (RLVR) approach generalizes to domains with objectively verifiable outputs beyond mathematical reasoning.
Hidden biases, unintended incentives, and ambiguous success criteria can lead to models that behave unpredictably or fail to meet desired objectives.
Reinforcement learning with verifiable rewards (RLVR) introduces verification and transparency into reward signals to improve training performance.
RLVR works best when outputs can be objectively verified for correctness, such as in mathematical reasoning, code generation, or symbolic manipulation tasks.
Techniques like Group Relative Policy Optimization (GRPO) and few-shot examples can further improve results.
This post uses the GSM8K (Grade School Math 8K) dataset to improve math problem-solving accuracy.
Reinforcement Learning (RL) addresses challenges in model training by establishing a structured feedback system through reward signals.
Reinforcement Learning (RL) enables models to learn through interaction, receiving feedback that guides them toward optimal behavior.
Reinforcement Learning (RL) provides a framework for models to iteratively improve their responses based on clearly defined signals about the quality of their outputs.
Reinforcement Learning (RL) is highly effective for training models that interact with users and must adapt their behavior based on outcomes.
When reward functions are imprecise or incomplete, models can engage in "reward hacking," finding unintended ways to maximize scores without achieving the desired behavior.
Reinforcement Learning with Verifiable Rewards (RLVR) uses programmatic reward functions that automatically score outputs against specific criteria.
Reinforcement Learning with Verifiable Rewards (RLVR) enables rapid iteration without the bottleneck of collecting human ratings.
Verifiable rewards come from objective, reproducible rules, making RLVR ideal for evolving requirements.
Models trained with Reinforcement Learning with Verifiable Rewards (RLVR) learn general optimization strategies and adapt quickly to new scenarios.
Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm that improves AI model learning by comparing performance within groups rather than across all data at once.
Group Relative Policy Optimization (GRPO) groups the candidate responses sampled for each prompt and optimizes performance relative to each group's baseline, typically the group's mean reward.
Group Relative Policy Optimization (GRPO)'s group-aware optimization reduces training variance, accelerates convergence, and can produce models that perform consistently across various categories.
Reward functions for different task aspects can be defined, and Group Relative Policy Optimization (GRPO) treats these as distinct groups during training, facilitating simultaneous improvement across dimensions.
The combination of Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) delivers rapid adaptation and robust performance, ideal for dynamic environments requiring generalization beyond training distribution.
Adding few-shot learning enhances the Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) framework in three ways.
Few-shot examples provide templates that show the model what good outputs look like, narrowing the search space for exploration.
Group Relative Policy Optimization (GRPO) leverages few-shot examples by generating multiple candidate responses per prompt and learning from their relative performance within each group.
Verifiable rewards immediately confirm which approaches succeed.
The combination of few-shot examples, Group Relative Policy Optimization (GRPO), and verifiable rewards accelerates learning.
The model starts with concrete examples of the desired format, explores variations efficiently through group-based comparison, and receives definitive feedback on correctness.
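As an illustration of how few-shot examples can act as format templates, here is a minimal sketch of a prompt builder. The function name, the dictionary keys, and the "Question:/Answer:" layout are assumptions made for this example and are not taken from the post; only the final-answer line and the 8-shot setting mirror the claims above.

```python
def build_few_shot_prompt(examples, question, k=8):
    """Prepend up to k worked examples to a new question.

    `examples` is a list of dicts with hypothetical keys
    'question', 'reasoning', and 'answer'.
    """
    parts = []
    for ex in examples[:k]:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Answer: {ex['reasoning']}\n"
            f"#### The final answer is {ex['answer']}"
        )
    # The new question is left open so the model continues in the same format.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


demo_examples = [
    {"question": "What is 2 + 3?", "reasoning": "2 + 3 = 5.", "answer": "5"},
    {"question": "What is 4 * 6?", "reasoning": "4 * 6 = 24.", "answer": "24"},
]
demo_prompt = build_few_shot_prompt(demo_examples, "What is 9 * 2?", k=1)
```

With `k=1`, only the first worked example is included, so the prompt contains a single final-answer line before the open question.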
A Qwen2.5-0.5B model is fine-tuned on SageMaker AI using Amazon SageMaker Training jobs.
Amazon SageMaker Training jobs support distributed multi-GPU and multi-node configurations.
Amazon SageMaker Training jobs allow spinning up high-performance clusters on demand, training billion-parameter models faster, and automatically shutting down resources.
Code generation tasks may require a larger model like Qwen2.5-Coder-7B and consequently larger training instances.
An AWS account is required to run the example from this post on Amazon SageMaker AI.
An AWS Identity and Access Management (IAM) role is required to access SageMaker AI.
The notebook provided can be run from preferred development environments like PyCharm or Visual Studio Code if AWS credentials are set up.
Amazon SageMaker Studio can be used for a straightforward development process on SageMaker AI.
An ml.p4d.24xlarge training instance is needed to follow along with this post's example.
Access to the GitHub repo `https://github.com/aws-samples/amazon-sagemaker-generativeai` is required.
To use SageMaker Studio JupyterLab spaces, launch an ml.t3.medium JupyterLab notebook instance with at least 50 GB of storage.
The fine-tuning job will run on a separate ephemeral training instance with GPU acceleration.
The Group Relative Policy Optimization (GRPO) implementation for mathematical reasoning employs a dual-reward system that provides objective, verifiable feedback during training.
The dual-reward system leverages the inherent verifiability of mathematical problems to create reliable training signals without requiring human annotation or subjective evaluation.
The Format Reward Function checks that the model structures its responses correctly by pattern matching for '#### The final answer is [number]'.
The Format Reward Function awards 0.5 points for proper formatting and 0.0 for incorrect format.
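A format check of this kind might look like the sketch below. The exact regular expression is an assumption; the claims above state only the target pattern and the 0.5 / 0.0 scores, and `format_reward` is a hypothetical name.

```python
import re

# Assumed regex for '#### The final answer is [number]'.
# It tolerates an optional dollar sign, a minus sign, commas, and decimals.
FORMAT_PATTERN = re.compile(r"#### The final answer is\s*\$?-?[\d,]*\.?\d+")


def format_reward(completion: str) -> float:
    """Return 0.5 if the completion contains the expected answer line, else 0.0."""
    return 0.5 if FORMAT_PATTERN.search(completion) else 0.0
```

For example, `format_reward("9 * 2 = 18\n#### The final answer is 18")` scores 0.5, while a completion that simply ends with "The answer is 18" scores 0.0.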
The Correctness Reward Function provides core mathematical verification by extracting numerical answers from formatted responses.
The Correctness Reward Function normalizes answers by removing common formatting characters like commas, currency symbols, and units.
The Correctness Reward Function uses a tolerance of 1e-3 to handle floating-point precision.
The Correctness Reward Function awards 1.0 for correct answers and 0.0 for incorrect ones.
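The extraction, normalization, and tolerance steps can be sketched as follows. This is a minimal illustration under assumptions: the function names, the regex, and the specific normalization rules (dropping commas and dollar signs, then taking the first number so trailing units are ignored) are hypothetical; only the 1e-3 tolerance and the 1.0 / 0.0 scores come from the claims above.

```python
import re

# Assumed pattern: capture everything after the answer marker on the same line.
ANSWER_PATTERN = re.compile(r"#### The final answer is\s*(.+)")


def normalize_number(text: str):
    """Drop commas and dollar signs, then parse the first number found.

    Taking the first number ignores trailing units. Returns None if no number is present.
    """
    cleaned = text.replace(",", "").replace("$", "").strip()
    match = re.search(r"-?\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else None


def correctness_reward(completion: str, reference: str, tol: float = 1e-3) -> float:
    """Return 1.0 if the formatted answer matches the reference within tol, else 0.0."""
    found = ANSWER_PATTERN.search(completion)
    if not found:
        return 0.0  # no formatted answer to extract
    predicted = normalize_number(found.group(1))
    expected = normalize_number(reference)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) <= tol else 0.0
```

Under these rules, a completion ending in "#### The final answer is $1,250 dollars" matches the reference "1250", because normalization reduces both to the same number.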
During training, Group Relative Policy Optimization (GRPO) uses reward functions to compute policy gradients.
The model generates multiple completions for each mathematical problem.
The reward for each response is computed for both reward functions.
The format reward function grants up to 0.5 for proper response structure.
The correctness reward function grants up to 1.0 for the mathematical accuracy of the answer.
A maximum combined reward of 1.5 per completion is possible.
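The per-completion total can be illustrated with a small sketch that sums the scores from a list of reward functions. The two toy reward functions below are simplified stand-ins written for this example (the correct answer 18 is hard-coded), not the functions used in the post.

```python
from typing import Callable, List

RewardFn = Callable[[str], float]


def total_reward(completion: str, reward_fns: List[RewardFn]) -> float:
    """Sum the scores that each reward function assigns to one completion."""
    return sum(fn(completion) for fn in reward_fns)


# Hypothetical stand-ins for the format and correctness rewards.
def toy_format_reward(completion: str) -> float:
    return 0.5 if "#### The final answer is" in completion else 0.0


def toy_correctness_reward(completion: str) -> float:
    return 1.0 if completion.strip().endswith("#### The final answer is 18") else 0.0


reward_fns = [toy_format_reward, toy_correctness_reward]
```

A well-formatted, correct completion reaches the 1.5 maximum, a well-formatted but wrong one earns 0.5, and an unformatted one earns 0.0.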
Group Relative Policy Optimization (GRPO) compares completions within groups to identify the best responses.
In the policy update step, the loss function uses reward differences to update model parameters.
The probability of higher-rewarded completions increases, while the probability of lower-rewarded completions decreases.
This relative ranking drives the optimization process.
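To make the relative ranking concrete, the sketch below computes a group-relative advantage for each completion of one prompt. It assumes the commonly described GRPO formulation, in which each reward is centered on the group mean and scaled by the group standard deviation; the claims above state only that reward differences drive the update, so the normalization details, the `eps` value, and the function name are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Center each reward on the group mean and scale by the group spread.

    `rewards` holds the combined rewards for all completions of one prompt.
    `eps` avoids division by zero when every completion scores the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation, a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one problem: one fully correct, two format-only, one unformatted.
group_rewards = [1.5, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(group_rewards)
```

Completions scoring above the group mean receive positive advantages and those below receive negative ones. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.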