CS336 Assignment 5: align language models with supervised fine-tuning and reinforcement learning (expert iteration, GRPO) to improve math reasoning.
language models
cs336
alignment
reinforcement learning
notes