“Reinforcement learning via self-play” is key to reasoning in language models

Tanveer February 14, 2025


A team of researchers from MIT, Cornell University, the University of Washington and Microsoft Research has developed a framework called “Reinforcement Learning via Self-Play” (RLSP) that teaches large language models to spend more time working through problems. The approach mirrors techniques used in successful reasoning models such as OpenAI’s o1 and o3, DeepSeek’s R1, and Google’s Gemini.

RLSP works in three stages: First, the model learns from examples of human or AI reasoning (SFT). Then, it’s rewarded for exploring different approaches to problems (RL). Finally, the system checks answers to ensure accuracy and prevent shortcuts (Verifier).
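The verifier stage can be illustrated with a minimal sketch. This is not the team's implementation: the `Answer:` marker, the exact string comparison, and the function names are assumptions chosen for illustration. The idea is only that credit depends on the final answer matching a reference, so a response cannot earn it by skipping the answer or merely mentioning the right number along the way.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is a hypothetical output format for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def verify(response: str, reference: str) -> bool:
    """Accept a response only if its final stated answer equals the reference."""
    answer = extract_final_answer(response)
    return answer is not None and answer == reference.strip()


if __name__ == "__main__":
    response = "Try 3: 3 + 3 = 6, too big. Backtrack. Try 2: 2 + 2 = 4.\nAnswer: 2"
    print(verify(response, "2"))
    print(verify("It is probably 2.", "2"))
```

A real verifier for math benchmarks would also need to normalize equivalent forms (for example `0.5` and `1/2`), which this sketch deliberately leaves out.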

Testing shows promising results. When applied to Llama models, RLSP improved scores on the MATH 500 dataset by 23%. The Qwen2.5-32B-Instruct model saw a 10% boost on AIME 2024 math problems. Even with basic rewards for showing work, the models developed interesting behaviors like backtracking, exploring multiple solutions, and double-checking their answers.

These results are largely in line with findings reported by the team behind DeepSeek R1 and R1-Zero, and more recently by researchers from IN.AI, Tsinghua University and Carnegie Mellon University.

RLSP overview: in this presentation, the team shows that RLSP does not yet lead to higher forms of reasoning in their experiments.

The most notable finding isn’t just better test scores – it’s how the models learn to solve problems. Even without specific training examples, but with small rewards for exploration, the models developed several useful behaviors across different types of problems.

Why RLSP works

The researchers think they know why this works: Recent studies show that “chain-of-thought” reasoning – where models write out their thinking step by step – gives them more computational power to solve problems. RLSP encourages models to create new reasoning paths through “self-play,” similar to how AI learned to master games like chess and Go.

The reward system encourages models to show all their work, even when some approaches don’t lead to the right answer. When a model finds the correct solution through a longer reasoning process, it gets full credit. This generates new examples of step-by-step reasoning that help the model improve.
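The reward described above can be sketched as an outcome term plus a small exploration bonus. The article does not give the formula, so everything specific here is an assumption: the bonus is modelled as capped response length, and `exploration_weight` and `max_tokens` are hypothetical parameters, not values from the researchers.

```python
from typing import Sequence


def rlsp_reward(
    response_tokens: Sequence[str],
    is_correct: bool,
    exploration_weight: float = 0.1,  # hypothetical: keeps the bonus small
    max_tokens: int = 1024,           # hypothetical: cap so length cannot dominate
) -> float:
    """Toy reward: full credit for a verified answer, small bonus for shown work."""
    # Outcome term: 1.0 only when a verifier has accepted the final answer.
    outcome = 1.0 if is_correct else 0.0
    # Exploration term: fraction of the token budget used, capped at 1.0.
    exploration = min(len(response_tokens), max_tokens) / max_tokens
    return outcome + exploration_weight * exploration


if __name__ == "__main__":
    short_correct = rlsp_reward(["tok"] * 64, is_correct=True)
    long_correct = rlsp_reward(["tok"] * 512, is_correct=True)
    long_wrong = rlsp_reward(["tok"] * 1024, is_correct=False)
    print(short_correct, long_correct, long_wrong)
```

With a weight below 1.0, even a maximally long incorrect response scores lower than any correct one, so in this sketch length alone cannot substitute for a right answer.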

Questions for future research

The team notes several remaining challenges. They want to know how models could adjust their thinking time based on problem difficulty – spending less time on simple math and more on complex proofs. They’re also curious about how context length affects reasoning and whether these behaviors truly go beyond what models see in training data.

Other open questions include whether pure reinforcement learning without exploration rewards could improve reasoning in larger models, and what additional training methods might help models develop higher-level thinking skills like forming theorems and tackling open-ended problems.


