OpenHands Critic Model
OpenHands Critic Model achieves state-of-the-art results on SWE-Bench Verified, offering a new approach to software engineering tasks with inference-time scaling and a dedicated critic model.
Quick Start
If you're eager to try OpenHands:
- Start with OpenHands Cloud: The simplest way to get started is our fully managed cloud solution, which comes with $50 in free credits, seamless GitHub integration, mobile support, and optimizations like context condensation.
- Contribute to Open Source: You can star our GitHub repository, open issues, or send PRs to contribute to open-source AI software development.
- Join Our Community: Connect with us on Slack, read our documentation, and keep up with our latest progress.
✨ Features
SOTA on SWE-Bench Verified
OpenHands has reached a new milestone by achieving state-of-the-art results on SWE-Bench Verified.

Inference-Time Scaling
Our approach leverages the idea of trying multiple solutions for challenging software engineering tasks and picking the best one. The steps are as follows:
- Run the OpenHands agent multiple times with Claude 3.7 Sonnet at sampling temperature 1.0 on each SWE-Bench problem to generate multiple code patches.
- Train a "critic model" to evaluate each solution and predict its quality.
- Filter out code patches that fail regression and reproduction tests.
- Select the solution with the highest score as the final answer.
This method achieves substantially better results without modifying the underlying agent model or scaffold. We observe a log-linear performance improvement from 60.6% with a single trajectory rollout to 66.4% with five attempts, making our submission number one on the leaderboard.
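To make the selection step concrete, here is a minimal sketch of the best-of-N procedure described above. The `critic_score`, `passes_regression`, and `passes_reproduction` callables are hypothetical placeholders for the critic model and the test harness; they are not part of this release.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """One agent rollout: the produced code patch and its serialized trajectory."""
    patch: str
    trajectory: str


def select_best_patch(
    candidates: List[Candidate],
    critic_score: Callable[[str], float],        # hypothetical wrapper around the critic model
    passes_regression: Callable[[str], bool],    # hypothetical regression-test harness
    passes_reproduction: Callable[[str], bool],  # hypothetical issue-reproduction check
) -> Optional[str]:
    """Filter candidates by tests, then return the patch whose trajectory scores highest."""
    # Drop patches that fail regression or reproduction tests.
    survivors = [
        c for c in candidates
        if passes_regression(c.patch) and passes_reproduction(c.patch)
    ]
    # If filtering eliminates every candidate, fall back to the unfiltered pool.
    pool = survivors or candidates
    if not pool:
        return None
    # Pick the candidate whose trajectory the critic scores highest.
    best = max(pool, key=lambda c: critic_score(c.trajectory))
    return best.patch
```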

Dedicated Critic Model
Rather than using a prompt - based reranking strategy, we trained a dedicated critic model. The training process includes:
- Rolling out agent trajectories from [SWE-Gym](https://github.com/SWE-Gym/SWE-Gym) to avoid data leakage.
- Implementing a temporal difference (TD) learning objective to propagate trajectory-level success signals backward through each trajectory.
- Adding a regression head on top of the last layer to predict reward values.
The TD learning objective helps the model understand which actions contributed to the final outcome:
$$
r_t = \gamma r_{t+1}
$$
where $r_t$ is the reward at time step $t$ (i.e., for the $t$-th action produced by the agent) and $\gamma$ is the discount factor; we use $\gamma = 0.99$.
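For illustration, the sketch below shows how a trajectory-level outcome (assumed here to be 1.0 for a resolved issue and 0.0 otherwise) would be propagated backward into per-step reward targets under this rule.

```python
def td_targets(num_steps: int, final_reward: float, gamma: float = 0.99) -> list[float]:
    """Propagate a trajectory-level reward backward: r_t = gamma * r_{t+1}."""
    assert num_steps > 0
    rewards = [0.0] * num_steps
    rewards[-1] = final_reward  # the last step carries the trajectory outcome
    for t in range(num_steps - 2, -1, -1):
        rewards[t] = gamma * rewards[t + 1]
    return rewards


# Example: a resolved issue (final reward 1.0) over a 5-step trajectory.
print(td_targets(5, 1.0))  # [0.96059601, 0.970299, 0.9801, 0.99, 1.0]
```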
We use veRL to finetune [Qwen 2.5 Coder Instruct 32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) as the critic model. The critic model is [publicly available on Hugging Face](https://huggingface.co/all-hands/openhands-critic-32b-exp-20250417) for researchers.
Generalization and Future Improvements
- Genuine usefulness through generalization: A trained critic model could generalize to diverse software engineering scenarios beyond SWE-Bench, making it a valuable tool for real-world coding tasks.
- Use intermediate reward for future improvements: The intermediate rewards predicted throughout each trajectory offer possibilities for enhancing our agent's capabilities, such as one-step lookahead sampling and real-time mistake recovery.
Documentation
SWE-Bench and OpenHands
SWE-bench is a popular benchmark for evaluating large language models' capabilities in addressing real-world software engineering challenges. It consists of issues and corresponding pull requests from 12 popular Python repositories on GitHub. The Verified subset we evaluated on has 500 carefully selected test cases manually reviewed by [human software developers](https://openai.com/index/introducing-swe-bench-verified/).
We're developing the [OpenHands](https://github.com/All-Hands-AI/OpenHands) open-source software development agent; its performance on this benchmark is currently 60.6% with a single rollout.
Why We Built a Critic Model and Where It's Going
We chose to build a trained critic model for the following reasons:
- Genuine usefulness through generalization: Prompt-engineering-based rerankers may not guarantee real-world generalization, while a trained critic model could generalize to diverse scenarios.
- Use intermediate reward for future improvements: Intermediate rewards can be used for one-step lookahead sampling (experimental [PR](https://github.com/All-Hands-AI/OpenHands/pull/7770)) and real-time mistake recovery ([issue](https://github.com/All-Hands-AI/OpenHands/issues/2221)).
🔧 Technical Details
Training the Critic Model
- We roll out agent trajectories from [SWE-Gym](https://github.com/SWE-Gym/SWE-Gym) to avoid data leakage.
- Implement a temporal difference (TD) learning objective:
  - The TD objective propagates trajectory-level success signals backward through each trajectory.
  - The formula is $r_t = \gamma r_{t+1}$, where $r_t$ is the reward at time step $t$ and $\gamma = 0.99$.
- Add a regression head on top of the last layer to predict reward values.
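As a rough illustration of the regression head, the sketch below attaches a scalar linear head to a Hugging Face backbone and predicts a reward per token position. This is only a structural sketch under those assumptions; the actual critic is trained with veRL, and its real architecture and training loop may differ.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class CriticWithRegressionHead(nn.Module):
    """Illustrative critic: a backbone LM with a scalar reward head over the last hidden states."""

    def __init__(self, backbone_name: str = "Qwen/Qwen2.5-Coder-32B-Instruct"):
        super().__init__()
        # In practice a 32B backbone needs sharding/quantization; this sketch ignores that.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                          # (batch, seq_len, hidden_size)
        return self.reward_head(hidden).squeeze(-1)  # (batch, seq_len) predicted rewards


# During training, the per-step predictions would be regressed (e.g., with an MSE loss)
# against the TD targets described above; the exact objective used with veRL is not shown here.
```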
Serving the Critic Model
We use veRL to finetune [Qwen 2.5 Coder Instruct 32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) as the critic model. During inference, we use a [modified version of vLLM](https://github.com/xingyaoww/vllm/tree/add-token-classification-support) to serve this model.
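Since the checkpoint is published with a token-classification pipeline tag, one plausible way to query it outside the modified vLLM is through the standard `transformers` token-classification head, as sketched below. The trajectory serialization the critic expects and the exact score readout are not documented here, so treat both as assumptions.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "all-hands/openhands-critic-32b-exp-20250417"

# Assumption: the checkpoint loads under the standard token-classification head
# with a single reward output; the trajectory text below is a placeholder.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

trajectory_text = "...serialized agent trajectory..."  # placeholder input
inputs = tokenizer(trajectory_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)

# One simple readout: treat the prediction at the final token as the trajectory score.
score = logits[0, -1, 0].float().item()
print(f"critic score: {score:.4f}")
```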
License
This project is released under the MIT license.
⚠️ Important Note
This model is released strictly for research and is not yet compatible with the OpenHands application. For complete information about this model, including its capabilities and limitations, please refer to our [detailed blog post](https://www.all-hands.dev/blog/sota-on-swe-bench-verified-with-inference-time-scaling-and-critic-model).
| Property | Details |
|----------|---------|
| Model Type | OpenHands Critic Model |
| Base Model | Qwen/Qwen2.5-Coder-32B-Instruct |
| Pipeline Tag | token-classification |
| Tags | agent, coding |