OpenHands Critic Model
OpenHands Critic Model achieves state-of-the-art results on SWE-Bench Verified, offering a new approach to software engineering tasks with inference-time scaling and a dedicated critic model.
Quick Start
If you're eager to try OpenHands:
- Start with OpenHands Cloud: The simplest way to get started is our fully managed cloud solution, which comes with $50 in free credits, seamless GitHub integration, mobile support, and optimizations like context condensation.
- Contribute to Open Source: You can star our GitHub repository, open issues, or send PRs to contribute to open-source AI software development.
- Join Our Community: Connect with us on Slack, read our documentation, and keep up with our latest progress.
✨ Features
SOTA on SWE-Bench Verified
OpenHands has reached a new milestone by achieving state-of-the-art results on SWE-Bench Verified.

Inference-Time Scaling
Our approach leverages the idea of trying multiple solutions for challenging software engineering tasks and picking the best one. The steps are as follows:
- Run the OpenHands agent multiple times with Claude 3.7 Sonnet at sampling temperature 1.0 on each SWE-Bench problem to generate multiple code patches.
- Train a "critic model" to evaluate each solution and predict its quality.
- Filter out code patches that fail regression and reproduction tests.
- Select the solution with the highest score as the final answer.
This method achieves substantially better results without modifying the underlying agent model or scaffold. We observe a log-linear performance improvement from 60.6% with a single trajectory rollout to 66.4% with five attempts, making our submission number one on the leaderboard.
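To make the selection step concrete, here is a minimal sketch of the best-of-N procedure described above. The `critic_score`, `passes_regression`, and `passes_reproduction` callables are hypothetical placeholders for the critic model and the test harness; they are not part of this release.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """One agent rollout: the produced code patch and its serialized trajectory."""
    patch: str
    trajectory: str


def select_best_patch(
    candidates: List[Candidate],
    critic_score: Callable[[str], float],        # hypothetical wrapper around the critic model
    passes_regression: Callable[[str], bool],    # hypothetical regression-test harness
    passes_reproduction: Callable[[str], bool],  # hypothetical issue-reproduction check
) -> Optional[str]:
    """Filter candidates by tests, then return the patch whose trajectory scores highest."""
    # Drop patches that fail regression or reproduction tests.
    survivors = [
        c for c in candidates
        if passes_regression(c.patch) and passes_reproduction(c.patch)
    ]
    # If filtering eliminates every candidate, fall back to the unfiltered pool.
    pool = survivors or candidates
    if not pool:
        return None
    # Pick the candidate whose trajectory the critic scores highest.
    best = max(pool, key=lambda c: critic_score(c.trajectory))
    return best.patch
```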

Dedicated Critic Model
Rather than using a prompt - based reranking strategy, we trained a dedicated critic model. The training process includes:
- Rolling out agent trajectories from [SWE-Gym](https://github.com/SWE-Gym/SWE-Gym) to avoid data leakage.
- Implementing a temporal difference (TD) learning objective to propagate trajectory-level success signals backward through each trajectory.
- Adding a regression head on top of the last layer to predict reward values.
The TD learning objective helps the model understand which actions contributed to the final outcome:
$$
r_t = \gamma r_{t+1}
$$
where $r_t$ is the reward at time step $t$ (i.e., for the $t$-th action produced by the agent) and $\gamma$ is the discount factor; we use $\gamma = 0.99$.
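For illustration, the sketch below shows how a trajectory-level outcome (assumed here to be 1.0 for a resolved issue and 0.0 otherwise) would be propagated backward into per-step reward targets under this rule.

```python
def td_targets(num_steps: int, final_reward: float, gamma: float = 0.99) -> list[float]:
    """Propagate a trajectory-level reward backward: r_t = gamma * r_{t+1}."""
    assert num_steps > 0
    rewards = [0.0] * num_steps
    rewards[-1] = final_reward  # the last step carries the trajectory outcome
    for t in range(num_steps - 2, -1, -1):
        rewards[t] = gamma * rewards[t + 1]
    return rewards


# Example: a resolved issue (final reward 1.0) over a 5-step trajectory.
print(td_targets(5, 1.0))  # [0.96059601, 0.970299, 0.9801, 0.99, 1.0]
```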
We use veRL to finetune [Qwen 2.5 Coder Instruct 32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) as the critic model. The critic model is [publicly available on Hugging Face](https://huggingface.co/all-hands/openhands-critic-32b-exp-20250417) for researchers.
Generalization and Future Improvements
- Genuine usefulness through generalization: A trained critic model could generalize to diverse software engineering scenarios beyond SWE-Bench, making it a valuable tool for real-world coding tasks.
- Use intermediate reward for future improvements: The intermediate rewards predicted throughout each trajectory offer possibilities for enhancing our agent's capabilities, such as one-step lookahead sampling and real-time mistake recovery.
Documentation
SWE-Bench and OpenHands
SWE-bench is a popular benchmark for evaluating large language models' capabilities in addressing real-world software engineering challenges. It consists of issues and corresponding pull requests from 12 popular Python repositories on GitHub. The Verified subset we evaluated on has 500 carefully selected test cases manually reviewed by [human software developers](https://openai.com/index/introducing-swe-bench-verified/).
We're developing the [OpenHands](https://github.com/All-Hands-AI/OpenHands) open-source software development agent; its performance on this benchmark is currently 60.6% with a single rollout.
Why We Built a Critic Model and Where It's Going
We chose to build a trained critic model for the following reasons:
- Genuine usefulness through generalization: Prompt-engineering-based rerankers may not guarantee real-world generalization, while a trained critic model could generalize to diverse scenarios.
- Use intermediate reward for future improvements: Intermediate rewards can be used for one-step lookahead sampling (experimental [PR](https://github.com/All-Hands-AI/OpenHands/pull/7770)) and real-time mistake recovery ([issue](https://github.com/All-Hands-AI/OpenHands/issues/2221)).
🔧 Technical Details
Training the Critic Model
- We roll out agent trajectories from [SWE-Gym](https://github.com/SWE-Gym/SWE-Gym) to avoid data leakage.
- Implement a temporal difference (TD) learning objective:
  - The TD objective propagates trajectory-level success signals backward through each trajectory.
  - The formula is $r_t = \gamma r_{t+1}$, where $r_t$ is the reward at time step $t$ and $\gamma = 0.99$.
- Add a regression head on top of the last layer to predict reward values.
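As a rough illustration of the regression head, the sketch below attaches a scalar linear head to a Hugging Face backbone and predicts a reward per token position. This is only a structural sketch under those assumptions; the actual critic is trained with veRL, and its real architecture and training loop may differ.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class CriticWithRegressionHead(nn.Module):
    """Illustrative critic: a backbone LM with a scalar reward head over the last hidden states."""

    def __init__(self, backbone_name: str = "Qwen/Qwen2.5-Coder-32B-Instruct"):
        super().__init__()
        # In practice a 32B backbone needs sharding/quantization; this sketch ignores that.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                          # (batch, seq_len, hidden_size)
        return self.reward_head(hidden).squeeze(-1)  # (batch, seq_len) predicted rewards


# During training, the per-step predictions would be regressed (e.g., with an MSE loss)
# against the TD targets described above; the exact objective used with veRL is not shown here.
```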
Serving the Critic Model
We use veRL to finetune [Qwen 2.5 Coder Instruct 32B](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) as the critic model. During inference, we use a [modified version of vLLM](https://github.com/xingyaoww/vllm/tree/add-token-classification-support) to serve this model.
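Since the checkpoint is published with a token-classification pipeline tag, one plausible way to query it outside the modified vLLM is through the standard `transformers` token-classification head, as sketched below. The trajectory serialization the critic expects and the exact score readout are not documented here, so treat both as assumptions.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "all-hands/openhands-critic-32b-exp-20250417"

# Assumption: the checkpoint loads under the standard token-classification head
# with a single reward output; the trajectory text below is a placeholder.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

trajectory_text = "...serialized agent trajectory..."  # placeholder input
inputs = tokenizer(trajectory_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)

# One simple readout: treat the prediction at the final token as the trajectory score.
score = logits[0, -1, 0].float().item()
print(f"critic score: {score:.4f}")
```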
License
This project is released under the MIT license.
⚠️ Important Note
This model is released strictly for research and is not yet compatible with the OpenHands application. For complete information about this model, including its capabilities and limitations, please refer to our [detailed blog post](https://www.all-hands.dev/blog/sota-on-swe-bench-verified-with-inference-time-scaling-and-critic-model).
| Property | Details |
|----------|---------|
| Model Type | OpenHands Critic Model |
| Base Model | Qwen/Qwen2.5-Coder-32B-Instruct |
| Pipeline Tag | token-classification |
| Tags | agent, coding |