# 🚀 Skywork-SWE
Skywork-SWE-32B is a code agent model developed by Skywork AI, specifically designed for software engineering tasks. It shows strong performance on multiple key metrics and is accompanied by an efficient, automated SWE data collection pipeline.
📄 Technical Report | 📝 Blog
## ✨ Features
Skywork-SWE-32B is a code agent model developed by Skywork AI, tailored for software engineering (SWE) tasks. It exhibits strong performance across several key metrics:
- Skywork-SWE-32B achieves 38.0% pass@1 accuracy on the SWE-bench Verified benchmark, outperforming previous open-source state-of-the-art (SoTA) Qwen2.5-Coder-32B-based large language models (LLMs) built on the OpenHands agent framework.
- When integrated with test-time scaling techniques, its performance further improves to 47.0% accuracy, surpassing the previous SoTA results for sub-32B parameter models.
- We clearly demonstrate the data scaling law phenomenon for software engineering capabilities in LLMs, with no signs of saturation at 8209 collected training trajectories.
We also introduce an efficient and automated pipeline for SWE data collection, which results in the creation of the Skywork-SWE dataset: a large-scale, high-quality dataset featuring comprehensive executable runtime environments. Detailed descriptions are available in our technical report.
## 🔧 Model Details
| Property | Details |
|---|---|
| Model Name | Skywork-SWE-32B |
| Backbone LLM | 🤗 Qwen2.5-Coder-32B-Instruct |
| HuggingFace Link | 🤗 Skywork-SWE-32B |
| Technical Report | Technical Report |
| Blog | Blog |
## 📊 Evaluation
*Figure: Data scaling law for pass@1 accuracy on Qwen2.5-Coder-32B-based LLMs using the OpenHands v0.32.0 code agent framework.* Skywork-SWE-32B significantly outperforms previous Qwen2.5-Coder-32B-based LLMs, achieving the highest pass@1 accuracy without using verifiers or multiple rollouts.
With the incorporation of test-time scaling techniques, Skywork-SWE-32B further improves to 47.0% accuracy, surpassing the previous SoTA results for sub-32B parameter models.
## 📈 Performance Summary
**Skywork-SWE-32B**

```text
Submission summary on SWE-bench verified split
==================================================
Resolved 190 instances (38.0%)
==================================================
Resolved by Repository
- astropy/astropy: 4/22 (18.18%)
- django/django: 99/231 (42.86%)
- matplotlib/matplotlib: 9/34 (26.47%)
- mwaskom/seaborn: 0/2 (0.0%)
- pallets/flask: 1/1 (100.0%)
- psf/requests: 4/8 (50.0%)
- pydata/xarray: 7/22 (31.82%)
- pylint-dev/pylint: 2/10 (20.0%)
- pytest-dev/pytest: 9/19 (47.37%)
- scikit-learn/scikit-learn: 17/32 (53.12%)
- sphinx-doc/sphinx: 13/44 (29.55%)
- sympy/sympy: 25/75 (33.33%)
==================================================
Resolved by Time
- 2013: 2/3 (66.67%)
- 2014: 2/2 (100.0%)
- 2015: 0/1 (0.0%)
- 2016: 2/2 (100.0%)
- 2017: 5/16 (31.25%)
- 2018: 7/24 (29.17%)
- 2019: 46/98 (46.94%)
- 2020: 43/108 (39.81%)
- 2021: 27/86 (31.4%)
- 2022: 35/102 (34.31%)
- 2023: 21/58 (36.21%)
```

**Skywork-SWE-32B + TTS (Bo8)**

```text
Submission summary on SWE-bench verified split
==================================================
Resolved 235 instances (47.0%)
==================================================
Resolved by Repository
- astropy/astropy: 8/22 (36.36%)
- django/django: 115/231 (49.78%)
- matplotlib/matplotlib: 15/34 (44.12%)
- mwaskom/seaborn: 0/2 (0.0%)
- pallets/flask: 1/1 (100.0%)
- psf/requests: 3/8 (37.5%)
- pydata/xarray: 14/22 (63.64%)
- pylint-dev/pylint: 4/10 (40.0%)
- pytest-dev/pytest: 10/19 (52.63%)
- scikit-learn/scikit-learn: 22/32 (68.75%)
- sphinx-doc/sphinx: 12/44 (27.27%)
- sympy/sympy: 31/75 (41.33%)
==================================================
Resolved by Time
- 2013: 1/3 (33.33%)
- 2014: 1/2 (50.0%)
- 2015: 0/1 (0.0%)
- 2016: 2/2 (100.0%)
- 2017: 6/16 (37.5%)
- 2018: 9/24 (37.5%)
- 2019: 52/98 (53.06%)
- 2020: 48/108 (44.44%)
- 2021: 40/86 (46.51%)
- 2022: 46/102 (45.1%)
- 2023: 30/58 (51.72%)
```

## 💻 Usage Examples
### Basic Usage
#### Install the vLLM package

```bash
# Install vLLM version 0.9.0.1.
# For example, if your CUDA version is 12.8, use the following command:
pip install vllm==0.9.0.1 --extra-index-url https://download.pytorch.org/whl/cu128
```
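To confirm that the expected build was installed and that your GPUs are visible, a quick sanity check (our suggestion, not part of the original instructions):

```bash
# Print the installed vLLM version and check GPU visibility from PyTorch.
python -c "import vllm; print(vllm.__version__)"
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```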
#### Launch a server to deploy Skywork-SWE-32B

```bash
vllm serve ${MODEL_PATH} --served-model-name ${SERVED_MODEL_NAME} --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.95 --tensor-parallel-size 8
```
⚠️ **Important Note**
Since our model has 32 billion parameters and supports a 32K context length, we recommend launching the model server with at least 2 GPUs equipped with sufficient VRAM to ensure efficient inference.
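Once the server is up, you can sanity-check the OpenAI-compatible endpoint before wiring it into OpenHands. A minimal sketch, assuming the server is reachable on localhost port 8000 and `${SERVED_MODEL_NAME}` matches the value passed to `vllm serve`:

```bash
# Send a single chat completion request to the vLLM server.
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "'"${SERVED_MODEL_NAME}"'",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "temperature": 0.0,
        "max_tokens": 256
      }'
```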
#### Set up the OpenHands framework

```bash
git clone https://github.com/All-Hands-AI/OpenHands.git
cd OpenHands
git checkout tags/0.32.0
make build
```
For reference, see the official OpenHands documentation: SWE-Bench Evaluation with OpenHands SWE-Bench Docker Image.
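Since the evaluation runs each SWE-Bench instance inside a Docker image (as the documentation above describes), it is worth confirming Docker is reachable before launching a rollout; this check is our suggestion, not part of the official setup:

```bash
# The evaluation harness pulls per-instance SWE-Bench Docker images,
# so the Docker daemon must be running and reachable.
docker info > /dev/null 2>&1 && echo "Docker is available" || echo "Docker is not reachable"
```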
#### Create the corresponding config file

Save the following as `config.toml` in the OpenHands repository root:

```toml
[core]
workspace_base="./workspace"
[llm.my-oss-model]
model = "openai/${SERVED_MODEL_NAME}"
base_url = "http://0.0.0.0:8000/v1"
api_key="vllm"
max_message_chars=32768
max_input_tokens=32768
max_output_tokens=8192
log_completions=true
temperature=0.0
```
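Note that TOML does not expand shell variables, so `${SERVED_MODEL_NAME}` above must be replaced with the literal name you passed to `vllm serve`. A minimal sketch, where `skywork-swe-32b` is an example value of ours:

```bash
# Substitute the placeholder with the actual served model name
# (use whatever you passed to --served-model-name).
sed -i "s|\${SERVED_MODEL_NAME}|skywork-swe-32b|" config.toml
grep "^model" config.toml   # should print: model = "openai/skywork-swe-32b"
```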
### Advanced Usage
If you want to run the OpenHands agent with test-time scaling techniques (a Best-of-N method based on the critic model), please refer to the [blog](https://www.all-hands.dev/blog/sota-on-swe-bench-verified-with-inference-time-scaling-and-critic-model) for detailed instructions. You will need to switch to the [feature/llm-critic](https://github.com/All-Hands-AI/OpenHands/tree/feature/llm-critic) branch and deploy the [critic model](https://huggingface.co/all-hands/openhands-critic-32b-exp-20250417) accordingly. Additionally, you need to add the following parameters to the configuration file:

```toml
use_critic=true
critic_model="critic_model"
critic_base_url="**********"
critic_api_key="************"
critic_num_candidates=2
```
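For the critic itself, one plausible setup is a second vLLM instance alongside the policy model; the port, path, and served name below are our placeholders, not values from the blog, and `critic_base_url` / `critic_api_key` in the configuration above should then point at this server:

```bash
# Serve the critic model on a separate port so OpenHands can reach
# both the policy model (port 8000) and the critic (port 8001).
# CRITIC_MODEL_PATH is a placeholder for a local download of
# all-hands/openhands-critic-32b-exp-20250417.
vllm serve ${CRITIC_MODEL_PATH} \
  --served-model-name critic_model \
  --host 0.0.0.0 --port 8001 \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 8
```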
#### Rollout on SWE-Bench Instances
```bash
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

# Example
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh llm.my-oss-model HEAD CodeActAgent 500 100 1 princeton-nlp/SWE-bench_Verified test
```
#### Evaluate the generated patches
```bash
./evaluation/benchmarks/swe_bench/scripts/eval_infer.sh \
  ./evaluation_outputs/outputs/princeton-nlp__SWE-bench_Verified-test/CodeActAgent/my-oss-model_maxiter_100_N_v0.32.0-no-hint-run_1/output.jsonl
```
## 🙏 Acknowledgements
We would like to thank the contributors of the [OpenHands](https://github.com/All-Hands-AI/OpenHands) and [AllHands Critic](https://huggingface.co/all-hands/openhands-critic-32b-exp-20250417) repositories for their open research and valuable contributions.
## 📜 License
The model is licensed under the Apache 2.0 license.
## 📖 Citation
If you use Skywork-SWE in your research, please consider citing our work using the following BibTeX entry:
```bibtex
@misc{skywork-swe,
  title={Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs},
  author={Liang Zeng and Yongcong Li and Yuzhen Xiao and Changshi Li and Chris Yuhao Liu and Rui Yan and Tianwen Wei and Jujie He and Xuchen Song and Yang Liu and Yahui Zhou},
  howpublished={\url{https://quixotic-sting-239.notion.site/eb17f379610040ceb54da5d5d24065bd}},
  note={Notion Blog},
  year={2025}
}
```

