🚀 Skywork o1 Open Model Series
The Skywork o1 Open model series, developed by the Skywork team at Kunlun Inc., introduces models with o1-like slow-thinking and reasoning capabilities, aiming to advance AI reasoning tasks.
✨ Features
- Innovative Reasoning: Incorporates o1-like slow thinking and reasoning capabilities.
- Multiple Model Options: Consists of three advanced models, each with unique strengths for different reasoning tasks.
- Improved Performance: Shows significant improvements in reasoning skills on standard benchmarks.
📦 Installation
No installation steps are provided in the original README.
💻 Usage Examples
Basic Usage
For detailed instructions on using the PRM, please refer to the Skywork-o1-PRM-Inference repository.
Advanced Usage
Hugging Face inference
- Clone the Skywork PRM inference repo:

```bash
git clone https://github.com/SkyworkAI/skywork-o1-prm-inference.git
cd skywork-o1-prm-inference
```

- Run PRM inference:
```python
from transformers import AutoTokenizer
from model_utils.prm_model import PRM_MODEL
from model_utils.io_utils import prepare_input, prepare_batch_input_for_model, derive_step_rewards

prm_model_path = "/path/to/prm_model"
tokenizer = AutoTokenizer.from_pretrained(prm_model_path, trust_remote_code=True)
model = PRM_MODEL.from_pretrained(prm_model_path, device_map="auto").eval()

datas = [
    {
        "problem": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        "response": "To determine how much money Janet makes every day at the farmers' market, we need to follow these steps:\n1. **Calculate the total number of eggs laid by the ducks per day.**\n Janet's ducks lay 16 eggs per day.\n2. **Determine the number of eggs Janet uses each day.**\n - She eats 3 eggs for breakfast every morning.\n - She bakes muffins for her friends every day with 4 eggs.\n So, the total number of eggs used per day is:\n \\[\n 3 + 4 = 7 \\text{ eggs}\n \\]\n3. **Calculate the number of eggs Janet sells at the farmers' market each day.**\n Subtract the number of eggs used from the total number of eggs laid:\n \\[\n 16 - 7 = 9 \\text{ eggs}\n \\]\n4. **Determine how much money Janet makes from selling the eggs.**\n She sells each egg for $2, so the total amount of money she makes is:\n \\[\n 9 \\times 2 = 18 \\text{ dollars}\n \\]\nTherefore, the amount of money Janet makes every day at the farmers' market is $\\boxed{18}$."
    }
]

# Tokenize each (problem, response) pair and mark step boundaries, batch the
# inputs, run the PRM, and derive per-step rewards; see the
# skywork-o1-prm-inference repository for the full example.
processed_data = [prepare_input(d["problem"], d["response"], tokenizer=tokenizer, step_token="\n") for d in datas]
input_ids, steps, reward_flags = zip(*processed_data)
input_ids, attention_mask, reward_flags = prepare_batch_input_for_model(input_ids, reward_flags, tokenizer.pad_token_id)
_, _, rewards = model(input_ids=input_ids, attention_mask=attention_mask, return_probs=True)
step_rewards = derive_step_rewards(rewards, reward_flags)
```
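The core idea behind a process reward model is that each reasoning step gets its own reward, and the rewards are then aggregated per response. A minimal, self-contained sketch of that idea (the helper names and reward values here are illustrative assumptions, not the repository's implementation):

```python
def split_into_steps(response, step_token="\n"):
    """Split a model response into reasoning steps on the step token."""
    return [s for s in response.split(step_token) if s.strip()]

def aggregate_step_rewards(step_rewards):
    """PRM-style aggregation: average the reward across all steps."""
    return sum(step_rewards) / len(step_rewards)

# Toy trace with hypothetical per-step rewards.
steps = split_into_steps("16 - 7 = 9 eggs are sold.\n9 * 2 = 18 dollars.\nThe answer is 18.")
print(len(steps))                                # 3
print(aggregate_step_rewards([1.0, 0.5, 0.0]))   # 0.5
```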
📚 Documentation
Introduction
We are excited to announce the release of the Skywork o1 Open model series, developed by the Skywork team at Kunlun Inc. This groundbreaking release introduces a series of models that incorporate o1-like slow thinking and reasoning capabilities. The Skywork o1 Open model series includes three advanced models:
- [Skywork o1 Open-Llama-3.1-8B](https://huggingface.co/Skywork/Skywork-o1-Open-Llama3.1-8B): A robust chat model trained on Llama-3.1-8B, enhanced significantly with "o1-style" data to improve reasoning skills.
- [Skywork o1 Open-PRM-Qwen-2.5-1.5B](https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen2.5-1.5B): A specialized model designed to enhance reasoning capability through incremental process rewards, ideal for complex problem solving at a smaller scale.
- [Skywork o1 Open-PRM-Qwen-2.5-7B](https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen2.5-7B): Extends the capabilities of the 1.5B model by scaling up to handle more demanding reasoning tasks, pushing the boundaries of AI reasoning.
Unlike mere reproductions of the OpenAI o1 model, the Skywork o1 Open model series not only exhibits innate thinking, planning, and reflecting capabilities in its outputs, but also shows significant improvements in reasoning skills on standard benchmarks. This series represents a strategic advancement in AI capabilities, moving a previously weaker base model towards the state-of-the-art (SOTA) in reasoning tasks.
If you are interested in the Skywork o1 Open model series, please check out the [o1-llama-3.1-8b](https://huggingface.co/Skywork/o1-llama-3.1-8b) model.
Model Information
The Skywork-o1-Open-PRM models are trained on [Qwen2.5-Math-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct) and [Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct).
PRM Evaluation
Evaluation Settings
Mathematical Evaluation
We utilized the evaluation scripts from [Qwen2.5-Math](https://github.com/QwenLM/Qwen2.5-Math) and followed their configuration to ensure consistency. The selected datasets include GSM8K, MATH, GaoKao, CN-Middle School 24, OlympiadBench, AIME-24, and AMC-23. Among these, GaoKao and CN-Middle School 24 are Chinese datasets, while the remaining datasets are in English. Notably, OlympiadBench, AIME-24, and AMC-23 are competition-level datasets.
Code Evaluation
For code evaluation, we adopted the evaluation scripts from [Qwen2.5-Coder](https://github.com/QwenLM/Qwen2.5-Coder) while largely maintaining the same configuration. The selected datasets include HumanEval, MBPP, and LiveCodeBench, with LiveCodeBench specifically using the 2024.01-2024.11 version. We use the latest version (0.3.1) of EvalPlus due to issues with tests and code sanitization in previous versions.
Evaluation Base Models
We evaluated the performance of RMs on three base models: Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, and Skywork-o1-Open-8B. Data sampling was conducted to verify the performance of the RMs across different models. The sampling temperature was set to 0.7 for mathematical problems and 1.0 for code-related tasks.
Compared RMs
- [Qwen2.5-Math-RM-72B](https://huggingface.co/Qwen/Qwen2.5-Math-RM-72B): An open-source ORM provided by the Qwen team.
- [OpenR-MATH-psa-PRM-7B](https://huggingface.co/openreasoner/Math-psa): An open-source PRM from the OpenR project.
- [RLHFlow-Deepseek-Data-PRM-8B](https://huggingface.co/RLHFlow/Llama3.1-8B-PRM-Deepseek-Data): An open-source PRM from the RLHFlow project.
Evaluation Metrics
- Greedy Sampling Pass@1: Uses greedy sampling to generate a single answer.
- Majority Voting@64: Randomly samples 64 answers and selects the most frequent final answer.
- Best-of-N@64: Ranks 64 answers based on output values provided by the Reward Model (RM). The weighting methods differ for ORM and PRM:
  - For ORM, only the reward from the final step is used.
  - For PRM, the average reward across all steps is used for weighting.
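The selection rules above can be sketched in a few lines. This is an illustrative toy (the answers and reward values are hypothetical), not the evaluation harness itself:

```python
from collections import Counter

def majority_vote(answers):
    """Majority Voting@N: pick the most frequent final answer among N samples."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, step_rewards, mode="prm"):
    """Best-of-N@N: weight each sampled answer by its reward-model score.

    For an ORM only the final step's reward is used; for a PRM the average
    reward over all steps is used. Identical answers accumulate weight.
    """
    weights = Counter()
    for ans, rewards in zip(answers, step_rewards):
        score = rewards[-1] if mode == "orm" else sum(rewards) / len(rewards)
        weights[ans] += score
    return max(weights, key=weights.get)

# Four sampled answers with hypothetical per-step rewards.
answers = ["18", "20", "18", "14"]
step_rewards = [
    [0.9, 0.8, 0.9],   # a consistently well-rewarded trace
    [0.9, 0.2, 0.1],   # starts well, then degrades
    [0.7, 0.8, 0.8],
    [0.3, 0.2, 0.2],
]
print(majority_vote(answers))                    # 18
print(best_of_n(answers, step_rewards, "prm"))   # 18
print(best_of_n(answers, step_rewards, "orm"))   # 18
```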
Evaluation Results
Mathematical Evaluation
Model | Reward Model | Method | GSM8K | MATH | GaoKao | CN-Middle School 24 | OlympiadBench | AIME-24 | AMC-23 | Avg |
---|---|---|---|---|---|---|---|---|---|---|
Skywork-o1-Open-8B | N/A | Greedy Sampling Pass@1 | 91.6 | 78.1 | 63.6 | 67.3 | 43.1 | 13.3 | 55.0 | 58.9 |
Skywork-o1-Open-8B | N/A | Majority Voting@64 | 93.9 | 84.3 | 69.5 | 73.3 | 50.4 | 16.7 | 52.5 | 62.9 |
Skywork-o1-Open-8B | OpenR-MATH-psa-PRM-7B | Best-of-N@64 | 95.1 | 82.7 | 67.1 | 70.3 | 47.6 | 20.0 | 57.5 | 62.9 |
Skywork-o1-Open-8B | RLHFlow-Deepseek-Data-PRM-8B | Best-of-N@64 | 94.4 | 80.1 | 59.1 | 74.3 | 43.0 | 20.0 | 50.0 | 60.1 |
Skywork-o1-Open-8B | Qwen2.5-Math-RM-72B | Best-of-N@64 | 96.1 | 86.9 | 76.0 | 76.2 | 53.3 | 26.7 | 65.0 | 68.6 |
Skywork-o1-Open-8B | Skywork-o1-Open-PRM-1.5B | Best-of-N@64 | 94.5 | 85.0 | 65.6 | 73.3 | 49.9 | 16.7 | 62.5 | 63.9 |
Skywork-o1-Open-8B | Skywork-o1-Open-PRM-7B | Best-of-N@64 | 96.7 | 87.0 | 70.3 | 76.2 | 52.3 | 23.3 | 65.0 | 67.3 |
Qwen2.5-7B-Instruct | N/A | Greedy Sampling Pass@1 | 91.9 | 75.2 | 55.6 | 75.2 | 39.1 | 13.3 | 45.0 | 56.5 |
Qwen2.5-7B-Instruct | N/A | Majority Voting@64 | 93.5 | 78.4 | 55.3 | 78.2 | 40.1 | 13.3 | 62.5 | 60.2 |
Qwen2.5-7B-Instruct | OpenR-MATH-psa-PRM-7B | Best-of-N@64 | 93.9 | 77.9 | 52.4 | 73.3 | 40.7 | 10.0 | 55.0 | 57.6 |
Qwen2.5-7B-Instruct | RLHFlow-Deepseek-Data-PRM-8B | Best-of-N@64 | 94.1 | 78.1 | 53.2 | 75.2 | 39.1 | 16.7 | 55.0 | 58.8 |
Qwen2.5-7B-Instruct | Qwen2.5-Math-RM-72B | Best-of-N@64 | 94.8 | 82.4 | 65.2 | 80.2 | 45.0 | 13.3 | 62.5 | 63.4 |
Qwen2.5-7B-Instruct | Skywork-o1-Open-PRM-1.5B | Best-of-N@64 | 93.3 | 79.8 | 56.1 | 74.3 | 43.9 | 10.0 | 67.5 | 60.7 |
Qwen2.5-7B-Instruct | Skywork-o1-Open-PRM-7B | Best-of-N@64 | 94.9 | 81.9 | 56.3 | 75.2 | 44.9 | 13.3 | 65.0 | 61.6 |
Llama3.1-8B-Instruct | N/A | Greedy Sampling Pass@1 | 85.3 | 49.7 | 25.3 | 47.5 | 16.6 | 6.7 | 27.5 | 36.9 |
Llama3.1-8B-Instruct | N/A | Majority Voting@64 | 90.9 | 62.9 | 28.0 | 56.4 | 26.4 | 13.3 | 37.5 | 45.1 |
Llama3.1-8B-Instruct | OpenR-MATH-psa-PRM-7B | Best-of-N@64 | 91.8 | 59.4 | 24.7 | 47.5 | 23.0 | 13.3 | 35.0 | 42.1 |
Llama3.1-8B-Instruct | RLHFlow-Deepseek-Data-PRM-8B | Best-of-N@64 | 89.8 | 56.1 | 24.0 | 40.6 | 20.4 | 0.0 | 35.0 | 38.0 |
Llama3.1-8B-Instruct | Qwen2.5-Math-RM-72B | Best-of-N@64 | 94.9 | 72.5 | 44.9 | 65.3 | 34.4 | 23.3 | 60.0 | 56.5 |
Llama3.1-8B-Instruct | Skywork-o1-Open-PRM-1.5B | Best-of-N@64 | 91.7 | 65.6 | 26.8 | 49.5 | 27.0 | 16.7 | 60.0 | 48.2 |
Llama3.1-8B-Instruct | Skywork-o1-Open-PRM-7B | Best-of-N@64 | 94.0 | 69.8 | 32.0 | 56.4 | 29.9 | 16.7 | 52.5 | 50.2 |
Code Evaluation
Since the compared PRMs have not been trained on code-related tasks, this section focuses solely on the performance of Skywork-o1-Open-PRM.
Model | Reward Model | Method | MBPP | MBPP+ | HumanEval | HumanEval+ | LiveCodeBench 2024.01-2024.11 |
---|---|---|---|---|---|---|---|
Skywork-o1-Open-8B | N/A | Greedy Sampling Pass@1 | 79.9 | 65.9 | 82.9 | 78.7 | 26.0 |
Skywork-o1-Open-8B | Skywork-o1-Open-PRM-7B | Best-of-N@64 | 81.2 | 68.5 | 81.1 | 74.4 | 31.3 |
Qwen2.5-7B-Instruct | N/A | Greedy Sampling Pass@1 | 81.7 | 69.3 | 84.8 | 78.0 | 25.3 |
Qwen2.5-7B-Instruct | Skywork-o1-Open-PRM-7B | Best-of-N@64 | 84.9 | 72.5 | 83.5 | 78.0 | 30.7 |
Llama3.1-8B-Instruct | N/A | Greedy Sampling Pass@1 | 75.7 | 64.0 | 68.3 | 63.4 | 16.1 |
Llama3.1-8B-Instruct | Skywork-o1-Open-PRM-7B | Best-of-N@64 | 77.8 | 66.1 | 73.8 | 67.7 | 23.3 |
🔧 Technical Details
No technical details are provided in the original README.
📄 License
The license is "other".

