🚀 STILL-3-1.5B-preview: A Slow-Thinking Reasoning Model
We release STILL-3-1.5B-preview, a slow-thinking reasoning model that reaches 39.33% accuracy on the AIME benchmark. We applied reinforcement learning to a 1.5B-parameter model and observed that its performance keeps improving as the number of training steps grows. To make our work easier to reproduce and to advance research in this area, we open-source the code, the model, and the data.
Code: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
🚀 Quick Start
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "RUC-AIBOX/STILL-3-1.5B-preview"

# The tokenizer is only needed to build the chat-formatted prompt; vLLM loads the weights itself.
tokenizer = AutoTokenizer.from_pretrained(model_path)

question = "Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\\theta),$ where $r > 0$ and $0 \\le \\theta < 2 \\pi.$"

# Build the prompt with the model's chat template.
input_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

llm = LLM(model=model_path, tensor_parallel_size=1, dtype="bfloat16")
sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    seed=42,
    skip_special_tokens=False,
)

responses = llm.generate([input_prompt], sampling_params)
print(responses[0].outputs[0].text)
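If vLLM is not available, the same prompt can also be run with plain transformers generation. The snippet below is a minimal sketch under that assumption; it reuses the `question` variable from the example above, and `max_new_tokens` is kept smaller than the vLLM setting only to keep the example quick.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "RUC-AIBOX/STILL-3-1.5B-preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

# Apply the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sampled decoding with the same temperature / top-p as the vLLM example.
outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95)

# Strip the prompt tokens before decoding the generated answer.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))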
✨ Key Features
We evaluate the model on four benchmarks: MATH, AIME, OMNI, and LiveAOPS. For MATH and AIME, we use sampled decoding with temperature 0.6 and top-p 0.95, draw 64 samples per question, and report the average accuracy (a sketch of this scoring protocol follows the results table below). For OMNI and LiveAOPS (August-November 2024), we randomly sample a subset of problems whose answers are integers to simplify automatic evaluation, and evaluate with greedy decoding. After training, STILL-3-1.5B-preview achieves substantial improvements: its AIME accuracy rises from 28.67% to 39.33%, a relative gain of 37.18%.
|  | MATH | AIME | OMNI | LiveAOPS | Avg. |
| --- | --- | --- | --- | --- | --- |
| Base model | 84.04 | 28.67 | 25.60 | 33.33 | 42.91 |
| STILL-3-1.5B-preview | 85.48 | 39.33 | 33.00 | 39.50 | 49.33 |
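As a concrete illustration of the sampled-decoding protocol described above, the sketch below shows how per-question accuracy can be averaged over 64 samples. The helpers generate_answer and is_correct are hypothetical placeholders for generation and answer checking; the actual evaluation pipeline is in the linked repository.

def average_accuracy(problems, n_samples=64):
    # problems: list of (question, reference_answer) pairs.
    # For each question, score n_samples sampled responses (temperature 0.6, top-p 0.95)
    # and report the mean accuracy over all samples, as in the table above.
    total, correct = 0, 0
    for question, reference in problems:
        for seed in range(n_samples):
            answer = generate_answer(question, seed=seed)  # hypothetical generation helper
            correct += int(is_correct(answer, reference))  # hypothetical answer checker
            total += 1
    return 100.0 * correct / total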
📚 Documentation
If our report is helpful for your research, please cite it as follows:
@article{Slow_Thinking_with_LLMs_3_Preview,
title={STILL-3-1.5B-preview: Enhancing Slow Thinking Abilities of Small Models through Reinforcement Learning},
author={RUCAIBox STILL Team},
url={https://github.com/RUCAIBox/Slow_Thinking_with_LLMs},
year={2025}
}
@article{Slow_Thinking_with_LLMs_1,
title={Enhancing LLM Reasoning with Reward-guided Tree Search},
author={Jiang, Jinhao and Chen, Zhipeng and Min, Yingqian and Chen, Jie and Cheng, Xiaoxue and Wang, Jiapeng and Tang, Yiru and Sun, Haoxiang and Deng, Jia and Zhao, Wayne Xin and Liu, Zheng and Yan, Dong and Xie, Jian and Wang, Zhongyuan and Wen, Ji-Rong},
journal={arXiv preprint arXiv:2411.11694},
year={2024}
}
@article{Slow_Thinking_with_LLMs_2,
title={Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems},
author={Min, Yingqian and Chen, Zhipeng and Jiang, Jinhao and Chen, Jie and Deng, Jia and Hu, Yiwen and Tang, Yiru and Wang, Jiapeng and Cheng, Xiaoxue and Song, Huatong and Zhao, Wayne Xin and Liu, Zheng and Wang, Zhongyuan and Wen, Ji-Rong},
journal={arXiv preprint arXiv:2412.09413},
year={2024}
}