# InternLM-Math-Plus
InternLM-Math-Plus is a state-of-the-art bilingual open-source math reasoning LLM. It serves as a solver, prover, verifier, and augmentor, offering advanced capabilities for math-related tasks.
## Quick Start
The project provides multiple model sizes and has achieved excellent performance in both formal and informal math reasoning. You can access the models on GitHub and try the demo on Hugging Face.
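For a quick sanity check, the sketch below loads the 7B Plus checkpoint with Hugging Face transformers and asks a simple question. The model ID and the `chat` helper are assumptions based on the usual InternLM conventions; verify both against the model card.

```python
# Minimal sketch for loading a Plus checkpoint with Hugging Face
# transformers. The model ID and the `chat` helper are assumed from the
# usual InternLM conventions; check the model card before relying on them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm2-math-plus-7b"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

# InternLM checkpoints expose a convenience `chat` method via remote code.
response, _ = model.chat(tokenizer, "What is 1/2 + 1/3?", history=[])
print(response)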
## Features
- Multiple Model Sizes: Available in 1.8B, 7B, 20B, and 8x22B sizes to meet different needs.
- Formal and Informal Math Reasoning: Performs well in both formal math reasoning (e.g., on MiniF2F-test) and informal math reasoning (e.g., on MATH and GSM8K).
- Bilingual Support: Supports both English and Chinese.
## News
- [2024.05.24] We released the updated version InternLM2-Math-Plus in four sizes (1.8B, 7B, 20B, and 8x22B) with state-of-the-art performance. We significantly improved informal math reasoning (chain-of-thought and code interpreter) and formal math reasoning (LEAN 4 translation and LEAN 4 theorem proving).
- [2024.02.10] We added the tech report and citation reference.
- [2024.01.31] We added MiniF2F results with evaluation code!
- [2024.01.29] We added checkpoints on ModelScope and updated results on majority voting and Code Interpreter. The tech report is on the way!
- [2024.01.26] We added checkpoints on OpenXLab, which makes them easier for Chinese users to download!
## Performance
### Formal Math Reasoning
We evaluated InternLM2-Math-Plus on the formal math reasoning benchmark MiniF2F-test, using the same LEAN 4 evaluation setting as Llemma.
| Models | MiniF2F-test |
| --- | --- |
| ReProver | 26.5 |
| LLMStep | 27.9 |
| GPT-F | 36.6 |
| HTPS | 41.0 |
| Llemma-7B | 26.2 |
| Llemma-34B | 25.8 |
| InternLM2-Math-7B-Base | 30.3 |
| InternLM2-Math-20B-Base | 29.5 |
| InternLM2-Math-Plus-1.8B | 38.9 |
| InternLM2-Math-Plus-7B | 43.4 |
| InternLM2-Math-Plus-20B | 42.6 |
| InternLM2-Math-Plus-Mixtral8x22B | 37.3 |
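To make the task concrete: MiniF2F problems are LEAN 4 theorem statements, and the prover must generate the tactic proof that closes the goal. The toy statement below is illustrative only, not an actual benchmark problem.

```lean
-- A toy LEAN 4 goal of the shape found in MiniF2F (illustrative, not an
-- actual benchmark problem). The model receives the statement up to `by`
-- and must generate the tactic block that closes the goal.
theorem add_zero_example (n : Nat) : n + 0 = n := by
  rfl
```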
### Informal Math Reasoning
We evaluated InternLM2-Math-Plus on the informal math reasoning benchmarks MATH and GSM8K. At the smallest size, InternLM2-Math-Plus-1.8B outperforms MiniCPM-2B. InternLM2-Math-Plus-7B outperforms Deepseek-Math-7B-RL, the previous state-of-the-art open-source math reasoning model. InternLM2-Math-Plus-Mixtral8x22B achieves 68.5 on MATH (with Python) and 91.8 on GSM8K.
| Model | MATH | MATH-Python | GSM8K |
| --- | --- | --- | --- |
| MiniCPM-2B | 10.2 | - | 53.8 |
| InternLM2-Math-Plus-1.8B | 37.0 | 41.5 | 58.8 |
| InternLM2-Math-7B | 34.6 | 50.9 | 78.1 |
| Deepseek-Math-7B-RL | 51.7 | 58.8 | 88.2 |
| InternLM2-Math-Plus-7B | 53.0 | 59.7 | 85.8 |
| InternLM2-Math-20B | 37.7 | 54.3 | 82.6 |
| InternLM2-Math-Plus-20B | 53.8 | 61.8 | 87.7 |
| Mixtral8x22B-Instruct-v0.1 | 41.8 | - | 78.6 |
| Eurux-8x22B-NCA | 49.0 | - | - |
| InternLM2-Math-Plus-Mixtral8x22B | 58.1 | 68.5 | 91.8 |
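For context, the MATH-Python column refers to the code-interpreter setting: rather than answering with pure chain-of-thought, the model writes a short Python program whose output is taken as the final answer. The sketch below shows that execute-and-read-off loop; the runner and answer extraction are simplified assumptions, not our evaluation harness.

```python
# Simplified sketch of the code-interpreter loop behind the MATH-Python
# column: run model-generated Python and read its stdout as the answer.
# Illustrative only; the real evaluation harness differs.
import subprocess
import sys
import tempfile

def run_model_code(code: str, timeout: int = 30) -> str:
    """Execute model-generated Python in a subprocess and capture stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout.strip()

# Suppose the model answers "What is 12! / 10!?" with this program:
generated = "import math\nprint(math.factorial(12) // math.factorial(10))"
print(run_model_code(generated))  # -> 132
```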
We also evaluated the models on MathBench-A, where InternLM2-Math-Plus-Mixtral8x22B performs comparably to Claude 3 Opus.
| Model | Arithmetic | Primary | Middle | High | College | Average |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-0513 | 77.7 | 87.7 | 76.3 | 59.0 | 54.0 | 70.9 |
| Claude 3 Opus | 85.7 | 85.0 | 58.0 | 42.7 | 43.7 | 63.0 |
| Qwen-Max-0428 | 72.3 | 86.3 | 65.0 | 45.0 | 27.3 | 59.2 |
| Qwen-1.5-110B | 70.3 | 82.3 | 64.0 | 47.3 | 28.0 | 58.4 |
| Deepseek-V2 | 82.7 | 89.3 | 59.0 | 39.3 | 29.3 | 59.9 |
| Llama-3-70B-Instruct | 70.3 | 86.0 | 53.0 | 38.7 | 34.7 | 56.5 |
| InternLM2-Math-Plus-Mixtral8x22B | 77.5 | 82.0 | 63.6 | 50.3 | 36.8 | 62.0 |
| InternLM2-Math-20B | 58.7 | 70.0 | 43.7 | 24.7 | 12.7 | 42.0 |
| InternLM2-Math-Plus-20B | 65.8 | 79.7 | 59.5 | 47.6 | 24.8 | 55.5 |
| Llama3-8B-Instruct | 54.7 | 71.0 | 25.0 | 19.0 | 14.0 | 36.7 |
| InternLM2-Math-7B | 53.7 | 67.0 | 41.3 | 18.3 | 8.0 | 37.7 |
| Deepseek-Math-7B-RL | 68.0 | 83.3 | 44.3 | 33.0 | 23.0 | 50.3 |
| InternLM2-Math-Plus-7B | 61.4 | 78.3 | 52.5 | 40.5 | 21.7 | 50.9 |
| MiniCPM-2B | 49.3 | 51.7 | 18.0 | 8.7 | 3.7 | 26.3 |
| InternLM2-Math-Plus-1.8B | 43.0 | 43.3 | 25.4 | 18.9 | 4.7 | 27.1 |
## Citation and Tech Report
```bibtex
@misc{ying2024internlmmath,
      title={InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning},
      author={Huaiyuan Ying and Shuo Zhang and Linyang Li and Zhejian Zhou and Yunfan Shao and Zhaoye Fei and Yichuan Ma and Jiawei Hong and Kuikun Liu and Ziyi Wang and Yudong Wang and Zijian Wu and Shuaibin Li and Fengzhe Zhou and Hongwei Liu and Songyang Zhang and Wenwei Zhang and Hang Yan and Xipeng Qiu and Jiayu Wang and Kai Chen and Dahua Lin},
      year={2024},
      eprint={2402.06332},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
## License
The project uses a custom license (listed as "Other" on GitHub); see the repository's license file for details.