# MiniMax-Text-01
MiniMax-Text-01 is a high-performance language model with 456 billion total parameters. It adopts a hybrid architecture and advanced parallel strategies to achieve long-context processing capabilities and excellent performance on various benchmarks.
## Quick Start
For detailed usage and deployment of MiniMax-Text-01, please refer to the official documentation and relevant guides, available through the official website, API platform, and other official resources.
## Features
- Large-scale Parameters: MiniMax-Text-01 has 456 billion total parameters, with 45.9 billion activated per token, providing strong language understanding and generation capabilities.
- Hybrid Architecture: It combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE) to better unlock long-context capabilities.
- Extended Context Length: Leveraging advanced parallel strategies and innovative compute-communication overlap methods, the training context length is extended to 1 million tokens, and it can handle up to 4 million tokens during inference.
- Top-tier Performance: Demonstrates excellent performance on various academic benchmarks.
## Documentation
### 1. Model Architecture
The architecture of MiniMax-Text-01 is briefly described as follows:
| Property | Details |
|---|---|
| Total Parameters | 456B |
| Activated Parameters per Token | 45.9B |
| Number of Layers | 80 |
| Hybrid Attention | One softmax attention layer is positioned after every 7 lightning attention layers.<br>- Number of attention heads: 64<br>- Attention head dimension: 128 |
| Mixture of Experts | - Number of experts: 32<br>- Expert hidden dimension: 9216<br>- Top-2 routing strategy |
| Positional Encoding | Rotary Position Embedding (RoPE) applied to half of the attention head dimension, with a base frequency of 10,000,000 |
| Hidden Size | 6144 |
| Vocab Size | 200,064 |
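The layer schedule and partial-RoPE figures above can be illustrated with a short sketch. This is a hypothetical reconstruction from the numbers in the table, not the official implementation; the names `attention_type` and `rope_inverse_frequencies` are illustrative only.

```python
NUM_LAYERS = 80
HEAD_DIM = 128
ROPE_BASE = 10_000_000  # base frequency from the table above

def attention_type(layer_idx: int) -> str:
    """One softmax-attention layer after every 7 lightning-attention
    layers, i.e. every 8th layer (0-indexed layers 7, 15, 23, ...)."""
    return "softmax" if (layer_idx + 1) % 8 == 0 else "lightning"

# 80 layers -> 70 lightning-attention + 10 softmax-attention layers
schedule = [attention_type(i) for i in range(NUM_LAYERS)]

def rope_inverse_frequencies(head_dim: int = HEAD_DIM, base: float = ROPE_BASE):
    """RoPE applied to half of the head dimension: rotary_dim = 64,
    which gives rotary_dim // 2 = 32 distinct rotation frequencies."""
    rotary_dim = head_dim // 2
    return [base ** (-2 * i / rotary_dim) for i in range(rotary_dim // 2)]

inv_freq = rope_inverse_frequencies()
```

Under this reading, only one layer in eight pays the quadratic cost of softmax attention, which is one way the architecture keeps long contexts tractable.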
### 2. Evaluation

#### Core Academic Benchmarks
| Tasks | GPT-4o (11-20) | Claude-3.5-Sonnet (10-22) | Gemini-1.5-Pro (002) | Gemini-2.0-Flash (exp) | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01 |
|---|---|---|---|---|---|---|---|---|
| **General** | | | | | | | | |
| MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5 |
| MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7 |
| SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7 |
| C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4 |
| IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1 |
| Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1 |
| **Reasoning** | | | | | | | | |
| GPQA* (diamond) | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4 |
| DROP* (F1) | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8 |
| **Mathematics** | | | | | | | | |
| GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8 |
| MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4 |
| **Coding** | | | | | | | | |
| MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7 |
| HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9 |
\* Evaluated following a 0-shot CoT setting.
#### Long Benchmarks

**4M Needle In A Haystack Test**

**Ruler**
| Model | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o (11-20) | 0.970 | 0.921 | 0.890 | 0.888 | 0.884 | - | - | - | - |
| Claude-3.5-Sonnet (10-22) | 0.965 | 0.960 | 0.957 | 0.950 | 0.952 | 0.938 | - | - | - |
| Gemini-1.5-Pro (002) | 0.962 | 0.960 | 0.960 | 0.958 | 0.938 | 0.917 | 0.916 | 0.861 | 0.850 |
| Gemini-2.0-Flash (exp) | 0.960 | 0.960 | 0.951 | 0.957 | 0.937 | 0.860 | 0.797 | 0.709 | - |
| MiniMax-Text-01 | 0.963 | 0.961 | ... | ... | ... | ... | ... | ... | ... |
## License