🚀 SeaLLM-7B-v2
SeaLLM-7B-v2 is a state-of-the-art multilingual large language model tailored for Southeast Asian languages. It offers high performance across diverse tasks, including world knowledge, math reasoning, and instruction following.
✨ Features
- Impressive Math Reasoning: Achieves the 7B-SOTA on the Zero-shot CoT GSM8K task with a score of 78.2. Outperforms GPT-3.5 in many GSM8K-translated tasks in SEA languages (🇨🇳 🇻🇳 🇮🇩 🇹🇭) and MGSM (🇨🇳 🇹🇭). Also surpasses GPT-3.5 in MATH CoT for Thai 🇹🇭.
- Strong Commonsense Reasoning: Scores competitively against GPT-3.5 in many zero-shot CoT commonsense benchmarks, with scores of 82.5, 68.3, and 80.9 on Arc-C, Winogrande, and Hellaswag respectively.
- High MT-bench Score: Achieves a score of 7.54 on the 🇬🇧 MT-bench, ranking 3rd on the leaderboard for the 7B category and standing as the top-performing multilingual model in that category.
- Competitive in Vietnamese: Scores 45.74 on the VMLU benchmark for Vietnamese 🇻🇳, and is the only open-source multilingual model that can compete with monolingual models of similar sizes.
🚀 Quick Start
We introduce SeaLLM-7B-v2, the state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭. It is the most significant upgrade since SeaLLM-13B: at half the size, it outperforms its predecessor across diverse multilingual tasks such as world knowledge, math reasoning, and instruction following.
Release and DEMO
⚠️ Important Note
By using our released weights, codes, and demos, you agree to and comply with the terms and conditions specified in our SeaLLMs Terms Of Use.
💡 Usage Tip
Although the weights, codes, and demos are released in an open manner, similar to other pre-trained language models, and despite our best efforts in red teaming, safety fine-tuning, and enforcement, our models come with potential risks, including but not limited to inaccurate, misleading, or potentially harmful generation. Developers and stakeholders should perform their own red teaming and put appropriate security measures in place before deployment, and they must abide by and comply with local governance and regulations. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights, codes, or demos.
📚 Documentation
What's new since SeaLLM-13B-v1 and SeaLLM-7B-v1?
SeaLLM-7B-v2 is continually pre-trained from Mistral-7B and underwent carefully designed tuning with a focus on reasoning.
🔧 Technical Details
Evaluation
Zero-shot CoT Multilingual Math Reasoning
SeaLLM-7B-v2 achieves a score of 78.2 on GSM8K with zero-shot CoT reasoning, making it state of the art among 7B models. It also outperforms GPT-3.5 on the same GSM8K benchmark when translated into SEA languages (🇨🇳 🇻🇳 🇮🇩 🇹🇭), and surpasses GPT-3.5 on the Thai-translated MATH benchmark (22.4 vs. 18.1). A minimal sketch of how such a zero-shot item can be scored is given after the table below.

See details on English and translated GSM8K and MATH with zero-shot reasoning
| Model | GSM8K en | MATH en | GSM8K zh | MATH zh | GSM8K vi | MATH vi | GSM8K id | MATH id | GSM8K th | MATH th |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | 80.8 | 34.1 | 48.2 | 21.5 | 55 | 26.5 | 64.3 | 26.4 | 35.8 | 18.1 |
| Qwen-14B-chat | 61.4 | 18.4 | 41.6 | 11.8 | 33.6 | 3.6 | 44.7 | 8.6 | 22 | 6 |
| Vistral-7b-chat | 48.2 | 12.5 | | | 48.7 | 3.1 | | | | |
| Qwen1.5-7B-chat | 56.8 | 15.3 | 40 | 2.7 | 37.7 | 9 | 36.9 | 7.7 | 21.9 | |
| SeaLLM-7B-v2 | 78.2 | 27.5 | 53.7 | 17.6 | 69.9 | 23.8 | 71.5 | 24.4 | 59.6 | 22.4 |
Baselines were evaluated using their respective chat-template and system prompts (Qwen1.5-7B-chat, Vistral).
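To make the setup concrete, the sketch below shows one way such a zero-shot CoT item could be scored: the question is posed as a single user turn, the model reasons freely, and the last number in the reply is taken as the predicted answer. The prompt handling and the `generate` callable are our own illustrative stand-ins, not the harness used to produce the numbers above.

```python
import re

def score_gsm8k_item(generate, question, gold_answer):
    """Illustrative zero-shot CoT scoring for one GSM8K-style item.

    `generate(messages)` is assumed to wrap the chat-template and generation
    code from the Usage Examples section below and return the reply as a string.
    """
    reply = generate([{"role": "user", "content": question}])
    # Treat the last number in the reply as the predicted final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reply.replace(",", ""))
    return bool(numbers) and float(numbers[-1]) == float(gold_answer)
```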
Zero-shot MGSM
SeaLLM-7B-v2 also outperforms GPT-3.5 and Qwen-14B on the multilingual MGSM for Zh and Th.
| Model | MGSM-Zh | MGSM-Th |
|---|---|---|
| ChatGPT (reported) | 61.2 | 47.2 |
| Qwen-14B-chat | 59.6 | 28 |
| SeaLLM-7B-v2 | 64.8 | 62.4 |
Zero-shot Commonsense Reasoning
We compare SeaLLM-7B-v2 with ChatGPT and Mistral-7B-Instruct on various zero-shot commonsense benchmarks (Arc-Challenge, Winogrande, and Hellaswag). We use the 2-stage technique from (Kojima et al., 2023) to obtain the answers; a minimal sketch of this two-stage prompting is given after the table below. Note that we did NOT use "Let's think step-by-step" to invoke explicit CoT.
| 0-shot reasoning | Arc-Challenge | Winogrande | Hellaswag |
|---|---|---|---|
| ChatGPT (reported) | 84.6* | 66.8* | 72.0* |
| ChatGPT (reproduced) | 84.1 | 63.1 | 79.5 |
| Mistral-7B-Instruct | 68.1 | 56.4 | 45.6 |
| Qwen1.5-7B-chat | 79.3 | 59.4 | 69.3 |
| SeaLLM-7B-v2 | 82.5 | 68.3 | 80.9 |
Baselines were evaluated using their respective chat-template and system prompts (Qwen1.5-7B-chat, Mistral).
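For reference, the two-stage extraction first lets the model produce a free-form rationale and then asks it to commit to a single option in a second pass. The sketch below illustrates that flow under assumed prompt wording; it is not the exact evaluation code used to produce the scores above.

```python
def two_stage_answer(generate, question, choices):
    """Two-stage zero-shot answer extraction in the spirit of Kojima et al.

    `generate(prompt)` is assumed to return the model's completion as a string;
    the prompt wording here is a hypothetical stand-in, not the exact prompts
    used for the reported scores.
    """
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    last = chr(64 + len(choices))
    # Stage 1: let the model reason freely about the question (no explicit
    # "Let's think step-by-step" trigger, matching the setup described above).
    rationale = generate(f"{question}\n{options}\nAnswer:")
    # Stage 2: append the rationale and ask the model to commit to one option.
    final = generate(
        f"{question}\n{options}\nAnswer: {rationale}\n"
        f"Therefore, among (A) through ({last}), the answer is"
    )
    return final.strip()
```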
Multilingual World Knowledge
We evaluate models on 3 benchmarks following the recommended default setups: 5-shot MMLU for En, 3-shot M3Exam (M3e) for En, Zh, Vi, Id, Th, and zero-shot VMLU for Vi.
| Model | Langs | En MMLU | En M3e | Zh M3e | Vi M3e | Vi VMLU | Id M3e | Th M3e |
|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | Multi | 68.90 | 75.46 | 60.20 | 58.64 | 46.32 | 49.27 | 37.41 |
| Vistral-7B-chat | Mono | 56.86 | 67.00 | 44.56 | 54.33 | 50.03 | 36.49 | 25.27 |
| Qwen1.5-7B-chat | Multi | 61.00 | 52.07 | 81.96 | 43.38 | 45.02 | 24.29 | 20.25 |
| SeaLLM-7B-v2 | Multi | 61.89 | 70.91 | 55.43 | 51.15 | 45.74 | 42.25 | 35.52 |
The VMLU reproduction script is available here. lm-eval was used to evaluate MMLU. 0-shot VMLU scores for baselines were obtained using their respective chat templates and system prompts (Qwen1.5-7B-chat).
MT-Bench
On the English MT-bench, SeaLLM-7B-v2 achieves a score of 7.54 (3rd place on the leaderboard for the 7B category), outperforming many 70B models, and is arguably the only model on the leaderboard that handles 10 SEA languages.
Refer to mt_bench/seallm_7b_v2.jsonl for the MT-bench predictions of SeaLLM-7B-v2, and here to reproduce it.
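The prediction file is JSON-lines; assuming one JSON record per line (as the .jsonl extension suggests), it can be loaded for inspection like this:

```python
import json

# Load the released MT-bench predictions, one JSON object per line.
with open("mt_bench/seallm_7b_v2.jsonl") as f:
    predictions = [json.loads(line) for line in f]
print(f"{len(predictions)} prediction records")
```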
| Model | Access | Langs | MT-Bench |
|---|---|---|---|
| GPT-4-turbo | closed | multi | 9.32 |
| GPT-4-0613 | closed | multi | 9.18 |
| Mixtral-8x7b (46B) | open | multi | 8.3 |
| Starling-LM-7B-alpha | open | mono (en) | 8.0 |
| OpenChat-3.5-7B | open | mono (en) | 7.81 |
| SeaLLM-7B-v2 | open | multi (10+) | 7.54 |
| Qwen-14B | open | multi | 6.96 |
| Llama-2-70B | open | mono (en) | 6.86 |
| Mistral-7B-instruct | open | mono (en) | 6.84 |
Sea-Bench
Similar to MT-Bench, Sea-bench is a set of categorized instruction test sets to measure models' ability as an assistant, specifically focused on 9 SEA languages, including non-Latin low-resource languages.
As shown, the most significant improvements come from math reasoning, which reaches GPT-3.5-level performance.

Refer to sea_bench/seallm_7b_v2.jsonl for the Sea-bench predictions of SeaLLM-7B-v2.
💻 Usage Examples
Basic Usage
prompt = """<|im_start|>system
You are a helpful assistant.</s><|im_start|>user
Hello world</s><|im_start|>assistant
Hi there, how can I help?</s>"""
print(tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt)))
'<s>', '▁<', '|', 'im', '_', 'start', '|', '>', 'system', '<0x0A>', 'You', '▁are', '▁a', '▁helpful', '▁assistant', '.', '</s>', '▁<', '|', 'im', '_', 'start', '|', '>', 'user', '<0x0A>', 'Hello', '▁world', '</s>', '▁<', '|', 'im', '_', 'start', '|', '>', 'ass', 'istant', '<0x0A>', 'Hi', '▁there', ',', '▁how', '▁can', '▁I', '▁help', '?', '</s>']
"""
Advanced Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

# Load the model in bfloat16 to reduce memory usage.
model = AutoModelForCausalLM.from_pretrained(
    "SeaLLMs/SeaLLM-7B-v2", torch_dtype=torch.bfloat16, device_map=device
)
tokenizer = AutoTokenizer.from_pretrained("SeaLLMs/SeaLLM-7B-v2")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello world"},
    {"role": "assistant", "content": "Hi there, how can I help you today?"},
    {"role": "user", "content": "Explain general relativity in details."},
]

# Build the generation prompt with the model's chat template.
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
print(tokenizer.convert_ids_to_tokens(encodeds[0]))
```
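To generate a reply from the encoded prompt, a standard `generate` call can follow the snippet above. The sketch below is a minimal continuation; the sampling settings (`max_new_tokens`, `do_sample`) are illustrative choices, not recommended defaults.

```python
# Move the encoded prompt to the same device as the model and generate.
model_inputs = encodeds.to(device)
generated_ids = model.generate(
    model_inputs,
    max_new_tokens=512,  # illustrative limit, not an official recommendation
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens.
response = tokenizer.decode(generated_ids[0, model_inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```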
📄 License
The model is released under the SeaLLMs Terms Of Use.