🚀 SmolLM3
SmolLM3 is a 3B parameter language model designed to push the boundaries of small models. It supports 6 languages, advanced reasoning and long context, offering strong performance at the 3B–4B scale.
🚀 Quick Start
The modeling code for SmolLM3 is available in transformers v4.53.0, so make sure to upgrade your transformers version. You can also load the model with the latest vllm, which uses transformers as a backend.

```bash
pip install -U transformers
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
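To serve the model with vLLM instead, here is a minimal sketch, assuming a recent vLLM release that supports the transformers backend:

```bash
pip install -U vllm
# Starts an OpenAI-compatible API server for the checkpoint
vllm serve HuggingFaceTB/SmolLM3-3B
```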
For local inference, you can use llama.cpp, ONNX, MLX and MLC. You can find quantized checkpoints in this collection: https://huggingface.co/collections/HuggingFaceTB/smollm3-686d33c1fdffe8e635317e23.
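As an example, with llama.cpp you could run one of the GGUF files from that collection; the filename below is illustrative, use whichever quantization you downloaded:

```bash
# -m: path to the downloaded GGUF file, -p: prompt, -n: number of tokens to generate
llama-cli -m SmolLM3-3B-Q4_K_M.gguf -p "Gravity is" -n 128
```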
Long context processing
The current `config.json` is set for a context length of up to 65,536 tokens. To handle longer inputs (128k or 256k), we use YaRN: change `max_position_embeddings` and `rope_scaling` to the following (a factor of 2.0 gives 2 × 65,536 = 131,072 tokens):
```json
{
  ...,
  "rope_scaling": {
    "factor": 2.0,
    "original_max_position_embeddings": 65536,
    "type": "yarn"
  }
}
```
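If you prefer not to edit `config.json` by hand, here is a minimal sketch of applying the same override programmatically, using the values shown above:

```python
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "HuggingFaceTB/SmolLM3-3B"

# Load the stock config and apply the YaRN extension: 2.0 x 65,536 = 131,072 tokens
config = AutoConfig.from_pretrained(checkpoint)
config.max_position_embeddings = 131072
config.rope_scaling = {
    "factor": 2.0,
    "original_max_position_embeddings": 65536,
    "type": "yarn",
}

model = AutoModelForCausalLM.from_pretrained(checkpoint, config=config, device_map="auto")
```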
✨ Features
- Instruct model optimized for hybrid reasoning: Specifically tuned for tasks requiring a combination of different types of reasoning.
- Fully open model: The model comes with open weights and full training details, including the public data mixture and training configs.
- Long context: Trained on 64k context and supports up to 128k tokens using YaRN extrapolation.
- Multilingual: Natively supports 6 languages (English, French, Spanish, German, Italian, and Portuguese).
For more details refer to our blog post: https://hf.co/blog/smollm3
📦 Installation
```bash
pip install -U transformers
```
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
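Since this is an instruct model with hybrid reasoning, chat-style prompts go through the tokenizer's chat template. Here is a minimal sketch, assuming the template exposes an `enable_thinking` flag to toggle extended thinking (see the blog post for the exact prompting conventions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

messages = [{"role": "user", "content": "Explain gravity to a 10-year-old."}]
# enable_thinking toggles extended-thinking mode; assumed to be supported by the chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```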
📚 Documentation
Model Summary
SmolLM3 is a 3B parameter language model designed to push the boundaries of small models. It supports 6 languages, advanced reasoning and long context. SmolLM3 is a fully open model that offers strong performance at the 3B–4B scale.
The model is a decoder-only transformer using GQA and NoPE; it was pretrained on 11.2T tokens with a staged curriculum of web, code, math and reasoning data. Post-training included mid-training on 140B reasoning tokens, followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO).
Evaluation
In this section, we report the evaluation results of the SmolLM3 model. All evaluations are zero-shot unless stated otherwise, and we use lighteval to run them.
We highlight the best score in bold and underline the second-best score.
Base Pre-Trained Model
English benchmarks
Note: All evaluations are zero-shot unless stated otherwise. For Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B-Base | Qwen3-4B-Base |
|---|---|---|---|---|---|---|
| Reasoning & Commonsense | HellaSwag | 76.15 | 74.19 | 75.52 | 60.52 | 74.37 |
| | ARC-CF (Average) | 65.61 | 59.81 | 58.58 | 55.88 | 62.11 |
| | Winogrande | 58.88 | 61.41 | 58.72 | 57.06 | 59.59 |
| | CommonsenseQA | 55.28 | 49.14 | 60.60 | 48.98 | 52.99 |
| Knowledge & Understanding | MMLU-CF (Average) | 44.13 | 42.93 | 41.32 | 39.11 | 47.65 |
| | MMLU Pro CF | 19.61 | 16.66 | 16.42 | 18.04 | 24.92 |
| | MMLU Pro MCF | 32.70 | 31.32 | 25.07 | 30.39 | 41.07 |
| | PIQA | 78.89 | 78.35 | 78.51 | 75.35 | 77.58 |
| | OpenBookQA | 40.60 | 40.20 | 42.00 | 36.40 | 42.40 |
| | BoolQ | 78.99 | 73.61 | 75.33 | 74.46 | 74.28 |
| Coding & math | HumanEval+ | 30.48 | 34.14 | 25.00 | 43.29 | 54.87 |
| | MBPP+ | 52.91 | 52.11 | 38.88 | 59.25 | 63.75 |
| | MATH (4-shot) | 46.10 | 40.10 | 7.44 | 41.64 | 51.20 |
| | GSM8k (5-shot) | 67.63 | 70.13 | 25.92 | 65.88 | 74.14 |
| Long context | Ruler 32k | 76.35 | 75.93 | 77.58 | 70.63 | 83.98 |
| | Ruler 64k | 67.85 | 64.90 | 72.93 | 57.18 | 60.29 |
| | Ruler 128k | 61.03 | 62.23 | 71.30 | 43.03 | 47.23 |
Multilingual benchmarks
| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---|---|---|---|---|---|---|
| Main supported languages | | | | | | |
| French | MLMM Hellaswag | 63.94 | 57.47 | 57.66 | 51.26 | 61.00 |
| | Belebele | 51.00 | 51.55 | 49.22 | 49.44 | 55.00 |
| | Global MMLU (CF) | 38.37 | 34.22 | 33.71 | 34.94 | 41.80 |
| | Flores-200 (5-shot) | 62.85 | 61.38 | 62.89 | 58.68 | 65.76 |
| Spanish | MLMM Hellaswag | 65.85 | 58.25 | 59.39 | 52.40 | 61.85 |
| | Belebele | 47.00 | 48.88 | 47.00 | 47.56 | 50.33 |
| | Global MMLU (CF) | 38.51 | 35.84 | 35.60 | 34.79 | 41.22 |
| | Flores-200 (5-shot) | 48.25 | 50.00 | 44.45 | 46.93 | 50.16 |
| German | MLMM Hellaswag | 59.56 | 49.99 | 53.19 | 46.10 | 56.43 |
| | Belebele | 48.44 | 47.88 | 46.22 | 48.00 | 53.44 |
| | Global MMLU (CF) | 35.10 | 33.19 | 32.60 | 32.73 | 38.70 |
| | Flores-200 (5-shot) | 56.60 | 50.63 | 54.95 | 52.58 | 50.48 |
| Italian | MLMM Hellaswag | 62.49 | 53.21 | 54.96 | 48.72 | 58.76 |
| | Belebele | 46.44 | 44.77 | 43.88 | 44.00 | 48.78 |
| | Global MMLU (CF) | 36.99 | 33.91 | 32.79 | 35.37 | 39.26 |
| | Flores-200 (5-shot) | 52.65 | 54.87 | 48.83 | 48.37 | 49.11 |
| Portuguese | MLMM Hellaswag | 63.22 | 57.38 | 56.84 | 50.73 | 59.89 |
| | Belebele | 47.67 | 49.22 | 45.00 | 44.00 | 50.00 |
| | Global MMLU (CF) | 36.88 | 34.72 | 33.05 | 35.26 | 40.66 |
| | Flores-200 (5-shot) | 60.93 | 57.68 | 54.28 | 56.58 | 63.43 |
The model has also been trained on Arabic (standard), Chinese and Russian data, but has seen fewer tokens in these languages compared to the 6 above. We report the performance on these languages for reference.
| Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
|---|---|---|---|---|---|---|
| Other supported languages | | | | | | |
| Arabic | Belebele | 40.22 | 44.22 | 45.33 | 42.33 | 51.78 |
| | Global MMLU (CF) | 28.57 | 28.81 | 27.67 | 29.37 | 31.85 |
| | Flores-200 (5-shot) | 40.22 | 39.44 | 44.43 | 35.82 | 39.76 |
| Chinese | Belebele | 43.78 | 44.56 | 49.56 | 48.78 | 53.22 |
| | Global MMLU (CF) | 36.16 | 33.79 | 39.57 | 38.56 | 44.55 |
| | Flores-200 (5-shot) | 29.17 | 33.21 | 31.89 | 25.70 | 32.50 |
| Russian | Belebele | 47.44 | 45.89 | 47.44 | 45.22 | 51.44 |
| | Global MMLU (CF) | 36.51 | 32.47 | 34.52 | 34.83 | 38.80 |
| | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | 54.70 | 60.53 |
Instruction Model
No Extended Thinking
Evaluation results for non-reasoning models and for reasoning models in no-thinking mode. We highlight the best and second-best scores in bold.
| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
|---|---|---|---|---|---|---|
| High school math competition | AIME 2025 | 9.3 | 2.9 | 0.3 | 8.0 | 17.1 |
| Math problem-solving | GSM-Plus | 72.8 | 74.1 | 59.2 | 68.3 | 82.1 |
| Competitive programming | LiveCodeBench v4 | 15.2 | 10.5 | 3.4 | 15.0 | 24.9 |
| Graduate-level reasoning | GPQA Diamond | 35.7 | 32.2 | 29.4 | 31.8 | 44.4 |
| Instruction following | IFEval | 76.7 | 65.6 | 71.6 | 74.0 | 68.9 |
| Alignment | MixEval Hard | 26.9 | 27.6 | 24.9 | 24.3 | 31.6 |
| Tool Calling | BFCL | 92.3 | - | 92.3* | 89.5 | 95.0 |
| Multilingual Q&A | Global MMLU | 53.5 | 50.54 | 46.8 | 49.5 | 65.1 |
(*): this score was obtained with a tool-calling finetune.
Extended Thinking
Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
|---|---|---|---|---|
| High school math competition | AIME 2025 | 36.7 | 30.7 | 58.8 |
| Math problem-solving | GSM-Plus | 83.4 | 79.4 | 88.2 |
| Competitive programming | LiveCodeBench v4 | 30.0 | 34.4 | 52.9 |
| Graduate-level reasoning | GPQA Diamond | 41.7 | 39.9 | 55.3 |
| Instruction following | IFEval | 71.2 | 74.2 | 85.4 |
| Alignment | MixEval Hard | 30.8 | 33.9 | 38.0 |
| Tool Calling | BFCL | 88.8 | 88.8 | 95.5 |
| Multilingual Q&A | Global MMLU | 64.1 | 62.3 | 73.3 |
Training
Model
- Architecture: Transformer decoder
- Pretraining tokens: 11T
- Precision: bfloat16
Software & hardware
- GPUs: 384 H100
- Training Framework: nanotron
- Data processing framework: datatrove
- Evaluation framework: lighteval
- Post-training Framework: TRL
Open resources
Here is an infographic with all the training details.
- The datasets used for pretraining can be found in this collection and those used in mid-training and post-training will be released in the following weeks
- The training and evaluation configs and code can be found in the huggingface/smollm repository.
🔧 Technical Details
The model is a decoder-only transformer using GQA and NoPE; it was pretrained on 11.2T tokens with a staged curriculum of web, code, math and reasoning data. Post-training included mid-training on 140B reasoning tokens, followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO).
📄 License

