Doge 160M
Doge 160M is a language model that uses Dynamic Mask Attention for sequence transformation and supports either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. The model is trained by the SmallDoge community; all training details and code are available in the small-doge repository.
Quick Start
Doge uses Dynamic Mask Attention for sequence transformation and can use either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and the Cross Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training.
Usage Examples
Basic Usage
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M", trust_remote_code=True)
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
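Streaming Generation
For interactive use, generation can also be streamed token by token as it is produced. The following is a minimal sketch using transformers' TextStreamer with the same checkpoint; the prompt and generation settings are illustrative and not part of the original card.
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M", trust_remote_code=True)
>>> streamer = TextStreamer(tokenizer, skip_prompt=True)  # prints new tokens as they are generated
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> _ = model.generate(**inputs, max_new_tokens=100, streamer=streamer)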
Documentation
We built Doge by pre-training on the smollm-corpus. If you want to continue pre-training this model, you can find the unconverged checkpoint here; a minimal continued-pre-training sketch follows the table below. These models have not been fine-tuned for instruction following; the instruction-tuned model is here.
Pre-Training
Model | Training Data | Steps | Context Length | Tokens | LR | Batch Size | Precision | RTX 4090 GPU hours
--- | --- | --- | --- | --- | --- | --- | --- | ---
Doge-20M | smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 | 14
Doge-60M | smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 | 128
Doge-160M | smollm-corpus | 24k | 2048 | 32B | 4e-3 | 1.5M | bfloat16 | 522
Doge-320M | smollm-corpus | 32k | 2048 | 64B | 2e-3 | 2M | bfloat16 | 1856
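The table gives the full training recipe. If you only want to continue pre-training from the released checkpoint, the loop below is a minimal, hypothetical sketch: the dataset config name ("fineweb-edu-dedup"), the learning rate, and the step count are illustrative assumptions, not the authors' exact setup.
# Hypothetical continued pre-training sketch; hyperparameters are illustrative only.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M", trust_remote_code=True)
model.train()

# Assumed dataset ID/config; streaming avoids downloading the full corpus.
stream = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
                      split="train", streaming=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # small LR for continued training, not the 4e-3 schedule above

for step, example in enumerate(stream):
    if step >= 100:  # a handful of steps, just to show the loop
        break
    batch = tokenizer(example["text"], return_tensors="pt",
                      truncation=True, max_length=2048)
    # Causal LM objective: labels are the input ids; the model shifts them internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()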
Evaluation
Model | MMLU | TriviaQA | ARC | PIQA | HellaSwag | OBQA | Winogrande | tokens/s on i7-11 CPU
--- | --- | --- | --- | --- | --- | --- | --- | ---
Doge-20M | 25.4 | 0.03 | 29.8 | 58.4 | 27.3 | 25.6 | 50.2 | 142
Doge-60M | 26.4 | 0.2 | 37.9 | 61.4 | 31.5 | 28.0 | 50.8 | 62
Doge-160M | 29.2 | 4.8 | 44.4 | 70.1 | 43.4 | 34.4 | 52.2 | 28
Doge-320M | 35.6 | 9.4 | 55.4 | 73.9 | 52.7 | 37.9 | 59.3 | 16
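The tokens/s column reports the authors' measurements on an i7 11th-gen CPU. To get a rough number on your own hardware, timing greedy decoding with the transformers API is enough; the snippet below is an illustrative sketch, with an arbitrary prompt and token count.
# Rough decode-throughput measurement; results depend heavily on hardware and settings.
import time
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M", trust_remote_code=True)

inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
new_tokens = 128

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
elapsed = time.perf_counter() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated / elapsed:.1f} tokens/s")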
Procedure

Environment
- Image: nvcr.io/nvidia/pytorch:24.12-py3
- Hardware: 1x NVIDIA RTX 4090
- Software: Transformers
License
This project is licensed under the Apache-2.0 license.
Citation
@misc{smalldoges,
  title={SmallDoges: A Family of Dynamic UltraFast Small Language Models},
  author={Jingze, Shi and Yifan, Wu and Bingheng, Wu and Yuyu, Luo},
  year={2025},
  month={March},
  url={https://github.com/SmallDoges/small-doge}
}