# Doge 320M
Doge 320M is a language model that uses Dynamic Mask Attention for sequence transformation and can use a Multi-Layer Perceptron or Cross Domain Mixture of Experts for state transformation. The model is trained by the SmallDoge community.
## Quick Start
Doge uses Dynamic Mask Attention for sequence transformation and can use a Multi-Layer Perceptron or Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention lets the Transformer use self-attention during training and a state-space formulation during inference, and Cross Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training. The model is trained by the SmallDoge community. A paper describing the algorithm and model architecture in detail is coming soon. All training details and code are available in the [small-doge](https://github.com/SmallDoges/small-doge) repository.
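Which architecture variant a given checkpoint uses is recorded in its configuration. A minimal sketch for inspecting it with the standard Transformers API (the exact field names depend on the custom configuration class shipped with the model and are not listed here):

```python
from transformers import AutoConfig

# Fetch the checkpoint's configuration; trust_remote_code is needed because
# Doge ships custom modeling code on the Hub.
config = AutoConfig.from_pretrained("SmallDoge/Doge-320M", trust_remote_code=True)

# Printing the config shows the architecture hyperparameters, including
# whether this checkpoint uses the plain MLP or the CDMoE state transformation.
print(config)
```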
## Usage Examples
### Basic Usage
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-320M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-320M", trust_remote_code=True)
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
```
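The example above uses greedy decoding. Standard `generate` arguments and a `TextStreamer` from Transformers can be combined for sampled, streamed output; the sampling values below are illustrative rather than settings recommended by the authors:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-320M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-320M", trust_remote_code=True)

inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

# Print tokens as they are generated, skipping the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Sample instead of greedy decoding; temperature/top_p here are illustrative values.
model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    streamer=streamer,
)
```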
## Documentation
### Model Details
We build Doge by pre-training on [Smollm-Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus). If you want to continue pre-training this model, you can find the unconverged checkpoint [here](https://huggingface.co/SmallDoge/Doge-320M-checkpoint). These models have not been fine-tuned for instruction following; the instruction-tuned model is available [here](https://huggingface.co/SmallDoge/Doge-320M-Instruct).
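For chat-style prompts, the instruction-tuned checkpoint linked above is the better starting point. A minimal sketch, assuming its tokenizer ships a chat template (check the Doge-320M-Instruct card for the recommended prompt format and generation settings):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-320M-Instruct")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-320M-Instruct", trust_remote_code=True)

# Build the prompt with the chat template bundled in the tokenizer (assumed to exist).
messages = [{"role": "user", "content": "Write a short poem about a dog."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

out = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```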
Pre-Training:
| Model | Training Data | Steps | Content Length | Tokens | LR | Batch Size | Precision | RTX 4090 GPU hours |
|---|---|---|---|---|---|---|---|---|
| Doge-20M | smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 | 14 |
| Doge-60M | smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 | 128 |
| Doge-160M | smollm-corpus | 24k | 2048 | 32B | 4e-3 | 1.5M | bfloat16 | 522 |
| Doge-320M | smollm-corpus | 32k | 2048 | 64B | 2e-3 | 2M | bfloat16 | 1856 |
Evaluation:
| Model | MMLU | TriviaQA | ARC | PIQA | HellaSwag | OBQA | Winogrande | tokens / s on i7-11 CPU |
|---|---|---|---|---|---|---|---|---|
| Doge-20M | 25.4 | 0.03 | 29.8 | 58.4 | 27.3 | 25.6 | 50.2 | 142 |
| Doge-60M | 26.4 | 0.2 | 37.9 | 61.4 | 31.5 | 28.0 | 50.8 | 62 |
| Doge-160M | 29.2 | 4.8 | 44.4 | 70.1 | 43.4 | 34.4 | 52.2 | 28 |
| Doge-320M | 35.6 | 9.4 | 55.4 | 73.9 | 52.7 | 37.9 | 59.3 | 16 |
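The card does not state which harness produced these numbers. To run comparable evaluations yourself, one option is EleutherAI's lm-evaluation-harness; the sketch below reflects an assumed setup (task selection, zero-shot, batch size), not the authors' exact protocol:

```python
import lm_eval

# Zero-shot evaluation of the checkpoint on a subset of the benchmarks above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=SmallDoge/Doge-320M,trust_remote_code=True",
    tasks=["piqa", "hellaswag", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```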
Procedure:

Environment:
- Image: nvcr.io/nvidia/pytorch:24.12-py3
- Hardware: 1x NVIDIA RTX 4090
- Software: Transformers
## License
This project is licensed under the Apache-2.0 license.
## Citation
```bibtex
@misc{smalldoges,
  title={SmallDoges: A Family of Dynamic UltraFast Small Language Models},
  author={Jingze, Shi and Yifan, Wu and Bingheng, Wu and Yuyu, Luo},
  year={2025},
  month={March},
  url={https://github.com/SmallDoges/small-doge}
}
```