Doge 160M
Doge 160M is a language model that uses Dynamic Mask Attention for sequence transformation and supports either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. The model is trained by the SmallDoge community; all training details and code are available in the small-doge repository.
Quick Start
Doge uses Dynamic Mask Attention for sequence transformation and can use either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and the Cross Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training.
Usage Examples
Basic Usage
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M", trust_remote_code=True)
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
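Streaming Generation
For interactive use, generation can also be streamed token by token as it is produced. The following is a minimal sketch using transformers' TextStreamer with the same checkpoint; the prompt and generation settings are illustrative and not part of the original card.
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M", trust_remote_code=True)
>>> streamer = TextStreamer(tokenizer, skip_prompt=True)  # prints new tokens as they are generated
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> _ = model.generate(**inputs, max_new_tokens=100, streamer=streamer)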
Documentation
We built Doge by pre-training on the smollm-corpus. If you want to continue pre-training this model, you can find the unconverged checkpoint here; a minimal continued-pre-training sketch follows the table below. These models have not been fine-tuned for instruction following; the instruction-tuned model is here.
Pre-Training
Model | Training Data | Steps | Context Length | Tokens | LR | Batch Size | Precision | RTX 4090 GPU hours
--- | --- | --- | --- | --- | --- | --- | --- | ---
Doge-20M | smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 | 14
Doge-60M | smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 | 128
Doge-160M | smollm-corpus | 24k | 2048 | 32B | 4e-3 | 1.5M | bfloat16 | 522
Doge-320M | smollm-corpus | 32k | 2048 | 64B | 2e-3 | 2M | bfloat16 | 1856
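The table gives the full training recipe. If you only want to continue pre-training from the released checkpoint, the loop below is a minimal, hypothetical sketch: the dataset config name ("fineweb-edu-dedup"), the learning rate, and the step count are illustrative assumptions, not the authors' exact setup.
# Hypothetical continued pre-training sketch; hyperparameters are illustrative only.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M", trust_remote_code=True)
model.train()

# Assumed dataset ID/config; streaming avoids downloading the full corpus.
stream = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
                      split="train", streaming=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # small LR for continued training, not the 4e-3 schedule above

for step, example in enumerate(stream):
    if step >= 100:  # a handful of steps, just to show the loop
        break
    batch = tokenizer(example["text"], return_tensors="pt",
                      truncation=True, max_length=2048)
    # Causal LM objective: labels are the input ids; the model shifts them internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()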
Evaluation
Model | MMLU | TriviaQA | ARC | PIQA | HellaSwag | OBQA | Winogrande | tokens/s on i7-11 CPU
--- | --- | --- | --- | --- | --- | --- | --- | ---
Doge-20M | 25.4 | 0.03 | 29.8 | 58.4 | 27.3 | 25.6 | 50.2 | 142
Doge-60M | 26.4 | 0.2 | 37.9 | 61.4 | 31.5 | 28.0 | 50.8 | 62
Doge-160M | 29.2 | 4.8 | 44.4 | 70.1 | 43.4 | 34.4 | 52.2 | 28
Doge-320M | 35.6 | 9.4 | 55.4 | 73.9 | 52.7 | 37.9 | 59.3 | 16
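The tokens/s column reports the authors' measurements on an i7 11th-gen CPU. To get a rough number on your own hardware, timing greedy decoding with the transformers API is enough; the snippet below is an illustrative sketch, with an arbitrary prompt and token count.
# Rough decode-throughput measurement; results depend heavily on hardware and settings.
import time
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M", trust_remote_code=True)

inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
new_tokens = 128

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens, do_sample=False)
elapsed = time.perf_counter() - start

generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{generated / elapsed:.1f} tokens/s")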
Procedure

Environment
- Image: nvcr.io/nvidia/pytorch:24.12-py3
- Hardware: 1x NVIDIA RTX 4090
- Software: Transformers
License
This project is licensed under the Apache-2.0 license.
Citation
@misc{smalldoges,
  title={SmallDoges: A Family of Dynamic UltraFast Small Language Models},
  author={Jingze, Shi and Yifan, Wu and Bingheng, Wu and Yuyu, Luo},
  year={2025},
  month={March},
  url={https://github.com/SmallDoges/small-doge}
}