
Doge 160M Reason Distill

Developed by SmallDoge
Doge 160M Reason Distill is a lightweight language model built on a dynamic masked attention mechanism and a cross-domain mixture of experts, focused on reasoning and question-answering tasks.
Downloads 26
Release Time: 2/18/2025

Model Overview

This model employs a dynamic masked attention mechanism for sequence transformation and can optionally use a multi-layer perceptron or a cross-domain mixture of experts for state transformation. The dynamic masked attention mechanism lets the Transformer use self-attention during training and switch to a state-space mechanism during inference.
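To make the training/inference distinction concrete, here is a minimal NumPy sketch of the idea: full causal self-attention in training mode, and attention restricted to a short lookback window in inference mode. The function names and the fixed-window rule are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_masked_attention(q, k, v, window=None):
    """Toy attention: full causal self-attention when `window` is None
    (training mode), or attention limited to the last `window` positions
    (a cheaper, state-space-like inference mode).

    Illustrative sketch only; the windowing rule is an assumption.
    """
    seq, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)        # (seq, seq) similarity scores
    i = np.arange(seq)
    mask = i[None, :] > i[:, None]         # causal: hide future positions
    if window is not None:                 # inference: also limit lookback
        mask |= i[None, :] < (i[:, None] - window + 1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
full = dynamic_masked_attention(q, k, v)                 # training-style
windowed = dynamic_masked_attention(q, k, v, window=2)   # inference-style
```

With a window at least as long as the sequence, the two modes agree exactly; a short window trades a small approximation for constant per-token cost at inference time.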

Model Features

Dynamic Masked Attention Mechanism
Lets the model use self-attention during training and switch to a state-space mechanism during inference, improving inference efficiency.
Cross-domain Mixture of Experts
Can directly inherit the weights of a multi-layer perceptron for further training, enhancing the model's adaptability.
Reasoning Distillation
Supervised fine-tuning on the Reason-Distill dataset optimizes the model's reasoning capabilities.
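The weight-inheritance point can be sketched as follows: initialize every expert as a copy of a trained dense MLP, so the mixture reproduces the dense block exactly at initialization and can then be fine-tuned further. This is an illustration of the principle, assuming uniform routing; the actual cross-domain MoE routing in Doge differs.

```python
import numpy as np

def dense_mlp(x, w1, w2):
    """A plain two-layer MLP block with ReLU activation."""
    return np.maximum(x @ w1, 0.0) @ w2

def moe_from_mlp(x, w1, w2, n_experts=4):
    """Mixture of experts initialized from a dense MLP: each expert
    starts as a copy of the MLP's weights (uniform routing assumed),
    so at initialization the MoE matches the dense block exactly."""
    experts = [(w1.copy(), w2.copy()) for _ in range(n_experts)]
    outputs = [dense_mlp(x, e1, e2) for e1, e2 in experts]
    return sum(outputs) / n_experts       # uniform average over experts

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))
```

Because all experts are identical copies at initialization, `moe_from_mlp(x, w1, w2)` equals `dense_mlp(x, w1, w2)` exactly; subsequent training can then specialize the experts.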

Model Capabilities

Question Answering Generation
Logical Reasoning
Mathematical Problem Solving

Use Cases

Education
Mathematical Problem Solving
Solving basic mathematical comparison and calculation problems
Can correctly compare the magnitudes of numbers and show its reasoning process
Intelligent Assistant
Systematic Problem Solving
Providing a detailed thinking process and solution in a specific format
Can generate a structured thinking process followed by a final solution
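If the model separates its reasoning from its final answer with explicit markers, the two parts can be extracted programmatically. The tag names below are assumptions for illustration; check the chat template shipped with the model for the actual markers.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split a model response into (thinking, solution).

    The tag names are hypothetical; the real output format may differ.
    """
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1:
        return "", text.strip()           # no explicit reasoning section
    thinking = text[start + len(open_tag):end].strip()
    solution = text[end + len(close_tag):].strip()
    return thinking, solution

reply = "<think>9.11 < 9.9 because 0.11 < 0.90.</think> 9.9 is larger."
thinking, solution = split_reasoning(reply)
```

Here `thinking` holds the step-by-step comparison and `solution` holds the final answer, matching the structured-output use case above.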