🚀 Doge 160M Reason Distill
The Doge-160M-Reason-Distill model uses Dynamic Mask Attention for sequence transformation and can use a Multi-Layer Perceptron or Cross-Domain Mixture of Experts for state transformation. Trained by the SmallDoge community, it handles question-answering tasks effectively. For the detailed algorithm and model architecture, see the accompanying paper; all training details and code are publicly available in the GitHub repository.
🚀 Quick Start
Doge uses Dynamic Mask Attention for sequence transformation and can use a Multi-Layer Perceptron or Cross-Domain Mixture of Experts for state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, while the Cross-Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training. This model was trained by the SmallDoge community. For the detailed algorithm and model architecture, refer to Wonderful Matrices; all training details and code are publicly available in the small-doge repository.
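The weight-inheritance idea can be illustrated with a small sketch. This is a conceptual toy in NumPy, not the actual Doge implementation: each hypothetical expert is initialized as a copy of a trained MLP's weights, so a router that averages the experts reproduces the MLP exactly and further training starts from the MLP's function.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 32, 4

# A "trained" dense MLP: y = relu(x @ W_in) @ W_out
W_in = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)

def mlp(x):
    return np.maximum(x @ W_in, 0) @ W_out

# Hypothetical expert initialization: every expert copies the MLP weights.
experts = [(W_in.copy(), W_out.copy()) for _ in range(n_experts)]

def moe(x, router_probs):
    # router_probs: routing probabilities over experts, summing to 1
    return sum(p * (np.maximum(x @ Wi, 0) @ Wo)
               for p, (Wi, Wo) in zip(router_probs, experts))

x = rng.normal(size=(3, d_model))
uniform = np.ones(n_experts) / n_experts
# With copied experts and uniform routing, the MoE starts where the MLP left off.
assert np.allclose(moe(x, uniform), mlp(x))
```

Training can then specialize the experts and the router away from this starting point without an initial loss spike.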
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M-Reason-Distill")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M-Reason-Distill", trust_remote_code=True)

generation_config = GenerationConfig(
    max_new_tokens=100,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.0
)
streamer = TextStreamer(
    tokenizer=tokenizer,
    skip_prompt=True
)

system_prompt = """
Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop a well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\n\n'} <|end_of_thought|> Each step should include detailed considerations such as analyzing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail the necessary steps needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution} <|end_of_solution|> Now, try to solve the following question through the above guidelines:
""".strip()

prompt = "Which number is bigger, 3.9 or 3.11?"

conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    return_tensors="pt",
)

outputs = model.generate(
    inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=streamer
)
```
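Because the system prompt asks the model to wrap its output in `<|begin_of_thought|>` / `<|begin_of_solution|>` markers, you may want to separate the two sections after decoding. Below is a small helper of our own (not part of the model's API) that splits a decoded completion on those markers; the `demo` string is a stand-in for real model output.

```python
# Hypothetical helper (not provided by the model): split a decoded completion
# into its Thought and Solution sections using the markers the system prompt
# asks the model to emit.
def split_reasoning(text: str) -> dict:
    sections = {}
    for name, start, end in [
        ("thought", "<|begin_of_thought|>", "<|end_of_thought|>"),
        ("solution", "<|begin_of_solution|>", "<|end_of_solution|>"),
    ]:
        lo = text.find(start)
        hi = text.find(end)
        if lo != -1 and hi != -1:
            sections[name] = text[lo + len(start):hi].strip()
        else:
            sections[name] = None  # marker missing: section not produced
    return sections

# Illustrative stand-in for model output:
demo = ("<|begin_of_thought|>Compare the fractional parts: 0.9 > 0.11."
        "<|end_of_thought|> <|begin_of_solution|>3.9 is bigger."
        "<|end_of_solution|>")
print(split_reasoning(demo)["solution"])  # -> 3.9 is bigger.
```

In practice you would pass `tokenizer.decode(outputs[0], skip_special_tokens=True)` instead of `demo`; small models may occasionally omit a marker, which the helper reports as `None`.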
📚 Documentation
We build the Doge-Reason-Distill models by performing supervised fine-tuning (SFT) on Reason-Distill.
⚠️ Important Note
Larger models are currently being trained and will be uploaded soon.
Supervised Fine-Tuning (SFT) Details
Training Procedure
- Supervised Fine-Tuning (SFT):

Training Environment
- Image: nvcr.io/nvidia/pytorch:24.12-py3
- Hardware: 1x NVIDIA RTX 4090
- Software: Transformers, TRL
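An SFT run of this kind can be sketched with TRL's `SFTTrainer`. This is a configuration sketch only: the base checkpoint name, the `SmallDoge/Reason-Distill` dataset id, and all hyperparameters are illustrative assumptions, not the exact recipe used for this model.

```python
# Configuration sketch -- checkpoint name, dataset id, and hyperparameters
# are assumptions for illustration, not the recipe used for this model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "SmallDoge/Doge-160M"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("SmallDoge/Reason-Distill", split="train")  # assumed dataset id

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="./doge-reason-distill-sft",
        num_train_epochs=2,              # illustrative
        per_device_train_batch_size=1,   # fits a single RTX 4090
        gradient_accumulation_steps=16,  # illustrative
        learning_rate=8e-4,              # illustrative
        bf16=True,
    ),
)
trainer.train()
```

See the small-doge repository for the actual training scripts and hyperparameters.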
📄 License
This project is licensed under the Apache-2.0 License.
📖 Citation
If you use this model in your research, please cite it with the following BibTeX:
```bibtex
@misc{shi2024wonderfulmatrices,
    title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture},
    author={Jingze Shi and Bingheng Wu},
    year={2024},
    eprint={2412.11834},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2412.11834},
}
```