🚀 Doge 160M Reason Distill
The Doge 160M Reason Distill model uses Dynamic Mask Attention for sequence transformation and can use either a Multi-Layer Perceptron or a Cross-Domain Mixture of Experts for state transformation. It was trained by the SmallDoge community and handles question-answering tasks effectively. The detailed algorithm and model architecture are described in the accompanying paper, and all training details and code are publicly available in the GitHub repository.
🚀 Quick Start
Doge uses Dynamic Mask Attention for sequence transformation and can use either a Multi-Layer Perceptron or a Cross-Domain Mixture of Experts for state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, while the Cross-Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training. This model was trained by the SmallDoge community. For the detailed algorithm and model architecture, please refer to Wonderful Matrices; all training details and code are publicly available in the small-doge repository.
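For intuition only, here is a rough conceptual sketch of the dynamic-mask idea: a content-dependent mask is computed from the value states and added to the attention scores before the softmax, so that suppressed positions are effectively skipped. All names, shapes, and the exact mask formulation below are illustrative assumptions, not the model's actual implementation; see Wonderful Matrices and the small-doge repository for the real code.

import torch
import torch.nn.functional as F

def dynamic_mask_attention_sketch(q, k, v, dt_proj, causal_mask):
    # q, k, v: (batch, heads, seq_len, head_dim); dt_proj: a small linear layer head_dim -> 1 (illustrative)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    # Content-dependent "dynamic mask" derived from the value states (illustrative formulation)
    dyn = dt_proj(v).squeeze(-1)       # (batch, heads, seq_len)
    dyn_bias = F.logsigmoid(dyn)       # <= 0; strongly negative entries suppress those key positions
    scores = scores + dyn_bias.unsqueeze(-2) + causal_mask
    return F.softmax(scores, dim=-1) @ v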
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

# Load the tokenizer and model (trust_remote_code is required for the custom Doge architecture)
tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M-Reason-Distill")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M-Reason-Distill", trust_remote_code=True)
# Generation settings: sampling with temperature and nucleus (top-p) filtering
generation_config = GenerationConfig(
    max_new_tokens=100,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.0
)
# Stream generated tokens to stdout as they are produced
streamer = TextStreamer(
    tokenizer=tokenizer,
    skip_prompt=True
)
system_prompt = """
Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\n\n'} <|end_of_thought|> Each step should include detailed considerations such as analisying questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution} <|end_of_solution|> Now, try to solve the following question through the above guidelines:
""".strip()
prompt = "Which number is bigger, 3.9 or 3.11?"
# Build the conversation in chat format
conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    return_tensors="pt",
)
# Generate the response, streaming tokens through the TextStreamer
outputs = model.generate(
    inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=streamer
)
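To capture the full generated text as a string in addition to the streamed output, you can decode the returned token IDs. This small follow-up reuses the variables from the example above:

# Strip the prompt tokens and decode only the newly generated portion
generated_text = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(generated_text)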
📚 Documentation
We build the Doge-Reason-Distill models by performing supervised fine-tuning (SFT) on Reason-Distill.
⚠️ Important Note
Larger models are currently being trained and will be uploaded soon.
Supervised Fine-Tuning (SFT) Information
Training Procedure
- Supervised Fine-Tuning (SFT):

Training Environment
- Image: nvcr.io/nvidia/pytorch:24.12-py3
- Hardware: 1x NVIDIA RTX 4090
- Software: Transformers, TRL (an illustrative SFT sketch is shown below)
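The exact SFT hyperparameters are not listed here. As a rough illustration of the training setup with TRL, the following is a minimal sketch, assuming a recent TRL version; the dataset name, base checkpoint, output path, and hyperparameter values are placeholder assumptions rather than the settings used to train this model.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Placeholder dataset and base checkpoint -- adjust to the actual Reason-Distill data and base model
dataset = load_dataset("SmallDoge/Reason-Distill", split="train")
tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M-Instruct")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M-Instruct", trust_remote_code=True)

# Illustrative hyperparameters only; the real training configuration may differ
training_args = SFTConfig(
    output_dir="./doge-160m-reason-distill",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=2,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()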
📄 License
This project is licensed under the Apache-2.0 License.
📖 Citation
If you use this model in your research, please cite it using the following BibTeX entry:
@misc{shi2024wonderfulmatrices,
      title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture},
      author={Jingze Shi and Bingheng Wu},
      year={2024},
      eprint={2412.11834},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.11834},
}