Doge-160M-Reason-Distill: an open-source lightweight language model

Doge 160M Reason Distill

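Developed by SmallDoge

Doge 160M Reason Distill is a lightweight language model built on Dynamic Mask Attention and Cross Domain Mixture of Experts, specialized for reasoning and question-answering tasks.

Large language model

Transformers

Language: English. Open-source license: Apache-2.0. #Dynamic Mask Attention #Reasoning distillation #Long chain-of-thought support

Downloads: 26

Release date: February 18, 2025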

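Model overview

The model uses Dynamic Mask Attention for sequence transformation and either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention lets the Transformer use self-attention during training and switch to a state-space mechanism at inference.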

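Model features

Dynamic Mask Attention

Uses self-attention during training and can switch to a state-space mechanism at inference, which improves inference efficiency.

Cross Domain Mixture of Experts

Can directly inherit the weights of a Multi-Layer Perceptron for further training, which improves the model's adaptability.

Reasoning distillation

Supervised fine-tuning (SFT) on the Reason-Distill dataset to optimize reasoning ability.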

Model capabilities

Question-answer generation

Logical reasoning

Math problem solving

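Use cases

Education

Math problem solving

Solves basic numeric comparison and arithmetic problems

Correctly compares the magnitude of numbers and shows its reasoning process

Intelligent assistant

Systematic problem solving

Provides a detailed thinking process and solution in a specified format

Generates a structured thinking process and a final solution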

🚀 Doge 160M Reason Distill

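Doge 160M Reason Distill is a model specialized for question-answering tasks. It uses Dynamic Mask Attention for sequence transformation and a Multi-Layer Perceptron or Cross Domain Mixture of Experts for state transformation. It was developed by the SmallDoge community; see the Wonderful Matrices paper (arXiv:2412.11834) for details of the algorithm and model architecture.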

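🚀 Quick start

Doge uses Dynamic Mask Attention for sequence transformation and a Multi-Layer Perceptron or Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and a state-space formulation at inference. Cross Domain Mixture of Experts can directly inherit the weights of a Multi-Layer Perceptron for further training. The model was trained by the SmallDoge community. For details of the algorithm and model architecture, see Wonderful Matrices. All training details and code are published in the small-doge repository.

💻 Usage examples

Basic usage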

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, TextStreamer

# Load the tokenizer and model. trust_remote_code=True lets Transformers run
# the model code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-160M-Reason-Distill")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-160M-Reason-Distill", trust_remote_code=True)

# Sampling settings. Raise max_new_tokens if the reasoning gets cut off.
generation_config = GenerationConfig(
    max_new_tokens=100,
    use_cache=True,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    repetition_penalty=1.0
)
# Print tokens to stdout as they are generated, omitting the prompt.
streamer = TextStreamer(
    tokenizer=tokenizer,
    skip_prompt=True
)

# System prompt kept verbatim, including its original spelling.
system_prompt = """
Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\n\n'} <|end_of_thought|> Each step should include detailed considerations such as analisying questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution} <|end_of_solution|> Now, try to solve the following question through the above guidelines:
""".strip()
prompt = "Which number is bigger, 3.9 or 3.11?"
conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template and tokenize it.
inputs = tokenizer.apply_chat_template(
    conversation=conversation,
    tokenize=True,
    add_generation_prompt=True,  # ask the template to open the assistant turn
    return_dict=True,  # return input_ids and attention_mask together
    return_tensors="pt",
)

outputs = model.generate(
    **inputs,
    tokenizer=tokenizer,
    generation_config=generation_config,
    streamer=streamer
)
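
The system prompt asks the model to wrap its reasoning in <|begin_of_thought|> ... <|end_of_thought|> and its answer in <|begin_of_solution|> ... <|end_of_solution|>. The following is a minimal sketch of how decoded output in that format could be split into reasoning steps and a final answer. The function name split_thought_and_solution and its return shape are illustrative assumptions, not part of the model or of Transformers, and the sample text is hand-written rather than real model output.

```python
import re

# Marker strings are taken from the system prompt. Everything else here
# (function name, return shape) is a hypothetical helper for illustration.
THOUGHT_RE = re.compile(r"<\|begin_of_thought\|>(.*?)<\|end_of_thought\|>", re.DOTALL)
SOLUTION_RE = re.compile(r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL)


def split_thought_and_solution(text):
    """Return (thought_steps, solution).

    thought_steps is a list of strings (empty if no complete Thought section
    is found); solution is a string, or None if no complete Solution section
    is found.
    """
    thought_match = THOUGHT_RE.search(text)
    solution_match = SOLUTION_RE.search(text)
    steps = []
    if thought_match:
        # The prompt asks for steps separated by blank lines.
        raw_steps = re.split(r"\n\s*\n", thought_match.group(1))
        steps = [s.strip() for s in raw_steps if s.strip()]
    solution = solution_match.group(1).strip() if solution_match else None
    return steps, solution


# Hand-written sample in the requested format (not real model output).
sample = (
    "<|begin_of_thought|>\n\nCompare the integer parts: both are 3.\n\n"
    "Compare the fractional parts: 0.90 > 0.11.\n\n<|end_of_thought|>\n\n"
    "<|begin_of_solution|>\n\n3.9 is bigger than 3.11.\n\n<|end_of_solution|>"
)
steps, solution = split_thought_and_solution(sample)
print(len(steps), "steps;", solution)
```

With max_new_tokens=100 the generation may stop before the closing markers appear; in that case this sketch returns an empty step list and None instead of raising. If the markers turn out to be registered as special tokens in the tokenizer (not verified here), the output would need to be decoded with skip_special_tokens=False for them to survive.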

📚 Documentation

Model details

We build Doge-Reason-Distill by performing supervised fine-tuning (SFT) on Reason-Distill.

TODO: Larger models are currently in training and will be uploaded soon.

SFT:

Model	Training data	Epochs	Context length	Learning rate	Batch size	Precision
Doge-160M-Reason-Distill	SmallDoge/Reason-Distill	2	4096	4e-4	0.5M	bfloat16
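
The table does not say what unit the batch size of 0.5M is in. As a rough sketch, assuming it counts tokens per optimizer step (an assumption, not something the table states), the number of full-length 4096-token sequences per step can be estimated under two readings of "M". The helper name sequences_per_step is hypothetical.

```python
def sequences_per_step(batch_tokens, context_length):
    """Number of full-length sequences that fit in one batch of tokens."""
    return batch_tokens / context_length


CONTEXT_LENGTH = 4096  # from the table

# Assumption: "0.5M" means tokens per optimizer step. Two readings of "M":
readings = [
    ("0.5 * 2**20 tokens", 2**19),
    ("0.5 * 10**6 tokens", 500_000),
]
for label, batch_tokens in readings:
    count = sequences_per_step(batch_tokens, CONTEXT_LENGTH)
    print(label, "->", round(count, 1), "sequences per step")
```

Either reading gives on the order of 120 to 130 full-length sequences per step; real batches would hold more sequences when the examples are shorter than the maximum length.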

Procedure:

SFT:

Environment:

Image: nvcr.io/nvidia/pytorch:24.12-py3
Hardware: 1x NVIDIA RTX 4090
Software: Transformers, TRL

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

@misc{shi2024wonderfulmatrices,
      title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture}, 
      author={Jingze Shi and Bingheng Wu},
      year={2024},
      eprint={2412.11834},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2412.11834}, 
}