Snakmodel-7b-instruct: an open-source large language model for Danish conversational use, free to deploy

Snakmodel 7b Instruct

Developed by NLPnorth

SnakModel is a 7-billion-parameter large language model designed specifically for Danish. It is based on the Llama 2 architecture and was developed at the IT University of Copenhagen.

Large language model

Transformers

Other: #Danish-specific #instruction-tuned #Llama2-architecture

Downloads: 134

Release date: October 17, 2024

Model overview

A Danish large language model based on the Llama 2 architecture, pre-trained on a Danish corpus of 13.6 billion words and fine-tuned on 3.7 million instruction pairs for Danish NLP tasks.

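Model features

Optimized for Danish

Designed specifically for Danish and pre-trained on a Danish corpus of 13.6 billion words. On the evaluation reported below, the instruction-tuned model averages 56.63 across eight Danish tasks, against 49.50 for LLaMA2-7B_chat.

Instruction-tuned version

Available as a base version and an instruction-tuned version. The latter is fine-tuned on 3.7 million Danish instruction-response pairs so that it follows user instructions more closely.

Training cost

Training took 8,928 GPU hours on four NVIDIA A100 GPUs, with estimated emissions of 272.3 kg CO2eq.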

Model capabilities

Danish text generation

Danish question answering

Danish instruction following

Danish text understanding

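Use cases

Education

Danish learning assistant

Helps students understand and produce Danish text.

Scored 56.28 mF1 on the LA task with the base version (the instruction-tuned version scores 52.91; see the evaluation results).

Customer service

Danish customer service chatbot

Handles customer inquiries in Danish.

Scored 66.70 mF1 on the sentiment analysis (Senti) task with the instruction-tuned version.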

🚀 SnakModel

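SnakModel is a 7-billion-parameter model designed specifically for Danish. It is built on the Llama 2 architecture and was pre-trained and fine-tuned on a large Danish corpus, which makes it suited to Danish natural language processing tasks.

🚀 Quick start

The following snippet uses apply_chat_template to show how to load the tokenizer and model and generate a response.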

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NLPnorth/snakmodel-7b-instruct"

# Load the model weights and the matching tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a chat with a Danish system message and a single user turn.
prompt = "Hvor ligger IT Universitet?"
messages = [
    {"role": "system", "content": "Du er Snakmodel, skabt af IT-Universitetet i København. Du er en hjælpsom assistent."},
    {"role": "user", "content": prompt}
]
# Render the chat into the model's prompt format as a plain string.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate up to 20 new tokens; raise max_new_tokens for longer answers.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=20
)
# Keep only the newly generated tokens by slicing off the prompt tokens.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
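
For decoder-only models, generate returns the prompt tokens followed by the continuation, which is why the snippet above slices each output by the length of its input. The sketch below shows that slicing step in isolation on plain Python lists. The token IDs and the function name strip_prompt are made up for illustration and are not part of the model card.

```python
# Toy token IDs standing in for real tokenizer output (hypothetical values).
input_ids_batch = [[1, 15, 27, 42]]               # prompt tokens
output_ids_batch = [[1, 15, 27, 42, 99, 100, 2]]  # prompt tokens + generated tokens


def strip_prompt(input_ids_batch, output_ids_batch):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]


new_tokens = strip_prompt(input_ids_batch, output_ids_batch)
print(new_tokens)  # [[99, 100, 2]]
```

Without this step, the decoded response would begin with the full prompt text.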

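✨ Key features

Danish-specific: built on the Llama 2 architecture and pre-trained and fine-tuned on an extensive Danish corpus, which gives it strong Danish processing ability.
Multiple versions: an instruction-tuned version and a base version are provided, and each model includes intermediate checkpoints.
Fixed prompt template: inputs follow the [INST] {instruction} [/INST] template.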
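
As a minimal sketch of the template mentioned above, the helper below wraps a single instruction in the [INST] ... [/INST] tags. The function name format_instruction is hypothetical. The sketch covers only the bare single-turn template as stated here, so any special tokens or system-prompt wrapping that the tokenizer's own chat template adds are not reproduced.

```python
def format_instruction(instruction: str) -> str:
    """Wrap one instruction in the [INST] ... [/INST] template (single turn only)."""
    # Assumption: surrounding whitespace in the instruction is not significant.
    return f"[INST] {instruction.strip()} [/INST]"


prompt = format_instruction("Hvor ligger IT Universitet?")
print(prompt)  # [INST] Hvor ligger IT Universitet? [/INST]
```

In practice, tokenizer.apply_chat_template, as used in the quick start, is the safer route because it applies whatever template ships with the model.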

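📚 Documentation

Model details

Model developers: the NLPnorth research group at the IT University of Copenhagen, Denmark.
Variations: SnakModel comes in an instruction-tuned version and a base version. Each model stores its intermediate checkpoints as model revisions.
Input: text only. Instructions must follow the [INST] {instruction} [/INST] template.
Output: text only.
Model architecture: SnakModel is a Transformer-based autoregressive language model. The instruction-tuned version uses supervised fine-tuning (SFT) to enable Danish instruction following.
Model dates: SnakModel was trained between January 2024 and September 2024.
License: the model is released under the original Llama 2 license agreement.
Research paper: published in the proceedings of NoDaLiDa/Baltic-HLT 2025 (March 2025); see the citation at the end of this page.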
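
Since the intermediate checkpoints are described as model revisions, one plausible way to reach them is through the revision argument of from_pretrained, after listing the repository's refs with huggingface_hub.list_repo_refs. The following is a sketch under assumptions: the model card does not say whether checkpoints are stored as branches or tags, so both are collected, and the revision name "step-1000" in the offline illustration is invented. The helper names are hypothetical as well.

```python
from types import SimpleNamespace


def revision_names(refs):
    """Collect branch and tag names from a GitRefs-like object."""
    return sorted({ref.name for ref in list(refs.branches) + list(refs.tags)})


def list_revisions(repo_id):
    """List revision names for a Hub repository (requires network access)."""
    # Imported lazily so revision_names works without huggingface_hub installed.
    from huggingface_hub import list_repo_refs
    return revision_names(list_repo_refs(repo_id))


def load_revision(repo_id, revision):
    """Load the model and tokenizer pinned to one revision."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, revision=revision, torch_dtype="auto", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
    return model, tokenizer


# Offline illustration with a stand-in object; "step-1000" is an invented name.
fake_refs = SimpleNamespace(
    branches=[SimpleNamespace(name="main")],
    tags=[SimpleNamespace(name="step-1000")],
)
print(revision_names(fake_refs))  # ['main', 'step-1000']
```

Calling list_revisions("NLPnorth/snakmodel-7b-instruct") would show the names that actually exist, and any of them can then be passed to load_revision.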

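Intended use and limitations

Intended use cases: SnakModel is intended for Danish only, and the instruction-tuned version is suited to assistant-style chat. The instruction-tuned version follows the Llama 2 (chat) instruction template, in which each instruction is wrapped in special tags: [INST] {instruction} [/INST].
Limitations: the SnakModel variants are fine-tuned on Danish data, so use in other languages is out of scope. Although SnakModel is more proficient in Danish than other Llama 2-based models, it still frequently produces factually incorrect output. Evaluate these factors carefully before deploying the model, and comply with the original Llama 2 license agreement.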

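Hardware and software

Training factors: SnakModel was trained on private infrastructure, using a single node with four NVIDIA A100-PCIe 40GB GPUs. The node has an AMD Epyc 7662 128-core processor and 1 TB of RAM.
Carbon footprint: total training time was 8,928 GPU hours, at an average carbon efficiency of 0.122 kg CO2eq/kWh. According to the Machine Learning Impact calculator, this corresponds to 272.3 kg CO2eq emitted.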
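
The reported emissions can be reproduced with simple arithmetic. The sketch below assumes a power draw of 250 W per GPU, which is the rated TDP of the A100 PCIe 40GB card. That figure is not stated in the text above and is the assumption that makes the numbers line up. The wall-clock estimate further assumes all four GPUs ran concurrently for the whole run.

```python
GPU_HOURS = 8928              # total GPU hours, from the text above
NUM_GPUS = 4                  # GPUs in the training node, from the text above
CARBON_EFFICIENCY = 0.122     # kg CO2eq per kWh, from the text above
ASSUMED_GPU_POWER_KW = 0.250  # assumption: 250 W per GPU (A100 PCIe rated TDP)

energy_kwh = GPU_HOURS * ASSUMED_GPU_POWER_KW
emissions_kg = energy_kwh * CARBON_EFFICIENCY
wall_clock_days = GPU_HOURS / NUM_GPUS / 24  # assumes all four GPUs ran in parallel

print(f"{energy_kwh:.0f} kWh, {emissions_kg:.1f} kg CO2eq, {wall_clock_days:.0f} days")
# 2232 kWh, 272.3 kg CO2eq, 93 days
```

Under these assumptions, the 8,928 GPU hours correspond to roughly three months of wall-clock time on the four-GPU node.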

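Training data

Overview: SnakModel was continuously pre-trained on a diverse collection of Danish corpora comprising 350 million documents and 13.6 billion words. The instruction-tuned version was further fine-tuned on 3.7 million Danish instruction-response pairs.
Data freshness: the pre-training data has a cutoff of January 2024.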

Evaluation results

Model	LA (mF1)	NER (μF1)	Senti (mF1)	Summ (BERTScore)	CSR (Acc.)	QA (F1)	TM (Acc.)	CT (Acc.)	AVG
LLaMA2-7B_base	33.43	22.31	61.54	65.50	29.76	63.54	38.69	57.05	46.48
LLaMA2-7B_chat	47.42	24.63	62.35	66.15	32.24	61.34	46.67	55.18	49.50
LLaMA2-7B_base + INST_da	36.10	28.48	62.86	66.43	29.04	64.40	49.10	58.46	49.35
LLaMA2-7B_chat + INST_da	43.40	29.70	65.92	65.81	30.95	62.46	57.26	55.59	51.39
Viking-7B	33.67	17.18	49.48	61.96	25.11	56.29	23.97	34.90	37.82
SnakModel-7B_base	56.28	19.91	57.42	58.95	30.47	18.52	69.14	60.93	46.45
SnakModel-7B_inst	52.91	29.76	66.70	66.61	29.46	64.66	71.05	71.88	56.63
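
To make the table easier to read, the sketch below recomputes the AVG column for the two SnakModel rows and shows where instruction tuning changes the score most. It assumes AVG is the unweighted mean of the eight task scores. That assumption is consistent with these two rows but is not stated in the table itself.

```python
TASKS = ["LA", "NER", "Senti", "Summ", "CSR", "QA", "TM", "CT"]

# Scores copied from the two SnakModel rows of the table above.
SCORES = {
    "SnakModel-7B_base": [56.28, 19.91, 57.42, 58.95, 30.47, 18.52, 69.14, 60.93],
    "SnakModel-7B_inst": [52.91, 29.76, 66.70, 66.61, 29.46, 64.66, 71.05, 71.88],
}


def mean(values):
    """Unweighted arithmetic mean (assumed to be how AVG is computed)."""
    return sum(values) / len(values)


base_avg = mean(SCORES["SnakModel-7B_base"])
inst_avg = mean(SCORES["SnakModel-7B_inst"])

# Per-task change from the base model to the instruction-tuned model.
gains = {
    task: round(inst - base, 2)
    for task, base, inst in zip(
        TASKS, SCORES["SnakModel-7B_base"], SCORES["SnakModel-7B_inst"]
    )
}
largest_gain = max(gains, key=gains.get)

print(f"base AVG {base_avg:.2f}, inst AVG {inst_avg:.2f}")
print(f"largest gain: {largest_gain} ({gains[largest_gain]:+.2f})")
```

Most of the difference in AVG between the two variants comes from QA, where the base model scores far below every other row, while LA and CSR drop slightly after instruction tuning.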

Citation

@inproceedings{zhang-etal-2025-snakmodel,
    title = "{SnakModel}: {Lessons} Learned from Training an Open {Danish} Large Language Model",
    author = {Zhang, Mike  and
      M{\"u}ller-Eberstein, Max  and
      Bassignana, Elisa  and
      Goot, Rob van der},
    editor = "Johansson, Richard  and
      Stymne, Sara",
    booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
    month = mar,
    year = "2025",
    address = "Tallinn, Estonia",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2025.nodalida-1.80/",
    pages = "812--825",
    ISBN = "978-9908-53-109-0",
    abstract = "We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints."
}