Snakmodel-7b-instruct: an open-source large language model for Danish conversation, free to deploy

Snakmodel 7b Instruct

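Developed by NLPnorth

SnakModel is a 7-billion-parameter large language model built specifically for Danish. It is based on the Llama 2 architecture and was developed at the IT University of Copenhagen.

Large language model

Transformers

Other #Danish-specific #instruction-tuned #Llama2-architecture

Downloads: 134

Release date: 10/17/2024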

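Model overview

A Danish large language model based on the Llama 2 architecture, pre-trained on a Danish corpus of 13.6 billion words and fine-tuned on 3.7 million instruction-answer pairs. It is suited to Danish NLP tasks.

Model highlights

Optimized for Danish

Built specifically for Danish and pre-trained on a Danish corpus of 13.6 billion words. On the evaluation reported on this page, the instruction-tuned variant averages 56.63 across eight Danish tasks, compared with 49.50 for LLaMA2-7B_chat.

Instruction-tuned version

Available as a base version and an instruction-tuned version. The latter is fine-tuned on 3.7 million Danish instruction-answer pairs and follows user instructions more closely.

Efficient training

Trained on four NVIDIA A100 GPUs for a total of 8,928 GPU hours, with a carbon footprint of 272.3 kg CO2eq.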

Model capabilities

Danish text generation

Danish question answering

Danish instruction following

Danish text understanding

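Use cases

Education

Danish learning assistant

Helps students understand and produce Danish content

On the LA task, the base variant reaches 56.28 mF1 (the instruction-tuned variant scores 52.91)

Customer service

Danish customer-service chatbot

Handles customer inquiries in Danish

The instruction-tuned variant reaches 66.70 mF1 on the sentiment analysis (Senti) task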

🚀 SnakModel

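SnakModel is a 7-billion-parameter model built specifically for Danish. It is based on the Llama 2 architecture and is pre-trained and fine-tuned on a large Danish corpus, which makes it well suited to Danish natural language processing tasks.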

🚀 Quick start

The snippet below uses apply_chat_template to show how to load the tokenizer and model and how to generate a response.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NLPnorth/snakmodel-7b-instruct"

# Load the model weights; "auto" picks the dtype and device placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build a chat with a Danish system prompt and a user question.
prompt = "Hvor ligger IT Universitet?"
messages = [
    {"role": "system", "content": "Du er Snakmodel, skabt af IT-Universitetet i København. Du er en hjælpsom assistent."},
    {"role": "user", "content": prompt}
]
# Render the messages into a single prompt string using the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate up to 20 new tokens; raise max_new_tokens for longer answers.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=20
)
# Drop the prompt tokens so that only the newly generated tokens remain.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

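✨ Key features

Built for Danish: based on the Llama 2 architecture and pre-trained and fine-tuned on a large Danish corpus, which gives it stronger Danish capabilities.
Multiple versions: an instruction-tuned version and a base version are available, and each model also includes intermediate checkpoints.
Fixed prompt template: input follows the [INST] {instruction} [/INST] template.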

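📚 Detailed documentation

Model details

Model developers: the NLPnorth research group at the IT University of Copenhagen, Denmark.
Variants: SnakModel comes in an instruction-tuned version and a base version; each model includes intermediate checkpoints under its model revisions.
Input: text only; instructions must follow the [INST] {instruction} [/INST] template.
Output: text only.
Model architecture: SnakModel is a Transformer-based autoregressive language model. The instruction-tuned version uses supervised fine-tuning (SFT) to enable instruction following in Danish.
Model dates: SnakModel was trained between January 2024 and September 2024.
License: the model is released under the original Llama 2 license agreement.
Research paper: published in March 2025 in the NoDaLiDa/Baltic-HLT 2025 proceedings (see the citation at the end of this page).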
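
The model details above mention intermediate checkpoints stored under model revisions. A minimal sketch of how such a revision might be listed and loaded, assuming the checkpoints are exposed as branches or tags of the Hugging Face repository (this page does not say which, and it does not give any revision names). The helper names list_revisions and load_checkpoint are hypothetical; list_repo_refs and the revision argument of from_pretrained are existing huggingface_hub and transformers features.

```python
MODEL_ID = "NLPnorth/snakmodel-7b-instruct"


def list_revisions(repo_id: str = MODEL_ID) -> list[str]:
    """Return the branch and tag names of the repository (needs network access)."""
    # Imported lazily so that defining the helper does not require the library.
    from huggingface_hub import list_repo_refs

    refs = list_repo_refs(repo_id)
    return [branch.name for branch in refs.branches] + [tag.name for tag in refs.tags]


def load_checkpoint(revision: str = "main", repo_id: str = MODEL_ID):
    """Load the model at a given revision (branch, tag, or commit hash)."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        repo_id,
        revision=revision,  # assumption: intermediate checkpoints live under revisions like this
        torch_dtype="auto",
        device_map="auto",
    )


if __name__ == "__main__":
    # Inspect the available names first, then pass one of them to load_checkpoint.
    print(list_revisions())
```

Listing the revisions before loading avoids guessing checkpoint names, which are not documented here.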

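Intended use and limitations

Intended use cases: SnakModel is built for Danish, and the instruction-tuned version is intended for assistant-like chat. It follows the Llama 2 (chat) instruction template, in which instructions are wrapped in special tokens: [INST] {instruction} [/INST].
Limitations: SnakModel variants are fine-tuned on Danish data, so use in other languages is out of scope. Although SnakModel is more proficient in Danish than other Llama 2-based models, it still frequently generates factually incorrect output. Evaluate and weigh these factors carefully before deploying the model, and comply with the original Llama 2 license agreement.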
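
The instruction template described above can be sketched as a small string helper. This is an illustration only: build_prompt is a hypothetical name, and it covers just the single-turn [INST] ... [/INST] wrapping named on this page. Any additional special tokens (such as a beginning-of-sequence token) are assumed to be added by the tokenizer, and using the tokenizer's own chat template is likely the safer route in practice.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a single instruction in the [INST] ... [/INST] template."""
    # Strip surrounding whitespace so the template spacing stays consistent.
    return f"[INST] {instruction.strip()} [/INST]"


print(build_prompt("Hvor ligger IT-Universitetet?"))
# prints: [INST] Hvor ligger IT-Universitetet? [/INST]
```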

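Hardware and software

Training factors: SnakModel was trained on private infrastructure using a single node with four NVIDIA A100-PCIe 40GB GPUs. The node has an AMD Epyc 7662 128-core processor and 1 TB of RAM.
Carbon footprint: total training time was 8,928 GPU hours at an average carbon efficiency of 0.122 kg CO2eq/kWh. According to the Machine Learning Impact calculator, this corresponds to emissions of 272.3 kg CO2eq.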
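
The carbon figure above can be reproduced with simple arithmetic. The sketch below assumes a power draw of 250 W per GPU (the rated TDP of the A100 PCIe 40GB, which this page does not state) and, for the wall-clock estimate, that all four GPUs ran in parallel for the whole time. It is not the calculator's actual implementation.

```python
GPU_HOURS = 8928           # total GPU hours, from the card
CARBON_INTENSITY = 0.122   # kg CO2eq per kWh, from the card
NUM_GPUS = 4               # from the card
GPU_POWER_KW = 0.250       # assumption: 250 W TDP per A100 PCIe 40GB

# Energy used = GPU hours x power per GPU.
energy_kwh = GPU_HOURS * GPU_POWER_KW
# Emissions = energy x carbon intensity of the electricity.
emissions_kg = energy_kwh * CARBON_INTENSITY
# Rough wall-clock duration if all GPUs were busy in parallel.
wall_clock_days = GPU_HOURS / NUM_GPUS / 24

print(f"{energy_kwh:.0f} kWh, {emissions_kg:.1f} kg CO2eq, {wall_clock_days:.0f} days")
# prints: 2232 kWh, 272.3 kg CO2eq, 93 days
```

Under these assumptions the result matches the reported 272.3 kg CO2eq, and 8,928 GPU hours correspond to roughly 93 days of wall-clock time on the four-GPU node.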

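Training data

Overview: SnakModel was continuously pre-trained on a diverse Danish corpus of 350 million documents and 13.6 billion words. The instruction-tuned version was further fine-tuned on 3.7 million Danish instruction-answer pairs.
Data freshness: the pre-training data has a cutoff of January 2024.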

Evaluation results

Model	LA (mF1)	NER (μF1)	Senti (mF1)	Summ (BERTScore)	CSR (Acc.)	QA (F1)	TM (Acc.)	CT (Acc.)	AVG
LLaMA2-7B_base	33.43	22.31	61.54	65.50	29.76	63.54	38.69	57.05	46.48
LLaMA2-7B_chat	47.42	24.63	62.35	66.15	32.24	61.34	46.67	55.18	49.50
LLaMA2-7B_base + INST_da	36.10	28.48	62.86	66.43	29.04	64.40	49.10	58.46	49.35
LLaMA2-7B_chat + INST_da	43.40	29.70	65.92	65.81	30.95	62.46	57.26	55.59	51.39
Viking-7B	33.67	17.18	49.48	61.96	25.11	56.29	23.97	34.90	37.82
SnakModel-7B_base	56.28	19.91	57.42	58.95	30.47	18.52	69.14	60.93	46.45
SnakModel-7B_inst	52.91	29.76	66.70	66.61	29.46	64.66	71.05	71.88	56.63
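
The AVG column appears to be the unweighted mean of the eight task scores. A short check for three of the rows, under that assumption; a tolerance of about 0.01 is allowed because the reported averages may have been computed from unrounded scores.

```python
from statistics import mean

# Task scores in table order: LA, NER, Senti, Summ, CSR, QA, TM, CT.
scores = {
    "LLaMA2-7B_chat": [47.42, 24.63, 62.35, 66.15, 32.24, 61.34, 46.67, 55.18],
    "SnakModel-7B_base": [56.28, 19.91, 57.42, 58.95, 30.47, 18.52, 69.14, 60.93],
    "SnakModel-7B_inst": [52.91, 29.76, 66.70, 66.61, 29.46, 64.66, 71.05, 71.88],
}
# AVG values as reported in the table.
reported_avg = {
    "LLaMA2-7B_chat": 49.50,
    "SnakModel-7B_base": 46.45,
    "SnakModel-7B_inst": 56.63,
}

for model, task_scores in scores.items():
    computed = mean(task_scores)
    matches = abs(computed - reported_avg[model]) < 0.015
    print(f"{model}: computed {computed:.2f}, reported {reported_avg[model]:.2f}, match={matches}")
```

Because the mean is unweighted, a single weak task (such as the base model's QA score of 18.52) pulls the average down noticeably.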

Citation

@inproceedings{zhang-etal-2025-snakmodel,
    title = "{SnakModel}: {Lessons} Learned from Training an Open {Danish} Large Language Model",
    author = {Zhang, Mike  and
      M{\"u}ller-Eberstein, Max  and
      Bassignana, Elisa  and
      Goot, Rob van der},
    editor = "Johansson, Richard  and
      Stymne, Sara",
    booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
    month = mar,
    year = "2025",
    address = "Tallinn, Estonia",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2025.nodalida-1.80/",
    pages = "812--825",
    ISBN = "978-9908-53-109-0",
    abstract = "We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints."
}