🚀 SnakModel
SnakModel is a 7-billion-parameter model designed specifically for Danish. Built on the Llama 2 architecture and continuously pre-trained and fine-tuned on a large Danish corpus, it handles Danish-language tasks effectively and provides strong support for Danish natural language processing.
🚀 Quick Start
The following snippet uses `apply_chat_template` to show how to load the tokenizer and model and generate content.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NLPnorth/snakmodel-7b-instruct"

# Load the instruction-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Hvor ligger IT Universitet?"
messages = [
    {"role": "system", "content": "Du er Snakmodel, skabt af IT-Universitetet i København. Du er en hjælpsom assistent."},
    {"role": "user", "content": prompt}
]

# Render the chat messages into the model's prompt format
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=20
)

# Strip the prompt tokens so only the newly generated answer remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
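For quick experiments, recent versions of Transformers also accept the chat messages directly through the `text-generation` pipeline. The sketch below is a convenience variant of the snippet above, not an official example from the model card; it assumes the pipeline applies the model's chat template automatically.

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="NLPnorth/snakmodel-7b-instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Du er Snakmodel, skabt af IT-Universitetet i København. Du er en hjælpsom assistent."},
    {"role": "user", "content": "Hvor ligger IT Universitet?"},
]

# The pipeline returns the full conversation; the last message is the model's reply
out = pipe(messages, max_new_tokens=20)
print(out[0]["generated_text"][-1]["content"])
```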
✨ Key Features
- Designed for Danish: built on the Llama 2 architecture and continuously pre-trained and fine-tuned on a rich Danish corpus, giving it stronger Danish-language capabilities.
- Multiple versions: an instruction-tuned version and a base version are available, and each model also includes intermediate checkpoints.
- Fixed prompt template: input follows the `[INST] {instruction} [/INST]` template, which makes the model straightforward to prompt.
📚 Documentation
Model Details
- Model developers: the NLPnorth research unit at the IT University of Copenhagen, Denmark.
- Variants: SnakModel comes in an instruction-tuned and a base version. Each model also includes intermediate checkpoints under model revisions (see the loading sketch after this list).
- Input: text only. Instructions must follow the `[INST] {instruction} [/INST]` template.
- Output: text only.
- Model architecture: SnakModel is an auto-regressive language model based on the Transformer architecture. The instruction-tuned version uses supervised fine-tuning (SFT) to enable Danish instruction following.
- Model dates: SnakModel was trained between January 2024 and September 2024.
- License: the model is released under the original Llama 2 license.
- Research paper: planned for release in the first quarter of 2025.
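The intermediate checkpoints are published as separate model revisions on the Hugging Face Hub. As a minimal sketch (the revision name below is a placeholder, not an actual branch of the repository; check the repository's branches and tags for the real names), a checkpoint can be loaded by passing `revision=` to `from_pretrained`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NLPnorth/snakmodel-7b-instruct"
revision = "checkpoint-10000"  # placeholder revision name, not an actual branch

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision=revision,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
```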
Intended Use and Limitations
- Intended use cases: SnakModel is designed for Danish, and the instruction-tuned version is intended for assistant-like chat. The instruction-tuned version follows the Llama 2 (chat) instruction template, in which instructions are wrapped in special tokens, i.e. `[INST] {instruction} [/INST]` (illustrated in the sketch below).
- Limitations: the SnakModel variants are fine-tuned on Danish data, so use in other languages is out of scope. Although SnakModel is more proficient in Danish than other Llama 2-based models, it still frequently produces factually incorrect output. Carefully evaluate and weigh these factors before deploying the model, and make sure to comply with the original Llama 2 license.
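To make the template concrete, the sketch below builds the same prompt both by hand and via `apply_chat_template`. Assuming the tokenizer ships a Llama 2-style chat template (as used in the quick-start snippet), the two prompts should match up to special tokens such as `<s>` and any system prompt.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-instruct")

instruction = "Hvor ligger IT Universitet?"

# Prompt wrapped by hand, following the Llama 2 (chat) template described above
manual_prompt = f"[INST] {instruction} [/INST]"

# Prompt rendered by the tokenizer's chat template
templated_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}],
    tokenize=False,
    add_generation_prompt=True,
)

print(manual_prompt)
print(templated_prompt)
```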
Hardware and Software
- Training factors: SnakModel was trained on private infrastructure with a single node containing four NVIDIA A100-PCIe 40GB GPUs. The node has an AMD Epyc 7662 128-core processor and 1 TB of RAM.
- Carbon footprint: total training time was 8,928 GPU hours with an average carbon efficiency of 0.122 kg CO₂eq/kWh. According to the Machine Learning Impact Calculator, this corresponds to 272.3 kg CO₂eq emitted.
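As a back-of-the-envelope check on these figures (assuming an average draw of roughly 250 W per A100-PCIe GPU, which is an assumption and not a number reported in the model card):

```python
# Rough sanity check of the reported carbon footprint
gpu_hours = 8928          # total GPU hours reported above
avg_power_kw = 0.250      # assumed average draw per GPU in kW (≈ A100-PCIe TDP)
carbon_intensity = 0.122  # kg CO2eq per kWh, as reported above

energy_kwh = gpu_hours * avg_power_kw      # ≈ 2232 kWh
emissions = energy_kwh * carbon_intensity  # ≈ 272.3 kg CO2eq
print(f"{energy_kwh:.0f} kWh -> {emissions:.1f} kg CO2eq")
```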
Training Data
- Overview: SnakModel was continuously pre-trained on a diverse Danish corpus of 350 million documents and 13.6 billion words. The instruction-tuned version was further fine-tuned on 3.7 million Danish instruction-answer pairs.
- Data freshness: the pre-training data has a cutoff of January 2024.
Evaluation Results
| Model | LA (mF1) | NER (μF1) | Senti (mF1) | Summ (BERTScore) | CSR (Acc.) | QA (F1) | TM (Acc.) | CT (Acc.) | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B_base | 33.43 | 22.31 | 61.54 | 65.50 | 29.76 | 63.54 | 38.69 | 57.05 | 46.48 |
| LLaMA2-7B_chat | 47.42 | 24.63 | 62.35 | 66.15 | 32.24 | 61.34 | 46.67 | 55.18 | 49.50 |
| LLaMA2-7B_base + INST₍da₎ | 36.10 | 28.48 | 62.86 | 66.43 | 29.04 | 64.40 | 49.10 | 58.46 | 49.35 |
| LLaMA2-7B_chat + INST₍da₎ | 43.40 | 29.70 | 65.92 | 65.81 | 30.95 | 62.46 | 57.26 | 55.59 | 51.39 |
| Viking-7B | 33.67 | 17.18 | 49.48 | 61.96 | 25.11 | 56.29 | 23.97 | 34.90 | 37.82 |
| SnakModel-7B_base | 56.28 | 19.91 | 57.42 | 58.95 | 30.47 | 18.52 | 69.14 | 60.93 | 46.45 |
| SnakModel-7B_inst | 52.91 | 29.76 | 66.70 | 66.61 | 29.46 | 64.66 | 71.05 | 71.88 | 56.63 |
Citation
```bibtex
@inproceedings{zhang-etal-2025-snakmodel,
title = "{SnakModel}: {Lessons} Learned from Training an Open {Danish} Large Language Model",
author = {Zhang, Mike and
M{\"u}ller-Eberstein, Max and
Bassignana, Elisa and
Goot, Rob van der},
editor = "Johansson, Richard and
Stymne, Sara",
booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
month = mar,
year = "2025",
address = "Tallinn, Estonia",
publisher = "University of Tartu Library",
url = "https://aclanthology.org/2025.nodalida-1.80/",
pages = "812--825",
ISBN = "978-9908-53-109-0",
abstract = "We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints."
}
```
📄 License
This model is released under the original Llama 2 license.