🚀 SnakModel
SnakModel is a 7-billion-parameter model designed specifically for Danish. Built on the Llama 2 architecture and continuously pre-trained and fine-tuned on a large Danish corpus, it handles Danish-language tasks effectively and provides strong support for Danish natural language processing.
🚀 Quick Start
The following snippet uses `apply_chat_template` to show how to load the tokenizer and model and generate content.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NLPnorth/snakmodel-7b-instruct"

# Load the instruction-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Hvor ligger IT Universitet?"
messages = [
    {"role": "system", "content": "Du er Snakmodel, skabt af IT-Universitetet i København. Du er en hjælpsom assistent."},
    {"role": "user", "content": prompt}
]

# Render the chat messages into the model's prompt format
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=20
)

# Strip the prompt tokens so only the newly generated answer remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
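For quick experiments, recent versions of Transformers also accept the chat messages directly through the `text-generation` pipeline. The sketch below is a convenience variant of the snippet above, not an official example from the model card; it assumes the pipeline applies the model's chat template automatically.

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="NLPnorth/snakmodel-7b-instruct",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Du er Snakmodel, skabt af IT-Universitetet i København. Du er en hjælpsom assistent."},
    {"role": "user", "content": "Hvor ligger IT Universitet?"},
]

# The pipeline returns the full conversation; the last message is the model's reply
out = pipe(messages, max_new_tokens=20)
print(out[0]["generated_text"][-1]["content"])
```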
✨ Key Features
- Designed for Danish: built on the Llama 2 architecture and continuously pre-trained and fine-tuned on a rich Danish corpus, giving it stronger Danish-language capabilities.
- Multiple versions: an instruction-tuned version and a base version are available, and each model also includes intermediate checkpoints.
- Fixed prompt template: input follows the `[INST] {instruction} [/INST]` template, which makes the model straightforward to prompt.
📚 Documentation
Model Details
- Model developers: the NLPnorth research unit at the IT University of Copenhagen, Denmark.
- Variants: SnakModel comes in an instruction-tuned and a base version. Each model also includes intermediate checkpoints under model revisions (see the loading sketch after this list).
- Input: text only. Instructions must follow the `[INST] {instruction} [/INST]` template.
- Output: text only.
- Model architecture: SnakModel is an auto-regressive language model based on the Transformer architecture. The instruction-tuned version uses supervised fine-tuning (SFT) to enable Danish instruction following.
- Model dates: SnakModel was trained between January 2024 and September 2024.
- License: the model is released under the original Llama 2 license.
- Research paper: planned for release in the first quarter of 2025.
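The intermediate checkpoints are published as separate model revisions on the Hugging Face Hub. As a minimal sketch (the revision name below is a placeholder, not an actual branch of the repository; check the repository's branches and tags for the real names), a checkpoint can be loaded by passing `revision=` to `from_pretrained`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NLPnorth/snakmodel-7b-instruct"
revision = "checkpoint-10000"  # placeholder revision name, not an actual branch

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    revision=revision,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
```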
Intended Use and Limitations
- Intended use cases: SnakModel is designed for Danish, and the instruction-tuned version is intended for assistant-like chat. The instruction-tuned version follows the Llama 2 (chat) instruction template, in which instructions are wrapped in special tokens, i.e. `[INST] {instruction} [/INST]` (illustrated in the sketch below).
- Limitations: the SnakModel variants are fine-tuned on Danish data, so use in other languages is out of scope. Although SnakModel is more proficient in Danish than other Llama 2-based models, it still frequently produces factually incorrect output. Carefully evaluate and weigh these factors before deploying the model, and make sure to comply with the original Llama 2 license.
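To make the template concrete, the sketch below builds the same prompt both by hand and via `apply_chat_template`. Assuming the tokenizer ships a Llama 2-style chat template (as used in the quick-start snippet), the two prompts should match up to special tokens such as `<s>` and any system prompt.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NLPnorth/snakmodel-7b-instruct")

instruction = "Hvor ligger IT Universitet?"

# Prompt wrapped by hand, following the Llama 2 (chat) template described above
manual_prompt = f"[INST] {instruction} [/INST]"

# Prompt rendered by the tokenizer's chat template
templated_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}],
    tokenize=False,
    add_generation_prompt=True,
)

print(manual_prompt)
print(templated_prompt)
```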
Hardware and Software
- Training factors: SnakModel was trained on private infrastructure with a single node containing four NVIDIA A100-PCIe 40GB GPUs. The node has an AMD Epyc 7662 128-core processor and 1 TB of RAM.
- Carbon footprint: total training time was 8,928 GPU hours with an average carbon efficiency of 0.122 kg CO₂eq/kWh. According to the Machine Learning Impact Calculator, this corresponds to 272.3 kg CO₂eq emitted.
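As a back-of-the-envelope check on these figures (assuming an average draw of roughly 250 W per A100-PCIe GPU, which is an assumption and not a number reported in the model card):

```python
# Rough sanity check of the reported carbon footprint
gpu_hours = 8928          # total GPU hours reported above
avg_power_kw = 0.250      # assumed average draw per GPU in kW (≈ A100-PCIe TDP)
carbon_intensity = 0.122  # kg CO2eq per kWh, as reported above

energy_kwh = gpu_hours * avg_power_kw      # ≈ 2232 kWh
emissions = energy_kwh * carbon_intensity  # ≈ 272.3 kg CO2eq
print(f"{energy_kwh:.0f} kWh -> {emissions:.1f} kg CO2eq")
```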
Training Data
- Overview: SnakModel was continuously pre-trained on a diverse Danish corpus of 350 million documents and 13.6 billion words. The instruction-tuned version was further fine-tuned on 3.7 million Danish instruction-answer pairs.
- Data freshness: the pre-training data has a cutoff of January 2024.
Evaluation Results
| Model | LA (mF1) | NER (μF1) | Senti (mF1) | Summ (BERTScore) | CSR (Acc.) | QA (F1) | TM (Acc.) | CT (Acc.) | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B_base | 33.43 | 22.31 | 61.54 | 65.50 | 29.76 | 63.54 | 38.69 | 57.05 | 46.48 |
| LLaMA2-7B_chat | 47.42 | 24.63 | 62.35 | 66.15 | 32.24 | 61.34 | 46.67 | 55.18 | 49.50 |
| LLaMA2-7B_base + INST₍da₎ | 36.10 | 28.48 | 62.86 | 66.43 | 29.04 | 64.40 | 49.10 | 58.46 | 49.35 |
| LLaMA2-7B_chat + INST₍da₎ | 43.40 | 29.70 | 65.92 | 65.81 | 30.95 | 62.46 | 57.26 | 55.59 | 51.39 |
| Viking-7B | 33.67 | 17.18 | 49.48 | 61.96 | 25.11 | 56.29 | 23.97 | 34.90 | 37.82 |
| SnakModel-7B_base | 56.28 | 19.91 | 57.42 | 58.95 | 30.47 | 18.52 | 69.14 | 60.93 | 46.45 |
| SnakModel-7B_inst | 52.91 | 29.76 | 66.70 | 66.61 | 29.46 | 64.66 | 71.05 | 71.88 | 56.63 |
Citation
```bibtex
@inproceedings{zhang-etal-2025-snakmodel,
title = "{SnakModel}: {Lessons} Learned from Training an Open {Danish} Large Language Model",
author = {Zhang, Mike and
M{\"u}ller-Eberstein, Max and
Bassignana, Elisa and
Goot, Rob van der},
editor = "Johansson, Richard and
Stymne, Sara",
booktitle = "Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)",
month = mar,
year = "2025",
address = "Tallinn, Estonia",
publisher = "University of Tartu Library",
url = "https://aclanthology.org/2025.nodalida-1.80/",
pages = "812--825",
ISBN = "978-9908-53-109-0",
abstract = "We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words, and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning training process itself, including the analysis of intermediate training dynamics, and ablations across different hyperparameters; (3) an evaluation on eight language and culturally-specific tasks. Across these experiments SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing, and establish training guidelines for languages with similar resource constraints."
}
```
📄 License
This model is released under the original Llama 2 license.