Llama3-German-8B-32k: Open-Source German Large Language Model Optimized for German with Long-Context Support

Llama3 German 8B 32k

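Developed by DiscoResearch

A German large language model based on Meta's Llama3-8B, continually pretrained on 65 billion German tokens, optimized for German, and supporting a 32k-token context

Large language model

Transformers

German #German-optimized #long-context-support #instruction-tuning-compatible

Downloads: 91

Released: 5/24/2024

Model overview

This model is a Llama3 variant optimized for German. Training on a large volume of high-quality German data substantially improves its German understanding and generation while preserving its English capabilities.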

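Model features

German optimization

Continued pretraining on 65 billion high-quality German tokens substantially improves German performance

Long-context support

Handles contexts of 32k tokens

Multilingual retention

German gains come without a significant drop in the original English capability

Efficient training

An optimized document packing strategy achieves a packing efficiency above 99%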

Model capabilities

German text generation

German language understanding

Long-document processing

Multilingual support

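Use cases

Academic research

German academic writing

Assists in drafting German academic papers or reports

Generates German text that follows academic conventions

Commercial applications

German content creation

Generates marketing copy, product descriptions, and other commercial content

Produces natural, fluent German business text

Education

German learning aid

Serves as a language practice tool for learners of German

Provides accurate examples of German grammar and expression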

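🚀 Llama3-German-8B-32k (version 0.1)

Llama3-German-8B-32k is a large language model based on Meta's Llama3-8B and specialized for German. Continued pretraining on 65 billion high-quality German tokens gives it strong performance on German tasks while keeping its performance on English tasks stable.

🚀 Quick start

This is a base model and will likely need finetuning before use. Finetuned and long-context versions are available in our collection.

The following example uses the transformers library. Note that it loads the instruction-tuned variant, DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1, rather than this base model: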

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1"

# device_map="auto" places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Schreibe ein Essay über die Bedeutung der Energiewende für Deutschlands Wirtschaft"
messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": prompt}
]
# Render the chat messages into a single prompt string
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Move the inputs to the device the model was loaded on
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Unpacking model_inputs passes the attention mask along with the input ids
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only the newly generated text remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

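✨ Key features

German optimization: continued pretraining on 65 billion high-quality German tokens gives the model strong performance on German tasks, markedly reducing grammatical errors and improving language understanding and reasoning.
Long-context processing: a long-context version (trained at 32k context length) can process contexts of up to 65k tokens, which suits tasks that involve long texts.
Multiple versions: base, long-context, and instruction-tuned configurations are available to cover different application scenarios.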

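📚 Detailed documentation

Model training and hyperparameters

The model was trained for roughly 60 hours on 128 GPUs on hessian.Ai 42. The detailed hyperparameters are:

Parameter	Details
Sequence length	8192 tokens
Learning rate	1.5e-5 to 1.5e-6 (cosine schedule)
Batch size	4194304 (512*8192) tokens
Micro-batch size	4*8192 tokens
Training steps	15500
Warmup steps	155 (1%)
Weight decay	0.05
Optimizer	AdamW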
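
As a quick sanity check on the table above, the 65 billion token budget follows from training steps times batch size, and the learning-rate schedule can be sketched as warmup followed by cosine decay. This is a minimal sketch only: the linear shape of the warmup and the function name `learning_rate` are assumptions, not taken from the actual training code.

```python
import math

SEQ_LEN = 8192
BATCH_TOKENS = 512 * SEQ_LEN      # 4,194,304 tokens per optimizer step
TOTAL_STEPS = 15_500
WARMUP_STEPS = 155                # 1% of the total steps
PEAK_LR, MIN_LR = 1.5e-5, 1.5e-6


def learning_rate(step: int) -> float:
    """Hypothetical schedule: linear warmup (assumed), then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


total_tokens = TOTAL_STEPS * BATCH_TOKENS
print(f"total tokens seen: {total_tokens / 1e9:.1f}B")
print(f"lr at step 0:      {learning_rate(0):.3e}")
print(f"lr at last step:   {learning_rate(TOTAL_STEPS):.3e}")
```

The product of steps and batch size lands at roughly 65 billion tokens, which matches the size of the pretraining corpus quoted elsewhere in this card.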

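Data collection and preprocessing

For pretraining, we used 65 billion German tokens from the occiglot-fineweb-0.5 dataset. The data comprises several curated datasets from LLM-Datasets as well as 12 Common-Crawl releases processed with OSCAR's Ungoliant pipeline.

All data was further filtered with a set of language-specific filters based on Huggingface's fine-web and globally deduplicated.

For more information, see the dataset card and the accompanying blog post.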

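Evaluation and results

We evaluated the model on a set of common English benchmarks and their German counterparts from GermanBench.

The following figure compares the benchmark results with the base model meta-llama/Meta-Llama3-8B and with two different hyperparameter configurations. We swept several learning rates to identify a setup that works well. The final released model is the version trained with a learning rate of 1.5e-5.

[Figure: benchmark comparison]

Detailed benchmark scores for the base and long-context models are listed in the table below: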

Model	truthful_qa_de	truthfulqa_mc	arc_challenge	arc_challenge_de	hellaswag	hellaswag_de	MMLU	MMLU-DE	Average
DiscoResearch/Llama3-German-8B	0.49499	0.44838	0.55802	0.49829	0.79924	0.65395	0.62240	0.54413	0.57743
DiscoResearch/Llama3-German-8B-32k	0.48920	0.45138	0.54437	0.49232	0.79078	0.64310	0.58774	0.47971	0.55982
meta-llama/Meta-Llama-3-8B-Instruct	0.47498	0.43923	0.59642	0.47952	0.82025	0.60008	0.66658	0.53541	0.57656

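Long-context extension

In addition to the base model, we release a long-context version of Llama3-German-8B (DiscoResearch/Llama3-German-8B-32k) that can process contexts of up to 65k tokens. This variant was trained on an additional 100 million tokens at 32k context length, using a rope_theta value of 1.5e6 and a learning rate of 1.5e-5 with a batch size of 256*8192 tokens; all other hyperparameters match the base model.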
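
To give a feel for why a larger `rope_theta` helps at longer contexts, the sketch below computes the wavelength of the slowest-rotating rotary frequency for two base values. It assumes the standard RoPE formulation, a head dimension of 128, and a stock Llama 3 base of 500000; these are assumptions about the underlying architecture rather than statements from this model card, and `longest_wavelength` is an illustrative helper name.

```python
import math


def longest_wavelength(rope_theta: float, head_dim: int = 128) -> float:
    """Wavelength, in tokens, of the slowest-rotating RoPE frequency pair.

    Standard RoPE uses inverse frequencies theta^(-2i/d) for i in [0, d/2).
    The slowest pair is the one with i = d/2 - 1.
    """
    i = head_dim // 2 - 1
    inv_freq = rope_theta ** (-2.0 * i / head_dim)
    return 2.0 * math.pi / inv_freq


default = longest_wavelength(500_000.0)   # assumed stock Llama 3 base
extended = longest_wavelength(1.5e6)      # value reported for the 32k variant

print(f"default base:  {default:,.0f} tokens")
print(f"extended base: {extended:,.0f} tokens")
print(f"ratio:         {extended / default:.2f}x")
```

A larger base stretches every rotary wavelength, so positions far apart in a long sequence map to rotation angles closer to those seen during the original pretraining.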

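Instruction tuning

We also provide an instruction-tuned version, DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1, finetuned on the DiscoLM German dataset (a long-context version is also available: DiscoResearch/Llama3-DiscoLeo-Instruct-8B-32k-v0.1).

For more details, see the respective model cards. You can also check out our experimental merge (DiscoResearch/Llama3-DiscoLeo-8B-DARE-Experimental), which attempts to combine the strong capabilities of meta-llama/Meta-Llama3-8B-Instruct with the German skills of our finetuned model.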

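Document packing

We used a smarter document packing strategy based on the paper "Fewer Truncations Improve Language Modeling" (Ding et al., 2024), which uses a first-fit-decreasing algorithm to pack documents into batches without truncation.

We packed the data in groups of 10000 documents for more efficient processing while maintaining a packing efficiency above 99%. Documents longer than the sequence length are split into chunks of sequence length.

Trained on the same data with the same hyperparameters, this approach yields higher benchmark scores overall. Below are results from an initial experiment with a 3e-5 learning rate and 12k steps, showing improvements similar to those reported in the original paper:

Task	Naive packing	Fewer-truncations packing	Percentage increase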
truthfulqa_mc	0.452648	0.467687	3.32%
arc_challenge	0.517918	0.528157	1.98%
truthful_qa_de	0.485529	0.492979	1.53%
arc_challenge_de	0.480375	0.493174	2.66%
hellaswag	0.776041	0.773352	-0.35%
hellaswag_de	0.655248	0.653356	-0.29%
MMLU	0.573719	0.579802	1.06%
MMLU-DE	0.504509	0.503863	-0.13%
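
The percentage column can be reproduced from the two score columns. The sketch below assumes the column reports relative change against the naive-packing score, which is consistent with the values in the table; `percent_improvement` is an illustrative helper name.

```python
# (naive packing, fewer-truncations packing) scores copied from the table above
scores = {
    "truthfulqa_mc": (0.452648, 0.467687),
    "arc_challenge": (0.517918, 0.528157),
    "truthful_qa_de": (0.485529, 0.492979),
    "arc_challenge_de": (0.480375, 0.493174),
    "hellaswag": (0.776041, 0.773352),
    "hellaswag_de": (0.655248, 0.653356),
    "MMLU": (0.573719, 0.579802),
    "MMLU-DE": (0.504509, 0.503863),
}


def percent_improvement(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


for task, (naive, packed) in scores.items():
    print(f"{task:18s} {percent_improvement(naive, packed):+.2f}%")

mean_gain = sum(percent_improvement(a, b) for a, b in scores.values()) / len(scores)
print(f"mean relative change: {mean_gain:+.2f}%")
```

Five of the eight tasks improve and three regress slightly, so the mean relative change gives a compact summary of the overall effect.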

Below is a simple implementation of the first-fit-decreasing algorithm described in the paper:

def pack_documents(tokenized_documents, max_length=8192):
    """Pack tokenized documents into bins of at most max_length tokens
    using the first-fit-decreasing heuristic."""
    # Sort documents by length, longest first
    sorted_docs = sorted(tokenized_documents, key=len, reverse=True)

    bins = []         # each bin is a list of documents
    bin_lengths = []  # running token count per bin, avoids re-summing

    for doc in sorted_docs:
        # Place the document in the first bin with enough remaining space
        for i, used in enumerate(bin_lengths):
            if used + len(doc) <= max_length:
                bins[i].append(doc)
                bin_lengths[i] += len(doc)
                break
        else:
            # No existing bin fits: open a new bin for this document.
            # Documents longer than max_length must be split beforehand.
            bins.append([doc])
            bin_lengths.append(len(doc))

    return bins
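
The description above mentions two details that the simple implementation leaves out: splitting documents longer than the sequence length, and measuring packing efficiency. The following sketch illustrates both under stated assumptions. The helper names `split_long_documents`, `first_fit_decreasing`, and `packing_efficiency` are hypothetical, and efficiency is taken to mean the fraction of token slots filled with real tokens, which the source does not define explicitly. The packer is restated only to keep the snippet self-contained.

```python
from typing import List, Sequence


def split_long_documents(docs: Sequence[Sequence[int]], max_len: int) -> List[List[int]]:
    """Split any document longer than max_len into chunks of at most max_len tokens."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc), max_len):
            chunks.append(list(doc[start:start + max_len]))
    return chunks


def first_fit_decreasing(docs: Sequence[Sequence[int]], max_len: int) -> List[list]:
    """Pack documents, longest first, into the first bin with enough room."""
    bins, used = [], []
    for doc in sorted(docs, key=len, reverse=True):
        for idx, filled in enumerate(used):
            if filled + len(doc) <= max_len:
                bins[idx].append(doc)
                used[idx] += len(doc)
                break
        else:
            bins.append([doc])
            used.append(len(doc))
    return bins


def packing_efficiency(bins: Sequence[Sequence[Sequence[int]]], max_len: int) -> float:
    """Fraction of available token slots occupied by real tokens (assumed definition)."""
    if not bins:
        return 1.0
    total_tokens = sum(len(doc) for b in bins for doc in b)
    return total_tokens / (len(bins) * max_len)


# Toy example with a sequence length of 8 instead of 8192
MAX_LEN = 8
docs = [[0] * n for n in (5, 3, 11, 2, 4, 7)]
chunks = split_long_documents(docs, MAX_LEN)
bins = first_fit_decreasing(chunks, MAX_LEN)
efficiency = packing_efficiency(bins, MAX_LEN)

print([[len(doc) for doc in b] for b in bins])
print(f"packing efficiency: {efficiency:.2%}")
```

With only a handful of documents the efficiency is well below the >99% reported above; packing larger groups gives the heuristic many more short documents to fill the remaining gaps.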