🚀 Llama3-German-8B-32k (version 0.1)
This model is a large language model specialized for the German language. It is based on Meta's Llama3-8B and is enhanced through continued pretraining on high-quality German tokens. It shows significant improvements in German language performance while maintaining reasonable English performance.
🚀 Quick Start
Here's how to use the model with transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    "DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1")

prompt = "Schreibe ein Essay über die Bedeutung der Energiewende für Deutschlands Wirtschaft"
messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
✨ Features
- German Specialization: Specialized for the German language through continued pretraining on 65 billion high-quality tokens.
- Long-Context Capability: A long-context version can process context lengths of up to 65k tokens.
- Instruction Tuning: An instruction-tuned version is available for better interaction.
- Intelligent Document Packing: Employs an intelligent document packing strategy for higher benchmark scores.
📦 Installation
No model-specific installation steps are required beyond the standard Hugging Face transformers stack used in the examples below.
💻 Usage Examples
Basic Usage
The basic usage example is identical to the Quick Start snippet above: load the model and tokenizer, apply the chat template with add_generation_prompt=True, and call model.generate.
Advanced Usage
The README does not provide dedicated advanced usage examples; a brief sketch for the long-context variant follows.
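The snippet below is a minimal sketch rather than an official example: it loads the long-context base model DiscoResearch/Llama3-German-8B-32k (a pretrained model, so it is prompted with plain text instead of the chat template) and continues a long German document. The file path and generation settings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DiscoResearch/Llama3-German-8B-32k"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder path to a German document that exceeds the usual 8k-token window.
with open("langer_bericht.txt", encoding="utf-8") as f:
    document = f.read()

prompt = document + "\n\nZusammenfassung:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```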
📚 Documentation
Model Introduction
This model is the long-context extension described below. Llama3-German-8B-v0.1 is based on Meta's Llama3-8B and is specialized for the German language.
Model Training and Hyperparameters
The model was trained on 128 GPUs on hessian.Ai 42 for ~60 hours.
| Parameter | Value |
| --- | --- |
| Sequence Length | 8192 tokens |
| Learning Rate | 1.5e-5 to 1.5e-6 (cosine schedule) |
| Batch Size | 4194304 (512*8192) tokens |
| Micro Batch Size | 4*8192 tokens |
| Training Steps | 15500 |
| Warmup Steps | 155 (1%) |
| Weight Decay | 0.05 |
| Optimizer | AdamW |
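For illustration only (this is not the project's training code), the schedule in the table can be sketched in plain PyTorch: 155 linear warmup steps followed by cosine decay from 1.5e-5 down to 1.5e-6 over the remaining steps. The placeholder module stands in for the actual model.

```python
import torch

# Placeholder module; the real training used Llama3-8B, not a linear layer.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5, weight_decay=0.05)

# 155 warmup steps (1% of 15500), then cosine decay to the 1.5e-6 floor.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=155)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15500 - 155, eta_min=1.5e-6)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[155])

for step in range(15500):
    # forward/backward pass over one 4,194,304-token batch would go here
    optimizer.step()
    scheduler.step()
```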
Data Collection and Preprocessing
For pre-training, 65B German tokens from the occiglot-fineweb-0.5 dataset were used. The data comes from multiple curated datasets and Common Crawl releases, and was further filtered and globally deduplicated.
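As a rough sketch of how such data could be inspected (the repository id, config name, and text field below are assumptions based on the dataset name above, not verified identifiers):

```python
from datasets import load_dataset

# Assumed Hub id and "de" config; check the Hub for the exact names.
dataset = load_dataset("occiglot/occiglot-fineweb-v0.5", "de", split="train", streaming=True)

for example in dataset.take(3):
    print(example["text"][:200])  # the field name "text" is also an assumption
```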
Evaluation and Results
The model was evaluated using a suite of common English benchmarks and their German counterparts with GermanBench.
| Model | truthful_qa_de | truthfulqa_mc | arc_challenge | arc_challenge_de | hellaswag | hellaswag_de | MMLU | MMLU-DE | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiscoResearch/Llama3-German-8B | 0.49499 | 0.44838 | 0.55802 | 0.49829 | 0.79924 | 0.65395 | 0.62240 | 0.54413 | 0.57743 |
| DiscoResearch/Llama3-German-8B-32k | 0.48920 | 0.45138 | 0.54437 | 0.49232 | 0.79078 | 0.64310 | 0.58774 | 0.47971 | 0.55982 |
| meta-llama/Meta-Llama-3-8B-Instruct | 0.47498 | 0.43923 | 0.59642 | 0.47952 | 0.82025 | 0.60008 | 0.66658 | 0.53541 | 0.57656 |
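The numbers in the table were produced with GermanBench; as a hedged illustration only, a similar run via the lm-evaluation-harness Python API might look like the sketch below, assuming a harness build (such as the GermanBench fork) that registers the German task names used in the table.

```python
# Hypothetical sketch: requires a harness build that provides the German tasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=DiscoResearch/Llama3-German-8B-32k,dtype=bfloat16",
    tasks=["arc_challenge_de", "hellaswag_de", "truthful_qa_de"],
    batch_size=8,
)
print(results["results"])
```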
Long-Context Extension
A long-context version of Llama3-German-8B (DiscoResearch/Llama3-German-8B-32k) can process context lengths of up to 65k tokens.
Instruction Tuning
An instruction-tuned version, DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1, is available.
Document Packing
An intelligent document packing strategy based on the "Fewer Truncations Improve Language Modeling" paper by Ding et al. is employed.
```python
def pack_documents(tokenized_documents):
    # First-fit-decreasing packing: place each document into the first bin
    # that still has room, so that every bin holds at most 8192 tokens.
    sorted_docs = sorted(tokenized_documents, key=len, reverse=True)
    bins = []

    def find_bin(doc):
        # Return the first existing bin the document fits into, else None.
        for b in bins:
            if sum(len(d) for d in b) + len(doc) <= 8192:
                return b
        return None

    for doc in sorted_docs:
        target_bin = find_bin(doc)
        if target_bin is not None:
            target_bin.append(doc)
        else:
            # No existing bin has room; start a new one.
            bins.append([doc])
    return bins
```
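A minimal illustration with dummy token-id lists shows how the strategy fills bins up to the 8192-token training sequence length:

```python
# Dummy "tokenized documents": plain lists standing in for token-id sequences.
docs = [list(range(n)) for n in (5000, 4000, 3000, 1000)]

bins = pack_documents(docs)
for i, b in enumerate(bins):
    print(f"bin {i}: {len(b)} documents, {sum(len(d) for d in b)} tokens")
# bin 0: 2 documents, 8000 tokens
# bin 1: 2 documents, 5000 tokens
```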
Model Configurations
- Base model with continued pretraining
- Long-context version (32k context length)
- Instruction-tuned version of the base model
- Instruction-tuned version of the long-context model
- Experimental DARE-TIES merge with Llama3-Instruct
- Collection of quantized versions (see the loading sketch below)
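The quantized collection contains pre-built files; as a separate, hedged illustration (not the settings used for those published quants), the instruct model can also be quantized on the fly with bitsandbytes:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit settings; the published quantized versions may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```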
🔧 Technical Details
The technical details of the model are covered in the sections above: the continued-pretraining hyperparameters, the data filtering and global deduplication pipeline, the long-context extension, and the document packing strategy.
📄 License
The model uses the Llama3 license.