🚀 Llama3-German-8B-32k (version 0.1)
This model is a large language model specialized for the German language. It is based on Meta's Llama3-8B and is enhanced through continued pretraining on high-quality German tokens. It shows significant improvements in German language performance while maintaining reasonable English performance.
🚀 Quick Start
Here's how to use the model with transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda"

model = AutoModelForCausalLM.from_pretrained(
    "DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1")

prompt = "Schreibe ein Essay über die Bedeutung der Energiewende für Deutschlands Wirtschaft"
messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
✨ Features
- German Specialization: Specialized for the German language through continued pretraining on 65 billion high-quality tokens.
- Long-Context Capability: A long-context version can process context lengths of up to 65k tokens.
- Instruction Tuning: An instruction-tuned version is available for better interaction.
- Intelligent Document Packing: Employs an intelligent document packing strategy for higher benchmark scores.
📦 Installation
No model-specific installation steps are required beyond the standard Hugging Face transformers stack used in the examples below.
💻 Usage Examples
Basic Usage
The basic usage example is identical to the Quick Start snippet above: load the model and tokenizer, apply the chat template with add_generation_prompt=True, and call model.generate.
Advanced Usage
The README does not provide dedicated advanced usage examples; a brief sketch for the long-context variant follows.
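The snippet below is a minimal sketch rather than an official example: it loads the long-context base model DiscoResearch/Llama3-German-8B-32k (a pretrained model, so it is prompted with plain text instead of the chat template) and continues a long German document. The file path and generation settings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DiscoResearch/Llama3-German-8B-32k"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder path to a German document that exceeds the usual 8k-token window.
with open("langer_bericht.txt", encoding="utf-8") as f:
    document = f.read()

prompt = document + "\n\nZusammenfassung:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```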
📚 Documentation
Model Introduction
This model is the long-context extension described below. Llama3-German-8B-v0.1 is based on Meta's Llama3-8B and is specialized for the German language.
Model Training and Hyperparameters
The model was trained on 128 GPUs on hessian.Ai 42 for ~60 hours.
| Parameter | Value |
| --- | --- |
| Sequence Length | 8192 tokens |
| Learning Rate | 1.5e-5 to 1.5e-6 (cosine schedule) |
| Batch Size | 4194304 (512*8192) tokens |
| Micro Batch Size | 4*8192 tokens |
| Training Steps | 15500 |
| Warmup Steps | 155 (1%) |
| Weight Decay | 0.05 |
| Optimizer | AdamW |
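For illustration only (this is not the project's training code), the schedule in the table can be sketched in plain PyTorch: 155 linear warmup steps followed by cosine decay from 1.5e-5 down to 1.5e-6 over the remaining steps. The placeholder module stands in for the actual model.

```python
import torch

# Placeholder module; the real training used Llama3-8B, not a linear layer.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5, weight_decay=0.05)

# 155 warmup steps (1% of 15500), then cosine decay to the 1.5e-6 floor.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=155)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=15500 - 155, eta_min=1.5e-6)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[155])

for step in range(15500):
    # forward/backward pass over one 4,194,304-token batch would go here
    optimizer.step()
    scheduler.step()
```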
Data Collection and Preprocessing
For pre-training, 65B German tokens from the occiglot-fineweb-0.5 dataset were used. The data comes from multiple curated datasets and Common Crawl releases, and was further filtered and globally deduplicated.
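As a rough sketch of how such data could be inspected (the repository id, config name, and text field below are assumptions based on the dataset name above, not verified identifiers):

```python
from datasets import load_dataset

# Assumed Hub id and "de" config; check the Hub for the exact names.
dataset = load_dataset("occiglot/occiglot-fineweb-v0.5", "de", split="train", streaming=True)

for example in dataset.take(3):
    print(example["text"][:200])  # the field name "text" is also an assumption
```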
Evaluation and Results
The model was evaluated using a suite of common English benchmarks and their German counterparts with GermanBench.
| Model | truthful_qa_de | truthfulqa_mc | arc_challenge | arc_challenge_de | hellaswag | hellaswag_de | MMLU | MMLU-DE | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiscoResearch/Llama3-German-8B | 0.49499 | 0.44838 | 0.55802 | 0.49829 | 0.79924 | 0.65395 | 0.62240 | 0.54413 | 0.57743 |
| DiscoResearch/Llama3-German-8B-32k | 0.48920 | 0.45138 | 0.54437 | 0.49232 | 0.79078 | 0.64310 | 0.58774 | 0.47971 | 0.55982 |
| meta-llama/Meta-Llama-3-8B-Instruct | 0.47498 | 0.43923 | 0.59642 | 0.47952 | 0.82025 | 0.60008 | 0.66658 | 0.53541 | 0.57656 |
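The numbers in the table were produced with GermanBench; as a hedged illustration only, a similar run via the lm-evaluation-harness Python API might look like the sketch below, assuming a harness build (such as the GermanBench fork) that registers the German task names used in the table.

```python
# Hypothetical sketch: requires a harness build that provides the German tasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=DiscoResearch/Llama3-German-8B-32k,dtype=bfloat16",
    tasks=["arc_challenge_de", "hellaswag_de", "truthful_qa_de"],
    batch_size=8,
)
print(results["results"])
```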
Long-Context Extension
A long-context version of Llama3-German-8B (DiscoResearch/Llama3-German-8B-32k) can process context lengths of up to 65k tokens.
Instruction Tuning
An instruction-tuned version, DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1, is available.
Document Packing
An intelligent document packing strategy based on the "Fewer Truncations Improve Language Modeling" paper by Ding et al. is employed.
```python
def pack_documents(tokenized_documents):
    # First-fit-decreasing packing: place each document into the first bin
    # that still has room, so that every bin holds at most 8192 tokens.
    sorted_docs = sorted(tokenized_documents, key=len, reverse=True)
    bins = []

    def find_bin(doc):
        # Return the first existing bin the document fits into, else None.
        for b in bins:
            if sum(len(d) for d in b) + len(doc) <= 8192:
                return b
        return None

    for doc in sorted_docs:
        target_bin = find_bin(doc)
        if target_bin is not None:
            target_bin.append(doc)
        else:
            # No existing bin has room; start a new one.
            bins.append([doc])
    return bins
```
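A minimal illustration with dummy token-id lists shows how the strategy fills bins up to the 8192-token training sequence length:

```python
# Dummy "tokenized documents": plain lists standing in for token-id sequences.
docs = [list(range(n)) for n in (5000, 4000, 3000, 1000)]

bins = pack_documents(docs)
for i, b in enumerate(bins):
    print(f"bin {i}: {len(b)} documents, {sum(len(d) for d in b)} tokens")
# bin 0: 2 documents, 8000 tokens
# bin 1: 2 documents, 5000 tokens
```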
Model Configurations
- Base model with continued pretraining
- Long-context version (32k context length)
- Instruction-tuned version of the base model
- Instruction-tuned version of the long-context model
- Experimental DARE-TIES merge with Llama3-Instruct
- Collection of quantized versions (see the loading sketch below)
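The quantized collection contains pre-built files; as a separate, hedged illustration (not the settings used for those published quants), the instruct model can also be quantized on the fly with bitsandbytes:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit settings; the published quantized versions may differ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```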
🔧 Technical Details
The technical details of the model are covered in the sections above: the continued-pretraining hyperparameters, the data filtering and global deduplication pipeline, the long-context extension, and the document packing strategy.
📄 License
The model uses the Llama3 license.