leo-hessianai-7b-chat: an open-source German language model available for commercial use


Leo Hessianai 7b Chat

Developed by LeoLM

The first open, commercially usable German foundation language model built on Llama-2, focused on German language processing

Large Language Model

Transformers

Supports Multiple Languages

#German large model #Multi-turn dialogue optimization #Commercially available

Downloads 2,263

Release Time : 9/10/2023

Model Overview

LeoLM is a German large language model built on the Llama-2 architecture. Continued pretraining on a large German corpus extends Llama-2's capabilities to German, which makes the model particularly suitable for German text generation and understanding tasks.

Model Features

German optimization

Trained specifically on German text; the chat model performs well on German writing, explanation, and discussion tasks

Long context support

Supports a context length of 8k tokens

Business-friendly

Adopts the Llama-2 community license, allowing commercial use

Dialogue optimization

The chat model version is specifically fine-tuned for German dialogue scenarios

Model Capabilities

German text generation

Multi-turn dialogue

German text understanding

Writing assistance

Content explanation

Use Cases

Content creation

German article writing

Generate high-quality German articles and content

Scored 5.875 in the writing category on MT-Bench-DE

Poetry creation

Generate German poetry and lyrics

The training data contains 490 German poetry samples

Educational assistance

Concept explanation

Explain complex concepts and topics in German

Scored 7.625 in the humanities category on MT-Bench-DE, its highest category score

Customer service

German customer service chatbot

Build a German customer service dialogue system

Scored 6.3 in the roleplay category on MT-Bench-DE

🚀 LAION LeoLM: Linguistically Enhanced Open Language Model

Meet LeoLM, the first open and commercially available German Foundation Language Model built on Llama-2. It extends Llama-2's capabilities into German through continued pretraining on a large corpus of German-language and mostly locality-specific text. With the support of a compute grant at HessianAI's new supercomputer 42, we're releasing two foundation models with an 8k context length: LeoLM/leo-hessianai-7b and LeoLM/leo-hessianai-13b, both under the Llama-2 community license (a 70b version is also coming soon! 😎). This release aims to bring new opportunities to German open-source and commercial LLM research and to accelerate adoption. For more details, check out our blog post or our upcoming paper (preprint soon).

A project by Björn Plüster and Christoph Schuhmann in collaboration with LAION and HessianAI.

🚀 Quick Start

✨ Features

Built on Llama-2, extending its capabilities to the German language.
Two foundation models with 8k context length, and a 70b model on the way.
A chat model LeoLM/leo-hessianai-7b-chat finetuned for German instruction tasks.

📦 Installation

First, install the direct dependencies:

pip install transformers torch sentencepiece

If you want faster inference using flash-attention2, install these additional dependencies:

pip install packaging ninja
pip install flash-attn==v2.1.1 --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git@v2.1.1#subdirectory=csrc/rotary
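Because the flash-attention2 dependencies are optional, it can help to check whether they are present before loading the model. The following is a minimal sketch using only the standard library; the helper name `flash_attn_available` is hypothetical, and tying the result to `trust_remote_code` follows the comment in the usage example of this card rather than any additional documentation.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the optional flash_attn package can be found in this environment."""
    # find_spec returns None for a top-level package that is not installed.
    return importlib.util.find_spec("flash_attn") is not None


# Assumption: pass this value as trust_remote_code when building the pipeline,
# mirroring the "True for flash-attn2 else False" comment in the usage example.
use_remote_code = flash_attn_available()
print(f"flash-attn installed: {use_remote_code}")
```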

💻 Usage Examples

Basic Usage

from transformers import pipeline
import torch

system_prompt = """<|im_start|>system
Dies ist eine Unterhaltung zwischen einem intelligenten, hilfsbereitem KI-Assistenten und einem Nutzer.
Der Assistent gibt ausführliche, hilfreiche und ehrliche Antworten.<|im_end|>

"""
prompt_format = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
prompt = "Erkläre mir wie die Fahrradwegesituation in Hamburg ist."

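# Set trust_remote_code=True to use flash-attn2; otherwise set it to False.
generator = pipeline(
    model="LeoLM/leo-hessianai-7b-chat",
    device="cuda",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
# Prepend the system prompt so the model receives the full ChatML conversation.
full_prompt = system_prompt + prompt_format.format(prompt=prompt)
print(generator(full_prompt, do_sample=True, top_p=0.95, max_length=8192))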

The model's response: "Als KI kann ich keine persönlichen Beobachtungen teilen, aber ich kann einige allgemeine Informationen zur Fahrradwegesituation in Hamburg liefern. Im Vergleich zu vielen anderen großen Städten hat Hamburg eine hohe Anzahl von Fahrradfahrern und nimmt seine Verantwortung für nachhaltige Verkehrsmittel sehr ernst. Es gibt viele Fahrradwege und separate Fahrspuren, die Radfahrern ermöglichen, zusammen mit dem Autoverkehr zu fahren. Diese Fahrradspuren sind oft mit Markierungen gekennzeichnet und durch physische Trennungen von anderen Fahrspuren abgegrenzt. Darüber hinaus gibt es viele Fahrradstraßen, auf denen Radfahrer Vorfahrt haben und Autos langsamer fahren müssen.

In einigen städtischen Gebieten können Fahrradwege jedoch eng oder überfüllt sein, besonders während der Stoßzeiten. Es gibt auch viele Kreuzungen, an denen Radfahrer anhalten und auf Grün warten müssen, ähnlich wie Autofahrer. Insgesamt ist die Fahrradinfrastruktur in Hamburg ziemlich gut, aber wie überall gibt es immer Raum für Verbesserungen."
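The text-generation pipeline returns a list of dictionaries whose `generated_text` field, by default, contains the prompt followed by the continuation. A small helper can isolate the assistant's reply. This is a sketch only: the name `extract_reply` is hypothetical, and whether the `<|im_end|>` marker appears in the decoded text depends on the tokenizer settings, so the code handles both cases.

```python
IM_END = "<|im_end|>"


def extract_reply(generated_text: str, prompt: str) -> str:
    """Return only the assistant's reply from a pipeline "generated_text" string."""
    # Drop the echoed prompt if the pipeline returned the full text.
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    # Cut at the end-of-turn marker if the model emitted one.
    return reply.split(IM_END, 1)[0].strip()


# Usage with a hand-written stand-in for real pipeline output.
example_prompt = "<|im_start|>user\nHallo<|im_end|>\n<|im_start|>assistant\n"
example_output = [{"generated_text": example_prompt + "Guten Tag!<|im_end|>\n"}]
print(extract_reply(example_output[0]["generated_text"], example_prompt))
```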

📚 Documentation

LeoLM Chat

LeoLM/leo-hessianai-7b-chat is a German chat model based on the foundation model LeoLM/leo-hessianai-7b, finetuned on selected German instruction datasets. It performs well on writing, explanation, and discussion tasks but has some challenges with math and advanced reasoning. See our MT-Bench-DE scores:

{
    "first_turn": 5.75,
    "second_turn": 4.45,
    "categories": {
        "writing": 5.875,
        "roleplay": 6.3,
        "reasoning": 3.5,
        "math": 2.85,
        "coding": 2.95,
        "extraction": 4.3,
        "stem": 7.4,
        "humanities": 7.625
    },
    "average": 5.1
}
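To read these scores programmatically, the sketch below recomputes the reported average from the per-category and per-turn values and ranks the categories. The numbers are copied from the JSON above; the variable names are illustrative only.

```python
mt_bench_de = {
    "first_turn": 5.75,
    "second_turn": 4.45,
    "categories": {
        "writing": 5.875,
        "roleplay": 6.3,
        "reasoning": 3.5,
        "math": 2.85,
        "coding": 2.95,
        "extraction": 4.3,
        "stem": 7.4,
        "humanities": 7.625,
    },
    "average": 5.1,
}

categories = mt_bench_de["categories"]
# Mean over the eight categories and mean over the two turns.
category_mean = sum(categories.values()) / len(categories)
turn_mean = (mt_bench_de["first_turn"] + mt_bench_de["second_turn"]) / 2

# Categories from strongest to weakest.
ranked = sorted(categories.items(), key=lambda item: item[1], reverse=True)

print(f"category mean: {category_mean:.3f}, turn mean: {turn_mean:.3f}")
for name, score in ranked:
    print(f"{name:<12}{score}")
```

Both means agree with the reported average of 5.1, and the ranking puts humanities and stem at the top with math and coding at the bottom, consistent with the description above.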

Model Details

Property	Details
Finetuned from	LeoLM/leo-hessianai-7b
Model Type	Causal decoder-only transformer language model
Language	English and German
Demo	Web Demo
License	LLAMA 2 COMMUNITY LICENSE AGREEMENT
Contact	LAION Discord or Björn Plüster

Prompting / Prompt Template

Prompt dialogue template (ChatML format):

"""
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""

The model input can contain multiple conversation turns between the user and the assistant, for example:

<|im_start|>user
{prompt 1}<|im_end|>
<|im_start|>assistant
{reply 1}<|im_end|>
<|im_start|>user
{prompt 2}<|im_end|>
<|im_start|>assistant
(...)
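The multi-turn layout above can be assembled with a small helper. This is a minimal sketch, assuming the conversation is given as a list of `(role, content)` pairs; the function name `build_chatml_prompt` and its parameters are hypothetical and not part of the model's API.

```python
def build_chatml_prompt(messages, system_message=None):
    """Render a conversation as a ChatML prompt that ends with an open assistant turn."""
    parts = []
    if system_message is not None:
        parts.append(f"<|im_start|>system\n{system_message}<|im_end|>\n")
    for role, content in messages:
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role!r}")
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    # Leave the assistant turn open so the model generates the next reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    ("user", "Hallo!"),
    ("assistant", "Hallo, wie kann ich helfen?"),
    ("user", "Erkläre mir wie die Fahrradwegesituation in Hamburg ist."),
]
print(build_chatml_prompt(conversation, system_message="Der Assistent gibt hilfreiche Antworten."))
```

The returned string can be passed to the generator in place of a single-turn prompt.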

🔧 Technical Details

Finetuning Details

Hyperparameter	Value
Num epochs	3
Examples per epoch	131214
Global batch size	256
Learning rate	3e-5
Warmup steps	100
LR scheduler	Cosine
Adam betas	(0.9, 0.95)
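As a rough illustration of what these hyperparameters imply, the sketch below derives the number of optimizer steps and evaluates a cosine schedule with warmup. It assumes linear warmup, a decay to zero, and that the final partial batch of each epoch is kept; none of these details are stated in this card, and the name `lr_at` is hypothetical.

```python
import math

NUM_EPOCHS = 3
EXAMPLES_PER_EPOCH = 131_214
GLOBAL_BATCH_SIZE = 256
PEAK_LR = 3e-5
WARMUP_STEPS = 100

# Assumption: the last, partially filled batch of each epoch is not dropped.
steps_per_epoch = math.ceil(EXAMPLES_PER_EPOCH / GLOBAL_BATCH_SIZE)
total_steps = NUM_EPOCHS * steps_per_epoch


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, then cosine decay to zero)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
for step in (0, 50, WARMUP_STEPS, total_steps // 2, total_steps):
    print(f"step {step:>5}: lr = {lr_at(step):.2e}")
```

Under these assumptions the run has 513 steps per epoch and 1539 steps in total, so the 100 warmup steps cover only a small fraction of training.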

Dataset Details

## Stats for 'Subset of OpenAssistant/OASST-DE' (3534 samples (100.0%))
-----------------
  Accepted: 3534/3534 (100.0%)
  Accepted tokens: 2259302
  Skipped: 0 (0.0%)
  Min tokens per sample: 29
  Max tokens per sample: 2484
  Avg tokens per sample: 639.3044708545557
-----------------

## Stats for 'Subset of FreedomIntelligence/evol-instruct-deutsch' (57841 samples (100.0%))
-----------------
  Accepted: 57841/57841 (100.0%)
  Accepted tokens: 42958192
  Skipped: 0 (0.0%)
  Min tokens per sample: 33
  Max tokens per sample: 5507
  Avg tokens per sample: 742.6944900675991
-----------------

## Stats for 'Subset of FreedomIntelligence/alpaca-gpt4-deutsch' (48969 samples (100.0%))
-----------------
  Accepted: 48969/48969 (100.0%)
  Accepted tokens: 13372005
  Skipped: 0 (0.0%)
  Min tokens per sample: 19
  Max tokens per sample: 1359
  Avg tokens per sample: 273.07082031489307
-----------------

## Stats for 'Subset of LeoLM/OpenSchnabeltier' (21314 samples (100.0%))
-----------------
  Accepted: 21314/21314 (100.0%)
  Accepted tokens: 8134690
  Skipped: 0 (0.0%)
  Min tokens per sample: 25
  Max tokens per sample: 1202
  Avg tokens per sample: 381.65947264708643
-----------------

## Stats for 'Subset of LeoLM/German_Poems' (490 samples (100.0%))
-----------------
  Accepted: 490/490 (100.0%)
  Accepted tokens: 618642
  Skipped: 0 (0.0%)
  Min tokens per sample: 747
  Max tokens per sample: 1678
  Avg tokens per sample: 1262.534693877551
-----------------

## Stats for 'Subset of LeoLM/German_Songs' (392 samples (100.0%))
-----------------
  Accepted: 392/392 (100.0%)
  Accepted tokens: 187897
  Skipped: 0 (0.0%)
  Min tokens per sample: 231
  Max tokens per sample: 826
  Avg tokens per sample: 479.3290816326531
-----------------

## Stats for 'total' (132540 samples (100.0%))
-----------------
  Accepted: 132540/132540 (100.0%)
  Accepted tokens: 67530728
  Skipped: 0 (0.0%)
  Min tokens per sample: 19
  Max tokens per sample: 5507
  Avg tokens per sample: 509.51205673758864
-----------------
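The per-subset statistics can be cross-checked against the totals and turned into mixture shares with a few lines of code. The sample and token counts below are copied from the stats above; the dictionary layout is just one convenient way to hold them.

```python
# (samples, accepted tokens) per finetuning subset, as reported above.
subsets = {
    "OpenAssistant/OASST-DE": (3534, 2259302),
    "FreedomIntelligence/evol-instruct-deutsch": (57841, 42958192),
    "FreedomIntelligence/alpaca-gpt4-deutsch": (48969, 13372005),
    "LeoLM/OpenSchnabeltier": (21314, 8134690),
    "LeoLM/German_Poems": (490, 618642),
    "LeoLM/German_Songs": (392, 187897),
}

total_samples = sum(samples for samples, _ in subsets.values())
total_tokens = sum(tokens for _, tokens in subsets.values())
print(f"total: {total_samples} samples, {total_tokens} tokens")

# Share of each subset by sample count and by token count.
for name, (samples, tokens) in subsets.items():
    print(f"{name:<45}{samples / total_samples:6.1%} of samples {tokens / total_tokens:6.1%} of tokens")
```

The sums match the reported totals of 132540 samples and 67530728 tokens. The evol-instruct subset accounts for about 44% of the samples but about 64% of the tokens, because its samples are longer on average.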

📄 License

This project is released under the LLAMA 2 COMMUNITY LICENSE AGREEMENT.

⚠️ Important Note

LeoLM has been tested in English and German, but it cannot cover all scenarios. As with all LLMs, the potential outputs of LeoLM/leo-hessianai-7b-chat cannot be predicted in advance, and the model may produce inaccurate, biased, or other objectionable responses to user prompts. Therefore, developers should perform safety testing and tuning tailored to their specific applications of the model before deployment. Please refer to Meta's Responsible Use Guide.
