GerPT2: An Open-Source German Large Language Model - Trained on Multiple Datasets with Superior Performance Compared to Peers

Gerpt2

Developed by benjamin

GerPT2 is a large German language model based on the GPT2 architecture, trained on the German portion of the CC-100 corpus. It achieves lower perplexity than dbmdz/german-gpt2 on both CC-100 and German Wikipedia.

Tags: Large Language Model, German, Open Source, German Text Generation, Low Perplexity Model, Transfer Learning Optimization. License: MIT.


Release Time : 3/2/2022

Model Overview

German version of the GPT2 large model for German text generation. GerPT2-large reaches a perplexity of 16.08 on CC-100 and 23.26 on German Wikipedia.

Model Features

Outstanding German Performance

GerPT2-large reaches a perplexity of 16.08 on CC-100 and 23.26 on German Wikipedia, compared with 49.47 and 62.92 for dbmdz/german-gpt2 (lower is better).

English-to-German Semantic Mapping

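The optional prepare/generate_aligned_wte.py script semantically maps tokens between the English and German tokenizers using aligned word embeddings. This helped in a trial run, although no full comparison was carried out.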

Optimized Generation Control

Passing bad_words_ids=[[0]] to model.generate disallows the EOS token, which keeps the model from ending generation early.

Model Capabilities

German Text Generation

Context Understanding

Long Text Generation

Use Cases

Content Creation

German Article Generation

Generates coherent German articles based on prompts.

The model achieves low perplexity on CC-100 and German Wikipedia.

Language Research

German Language Model Research

Serves as a baseline model for German NLP research.

Achieves lower perplexity than dbmdz/german-gpt2 on CC-100 and German Wikipedia.

🚀 GerPT2

GerPT2 provides German large and small versions of GPT2 for text generation. Both are available at the following links:

https://huggingface.co/benjamin/gerpt2
https://huggingface.co/benjamin/gerpt2-large

Refer to the GPT2 model card for considerations regarding limitations and bias. Check the GPT2 documentation for detailed information about GPT2.

🚀 Quick Start

💻 Usage Examples

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2-large")

prompt = "<your prompt>"

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])

Advanced Usage

import torch

# maximum total length (prompt plus generated tokens); 100 is an arbitrary example value
max_length = 100

output = model.generate(
    # during training an EOS token was used to mark the beginning of each text,
    # so it can help to insert it at the start
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    # try setting bad_words_ids=[[0]] to disallow generating an EOS token; without this the model is
    # prone to ending generation early because a significant number of texts in the training corpus
    # are quite short
    bad_words_ids=[[0]],
    max_length=max_length,
)[0]
print(tokenizer.decode(output))
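
To illustrate why banning the EOS token prevents early termination, here is a toy sketch in plain Python. It is not the transformers implementation; the helper names and the four-token vocabulary are hypothetical. The idea is that a banned token's score is set to negative infinity before sampling, so its probability becomes zero.

```python
import math

def mask_banned_tokens(logits, banned_ids):
    """Return a copy of `logits` with the banned token ids set to -inf."""
    banned = set(banned_ids)
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of 4 tokens; id 0 plays the role of the EOS token (assumption).
logits = [2.0, 1.0, 0.5, 0.1]
probs = softmax(mask_banned_tokens(logits, banned_ids=[0]))
print(probs[0])  # 0.0, so EOS can never be sampled
```

Without the mask, token 0 would be the most likely choice in this toy example, which mirrors the early-stopping behaviour described in the comment above.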

✨ Features

Comparison to dbmdz/german-gpt2

I evaluated both GerPT2-large and the other German GPT2, dbmdz/german-gpt2, on the CC-100 dataset and on the German Wikipedia:

Model	CC-100 (PPL)	Wikipedia (PPL)
dbmdz/german-gpt2	49.47	62.92
GerPT2	24.78	35.33
GerPT2-large	16.08	23.26

See the script evaluate.py in the GerPT2 GitHub repository for the code.
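
For readers unfamiliar with the metric, the following is a minimal sketch of the standard perplexity definition: the exponential of the mean negative log-likelihood per token. It is not taken from evaluate.py, and how that script handles tokenization differences between models is not covered here. The function name and the example values are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical example: a model that assigns probability 1/16 to every token
# has a perplexity of about 16, i.e. it is as uncertain as a uniform choice
# among 16 tokens.
uniform_log_probs = [math.log(1 / 16)] * 10
ppl = perplexity(uniform_log_probs)
print(round(ppl, 2))
```

Lower values mean the model assigns higher probability to the evaluation text, which is why the smaller numbers in the table indicate the better model.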

🔧 Technical Details

Training details

GerPT2-large was trained on the entire German portion of the CC-100 corpus, with weights initialized from the English GPT2 model. GerPT2-large was trained with:

a batch size of 256
using a OneCycle learning rate schedule with a maximum of 5e-3
with AdamW with a weight decay of 0.01
for 2 epochs

Training took roughly 12 days on 8 TPUv3 cores.
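
As a rough illustration of a OneCycle schedule with the stated maximum of 5e-3, here is a plain-Python sketch. Only the maximum learning rate comes from the text above. The warm-up fraction, the two divisors and the cosine shape are assumptions that mirror the defaults of PyTorch's OneCycleLR, and the actual training script may have used different settings.

```python
import math

def _cos_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

def one_cycle_lr(step, total_steps, max_lr=5e-3, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` (0..total_steps) for a one-cycle schedule.

    pct_start, div_factor and final_div_factor are assumed values,
    not settings reported for GerPT2.
    """
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = max(1, int(pct_start * total_steps))
    if step <= warmup_end:
        # Phase 1: warm up from initial_lr to max_lr.
        return _cos_anneal(initial_lr, max_lr, step / warmup_end)
    # Phase 2: anneal from max_lr down to min_lr.
    pct = (step - warmup_end) / (total_steps - warmup_end)
    return _cos_anneal(max_lr, min_lr, pct)

total = 1000  # hypothetical number of optimizer steps
schedule = [one_cycle_lr(s, total) for s in range(total + 1)]
print(max(schedule))  # peak learning rate, reached at the end of warm-up
```

The single warm-up followed by a long decay lets training use a relatively large peak learning rate while still ending with very small updates.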

To train GerPT2-large, follow these steps. Scripts are located in the GitHub repository:

0. Download and unzip training data from http://data.statmt.org/cc-100/.
1. Train a tokenizer using prepare/train_tokenizer.py. As training data for the tokenizer I used a random subset of 5% of the CC-100 data.
2. (Optionally) generate a German input embedding matrix with prepare/generate_aligned_wte.py. This uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings. E.g.:

ĠMinde -> Ġleast
Ġjed -> Ġwhatsoever
flughafen -> Air
vermittlung -> employment
teilung -> ignment
ĠInterpretation -> Ġinterpretation
Ġimport -> Ġimported
hansa -> irl
genehmigungen -> exempt
ĠAuflist -> Ġlists
Ġverschwunden -> Ġdisappeared
ĠFlyers -> ĠFlyers
Kanal -> Channel
Ġlehr -> Ġteachers
Ġnahelie -> Ġconvenient
gener -> Generally
mitarbeiter -> staff

This helped a lot in a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it via wte_path to the training script. Credit to this blog post for the idea of initializing GPT2 from English weights.

3. Tokenize the corpus using prepare/tokenize_text.py. This generates files for train and validation tokens in JSON Lines format.
4. Run the training script train.py. run.sh shows how this was executed for the full run with config configs/tpu_large.json.
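
The optional embedding alignment from step 2 can be illustrated with a toy sketch. This is not the code in prepare/generate_aligned_wte.py. It assumes that each German token is matched to its nearest English token by cosine similarity in a shared (aligned) embedding space, and that the matched English row of the GPT2 input embedding matrix is copied as the German token's starting value. All vectors, function names and the nearest-neighbour rule are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_english_token(german_vec, english_vecs):
    """English token whose aligned embedding is closest to `german_vec`."""
    return max(english_vecs, key=lambda tok: cosine(german_vec, english_vecs[tok]))

def init_german_wte(german_vecs, english_vecs, english_wte):
    """Initialise each German row from the row of its nearest English token."""
    return {
        de_tok: list(english_wte[nearest_english_token(vec, english_vecs)])
        for de_tok, vec in german_vecs.items()
    }

# Hypothetical 2-d aligned word embeddings (real ones have hundreds of dimensions).
english_vecs = {"Channel": [1.0, 0.1], "staff": [0.1, 1.0]}
german_vecs = {"Kanal": [0.9, 0.2], "mitarbeiter": [0.2, 0.8]}
# Hypothetical rows of the English GPT2 input embedding matrix (WTE).
english_wte = {"Channel": [0.5, -0.3, 0.7], "staff": [-0.2, 0.4, 0.1]}

german_wte = init_german_wte(german_vecs, english_vecs, english_wte)
print(german_wte["Kanal"])  # [0.5, -0.3, 0.7], copied from "Channel"
```

Starting German tokens from semantically related English rows means the transformer layers, which were initialized from English GPT2, see inputs that resemble what they were trained on.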

📄 License

GerPT2 is licensed under the MIT License.

📚 Documentation

Citing

Please cite GerPT2 as follows:

@misc{Minixhofer_GerPT2_German_large_2020,
author = {Minixhofer, Benjamin},
doi = {10.5281/zenodo.5509984},
month = {12},
title = {{GerPT2: German large and small versions of GPT2}},
url = {https://github.com/bminixhofer/gerpt2},
year = {2020}
}

Acknowledgements

Thanks to Hugging Face for awesome tools and infrastructure. Huge thanks to Artus Krohn-Grimberghe at LYTiQ for making this possible by sponsoring the resources used for training.
