🚀 GerPT2
German large and small versions of GPT2, offering high-performance German language modeling.
🚀 Quick Start
GerPT2 provides German large and small versions of GPT2. You can access them through the following links:
- https://huggingface.co/benjamin/gerpt2
- https://huggingface.co/benjamin/gerpt2-large
For considerations on limitations and bias, refer to the GPT2 model card. For details on GPT2, see the GPT2 documentation.
✨ Features
The author evaluated GerPT2, GerPT2-large, and dbmdz/german-gpt2 on the CC-100 dataset and the German Wikipedia. The results are as follows:
| Model | CC-100 (PPL) | Wikipedia (PPL) |
|-------|--------------|-----------------|
| dbmdz/german-gpt2 | 49.47 | 62.92 |
| GerPT2 | 24.78 | 35.33 |
| GerPT2-large | 16.08 | 23.26 |
You can find the evaluation code in the evaluate.py script in the GerPT2 GitHub repository.
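The perplexities above are the exponentiated average cross-entropy of each model on held-out text. As a rough illustration only (not the repository's evaluate.py, whose exact setup may differ), perplexity can be computed with transformers like this; the model name and sample text are placeholders:

```python
# Minimal perplexity sketch; the actual evaluate.py in the GerPT2 repository may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "benjamin/gerpt2"  # or "benjamin/gerpt2-large", "dbmdz/german-gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "Ein kurzer deutscher Beispieltext."  # placeholder; the evaluation used CC-100 and Wikipedia
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy over the sequence
    loss = model(input_ids, labels=input_ids).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")  # PPL = exp(mean negative log-likelihood)
```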
💻 Usage Examples
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("benjamin/gerpt2-large")
model = AutoModelForCausalLM.from_pretrained("benjamin/gerpt2-large")

prompt = "<your prompt>"

# Generate a continuation of the prompt with the high-level pipeline API
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe(prompt)[0]["generated_text"])
Advanced Usage
import torch

max_length = 100  # maximum total length of the generated sequence in tokens; adjust as needed

# Prepend the EOS token as a start-of-sequence marker, then sample a continuation
output = model.generate(
    torch.tensor(
        [tokenizer.eos_token_id] + tokenizer.encode(prompt)
    ).unsqueeze(0),
    do_sample=True,
    bad_words_ids=[[0]],  # block token id 0 from being generated
    max_length=max_length,
)[0]
print(tokenizer.decode(output))
🔧 Technical Details
Training Details
GerPT2-large is trained on the entire German data from the CC-100 corpus, and its weights are initialized from the English GPT2 model.
GerPT2-large was trained with the following settings:
- A batch size of 256
- OneCycle learning rate with a maximum of 5e-3
- AdamW with a weight decay of 0.01
- For 2 epochs
The training took roughly 12 days on 8 TPUv3 cores.
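As a hedged illustration of those hyperparameters only (the real training loop lives in the repository's train.py and ran on TPUs), the optimizer and schedule could be set up in PyTorch roughly like this; steps_per_epoch is a hypothetical placeholder that depends on the corpus size and the batch size of 256:

```python
# Sketch of the stated hyperparameters (AdamW, weight decay 0.01, OneCycle LR up to 5e-3, 2 epochs).
# This is not the repository's training script; steps_per_epoch is a placeholder value.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2-large")  # English GPT2 weights as initialization (assumed gpt2-large here)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-3, weight_decay=0.01)

steps_per_epoch = 10_000  # hypothetical; the actual value depends on corpus size and the batch size of 256
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=5e-3,
    epochs=2,
    steps_per_epoch=steps_per_epoch,
)
```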
To train GerPT2-large, follow these steps. The scripts are located in the GitHub repository:
0. Download and unzip training data from http://data.statmt.org/cc-100/.
1. Train a tokenizer using prepare/train_tokenizer.py. The author used a random subset of 5% of the CC-100 data as training data for the tokenizer; a sketch of this step is shown below.
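A minimal sketch of such a tokenizer-training step with the tokenizers library follows; the file path, vocabulary size, and special tokens are assumptions, and the repository's prepare/train_tokenizer.py may differ:

```python
# Hedged sketch of training a GPT2-style byte-level BPE tokenizer; not the repository's script.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["cc100_de_subset.txt"],   # hypothetical path: ~5% random subset of the German CC-100 data
    vocab_size=50_257,               # assumption: a GPT2-sized vocabulary
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("gerpt2-tokenizer")  # writes vocab.json and merges.txt
```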
2. (Optionally) generate a German input embedding matrix with prepare/generate_aligned_wte.py. This script uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings. For example:
ĠMinde -> Ġleast
Ġjed -> Ġwhatsoever
flughafen -> Air
vermittlung -> employment
teilung -> ignment
ĠInterpretation -> Ġinterpretation
Ġimport -> Ġimported
hansa -> irl
genehmigungen -> exempt
ĠAuflist -> Ġlists
Ġverschwunden -> Ġdisappeared
ĠFlyers -> ĠFlyers
Kanal -> Channel
Ġlehr -> Ġteachers
Ġnahelie -> Ġconvenient
gener -> Generally
mitarbeiter -> staff
This approach helped a lot in a trial run, though a full comparison was not possible due to budget and time constraints. You can pass the WTE matrix via the wte_path argument to the training script; a sketch of the mapping idea is shown after this step. Credit to this blogpost for the idea of initializing GPT2 from English weights.
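The core idea can be sketched as follows: for each German token, look up its nearest English token in a shared (aligned) word-embedding space and copy that English token's row of the GPT2 input embedding matrix. All names below (the aligned_de/aligned_en lookups, the vocab dicts) are hypothetical, and the repository's prepare/generate_aligned_wte.py may work differently:

```python
# Hedged sketch of initializing a German WTE matrix from aligned word embeddings.
# aligned_de / aligned_en map tokens to vectors in a shared space (hypothetical inputs).
import numpy as np

def build_german_wte(german_vocab, english_vocab, english_wte, aligned_de, aligned_en):
    """german_vocab / english_vocab: token -> id dicts; english_wte: (|V_en|, d) embedding matrix."""
    en_tokens = [t for t in english_vocab if t in aligned_en]
    en_vecs = np.stack([aligned_en[t] for t in en_tokens])
    en_vecs /= np.linalg.norm(en_vecs, axis=1, keepdims=True)

    # Start from a random init and overwrite rows for tokens we can align
    new_wte = np.random.normal(0.0, 0.02, (len(german_vocab), english_wte.shape[1]))
    for de_token, de_id in german_vocab.items():
        if de_token not in aligned_de:
            continue  # keep the random row for tokens without an aligned vector
        v = aligned_de[de_token]
        v = v / np.linalg.norm(v)
        nearest = en_tokens[int(np.argmax(en_vecs @ v))]      # cosine-nearest English token
        new_wte[de_id] = english_wte[english_vocab[nearest]]  # e.g. "flughafen" -> "Air"
    return new_wte
```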
3. Tokenize the corpus using prepare/tokenize_text.py. This generates files for train and validation tokens in JSON Lines format (see the sketch below).
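For illustration, one JSON list of token ids per line of text could be written out like this; the file names are placeholders, and the actual prepare/tokenize_text.py may handle splitting, chunking, and validation data differently:

```python
# Hedged sketch of writing token ids as JSON Lines; not the repository's script.
import json
from tokenizers import ByteLevelBPETokenizer

# Load the tokenizer trained in step 1 (paths are hypothetical)
tokenizer = ByteLevelBPETokenizer("gerpt2-tokenizer/vocab.json", "gerpt2-tokenizer/merges.txt")

with open("de.txt") as src, open("train_tokens.jsonl", "w") as out:  # placeholder file names
    for line in src:
        ids = tokenizer.encode(line.strip()).ids
        out.write(json.dumps(ids) + "\n")
```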
4. Run the training script train.py. run.sh shows how this was executed for the full run with config configs/tpu_large.json.
📄 License
GerPT2 is licensed under the MIT License.
📚 Documentation
Citing
Please cite GerPT2 as follows:
@misc{Minixhofer_GerPT2_German_large_2020,
author = {Minixhofer, Benjamin},
doi = {10.5281/zenodo.5509984},
month = {12},
title = {{GerPT2: German large and small versions of GPT2}},
url = {https://github.com/bminixhofer/gerpt2},
year = {2020}
}
Acknowledgements
Thanks to Hugging Face for providing awesome tools and infrastructure.
Huge thanks to Artus Krohn-Grimberghe at LYTiQ for sponsoring the resources used for training, making this project possible.