GottBERT: A pure German language model
GottBERT is the first German-only RoBERTa model, pre-trained on the German portion of the first released OSCAR dataset. It aims to improve natural language processing (NLP) performance for German across tasks such as Named Entity Recognition (NER), text classification, and natural language inference (NLI). GottBERT is released in two versions, a base model and a large model, both tailored to German-language tasks.
| Property | Details |
|---|---|
| Model Type | RoBERTa |
| Language | German |
| Base Model | 12 layers, 125 million parameters |
| Large Model | 24 layers, 355 million parameters |
| License | MIT |
This model was introduced in the paper "GottBERT: a pure German Language Model" (Scheible et al., EMNLP 2024).
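Assuming the checkpoint is available on the Hugging Face Hub, it can be used with the `transformers` fill-mask pipeline. The sketch below uses a placeholder model id; substitute the Hub id of the GottBERT checkpoint you actually want to use:

```python
# Minimal usage sketch for a GottBERT checkpoint with transformers.
# NOTE: "TUM/GottBERT_base_last" is a placeholder model id; replace it with
# the Hub id of the checkpoint you want to use.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="TUM/GottBERT_base_last")

# RoBERTa-style models use "<mask>" as the mask token.
for prediction in fill_mask("Die Hauptstadt von Deutschland ist <mask>."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```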
Documentation
Pretraining Details
- Corpus: German portion of the OSCAR dataset (Common Crawl).
- Data Size:
  - Unfiltered: 145GB (~459 million documents)
  - Filtered: 121GB (~382 million documents)
- Preprocessing: Filtering involved correcting encoding errors (e.g., erroneous umlauts), removing spam, and discarding non-German documents via language detection and syntactic filtering.
Filtering Metrics
- Stopword Ratio: Detects spam and meaningless content.
- Punctuation Ratio: Detects abnormal punctuation patterns.
- Upper Token Ratio: Identifies documents with excessive uppercase tokens (often noisy content); a sketch of these heuristics follows below.
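The exact thresholds used for corpus filtering are not given here, so the following is only a minimal sketch of how such ratio-based heuristics can be computed; the stopword list, tokenization, and cutoff values are illustrative assumptions, not the values used to build GottBERT's corpus:

```python
# Illustrative document-filtering heuristics: stopword, punctuation, and
# upper-token ratios. The stopword list and cutoffs are assumptions for this
# sketch, not the values used for the GottBERT training corpus.
import string

GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "in", "den", "von", "zu", "mit"}

def ratios(text: str) -> dict:
    tokens = text.split()
    if not tokens:
        return {"stopword": 0.0, "punctuation": 0.0, "upper": 0.0}
    stopword = sum(t.lower().strip(string.punctuation) in GERMAN_STOPWORDS for t in tokens) / len(tokens)
    punctuation = sum(c in string.punctuation for c in text) / len(text)
    upper = sum(t.isupper() for t in tokens) / len(tokens)
    return {"stopword": stopword, "punctuation": punctuation, "upper": upper}

def keep_document(text: str) -> bool:
    """Keep documents that look like natural German text (hypothetical cutoffs)."""
    r = ratios(text)
    return r["stopword"] >= 0.05 and r["punctuation"] <= 0.25 and r["upper"] <= 0.30

print(keep_document("Die Katze sitzt auf der Matte und schaut aus dem Fenster."))  # True
print(keep_document("!!! GRATIS GEWINN !!! KLICK HIER !!!"))                       # False
```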
Training Configuration
- Framework: Fairseq
- Hardware:
  - Base Model: 256 TPUv3 pod / 128 TPUv4 pod
  - Large Model: 128 TPUv4 pod
- Training Time:
  - Base Model: 1.2 days
  - Large Model: 5.7 days
- Batch Size: 8k sequences
- Learning Rate:
  - Base: Peak LR = 0.0004
  - Large: Peak LR = 0.00015
- Training Iterations: 100k steps with a 10k-step warm-up phase (a schedule sketch follows below)
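To make the learning-rate settings concrete, the sketch below implements a linear warm-up to the peak learning rate over the first 10k steps, followed by a linear decay to zero at step 100k; the decay shape is an illustrative assumption (RoBERTa-style pretraining in fairseq commonly uses a polynomial/linear decay):

```python
# Learning-rate schedule sketch using the values reported above: linear
# warm-up over 10k steps to the peak LR, then linear decay to zero at 100k
# steps. The decay shape is an illustrative assumption.
WARMUP_STEPS = 10_000
TOTAL_STEPS = 100_000
PEAK_LR = {"base": 4e-4, "large": 1.5e-4}

def learning_rate(step: int, peak_lr: float) -> float:
    if step < WARMUP_STEPS:
        return peak_lr * step / WARMUP_STEPS
    remaining = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
    return peak_lr * max(remaining, 0.0)

for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, f"{learning_rate(step, PEAK_LR['base']):.6f}")
```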
Evaluation and Results
GottBERT was evaluated on various downstream tasks:
- NER: CoNLL 2003, GermEval 2014
- Text Classification: GermEval 2018 (coarse & fine), 10kGNAD
- NLI: German subset of XNLI
Metrics (a computation sketch follows below):
- NER and Text Classification: F1 score
- NLI: Accuracy
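The evaluation code itself is not part of this card. As a minimal sketch, entity-level F1 for NER can be computed with `seqeval`, and classification F1 and NLI accuracy with `scikit-learn`; the labels below are illustrative examples, not data from the benchmarks:

```python
# Sketch of the metric types used in the evaluation. The example labels are
# illustrative only; they are not taken from the benchmark datasets.
from seqeval.metrics import f1_score as entity_f1      # entity-level F1 for NER
from sklearn.metrics import accuracy_score, f1_score   # classification metrics

# NER (CoNLL 2003 / GermEval 2014 style): entity-level F1 over BIO tags.
gold_tags = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_tags = [["B-PER", "I-PER", "O", "O"]]
print("NER F1:", entity_f1(gold_tags, pred_tags))

# Text classification (GermEval 2018 coarse style): F1 score.
gold_labels = ["OFFENSE", "OTHER", "OTHER", "OFFENSE"]
pred_labels = ["OFFENSE", "OTHER", "OFFENSE", "OFFENSE"]
print("Classification F1:", f1_score(gold_labels, pred_labels, average="macro"))

# NLI (German XNLI subset): accuracy.
gold_nli = ["entailment", "neutral", "contradiction"]
pred_nli = ["entailment", "neutral", "neutral"]
print("NLI accuracy:", accuracy_score(gold_nli, pred_nli))
```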
Details:
- Bold values indicate the best-performing model within one architecture class (base, large); underlined values indicate the second best.
| Model | NLI Accuracy | GermEval 2014 F1 | CoNLL 2003 F1 | GermEval 2018 Coarse F1 | GermEval 2018 Fine F1 | 10kGNAD F1 |
|---|---|---|---|---|---|---|
| GottBERT_base_best | 80.82 | 87.55 | <u>85.93</u> | 78.17 | <u>53.30</u> | 89.64 |
| GottBERT_base_last | <u>81.04</u> | 87.48 | 85.61 | <u>78.18</u> | **53.92** | 90.27 |
| GottBERT_filtered_base_best | 80.56 | <u>87.57</u> | **86.14** | **78.65** | 52.82 | 89.79 |
| GottBERT_filtered_base_last | 80.74 | **87.59** | 85.66 | 78.08 | 52.39 | 89.92 |
| GELECTRA_base | **81.70** | 86.91 | 85.37 | 77.26 | 50.07 | 89.02 |
| GBERT_base | 80.06 | 87.24 | 85.16 | 77.37 | 51.51 | <u>90.30</u> |
| dbmdzBERT | 68.12 | 86.82 | 85.15 | 77.46 | 52.07 | **90.34** |
| GermanBERT | 78.16 | 86.53 | 83.87 | 74.81 | 47.78 | 90.18 |
| XLM-R_base | 79.76 | 86.14 | 84.46 | 77.13 | 50.54 | 89.81 |
| mBERT | 77.03 | 86.67 | 83.18 | 73.54 | 48.32 | 88.90 |
| GottBERT_large | 82.46 | 88.20 | <u>86.78</u> | 79.40 | 54.61 | 90.24 |
| GottBERT_filtered_large_best | 83.31 | 88.13 | 86.30 | 79.32 | 54.70 | 90.31 |
| GottBERT_filtered_large_last | 82.79 | 88.27 | 86.28 | 78.96 | 54.72 | 90.17 |
| GELECTRA_large | **86.33** | <u>88.72</u> | <u>86.78</u> | **81.28** | <u>56.17</u> | **90.97** |
| GBERT_large | <u>84.21</u> | <u>88.72</u> | **87.19** | <u>80.84</u> | **57.37** | <u>90.74</u> |
| XLM-R_large | 84.07 | **88.83** | 86.54 | 79.05 | 55.06 | 90.17 |
Model Architecture
- Base Model: 12 layers, 125M parameters, 52k token vocabulary.
- Large Model: 24 layers, 355M parameters, 52k token vocabulary (a configuration sketch follows below).
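For reference, the sketch below builds `transformers` configurations with these sizes; the hidden sizes, attention-head counts, and feed-forward widths are the standard RoBERTa base/large values, which is an assumption here, since only layer counts, parameter counts, and the 52k vocabulary are stated above:

```python
# Configuration sketch matching the sizes above. Hidden sizes, head counts,
# and feed-forward widths are assumed to be the standard RoBERTa values;
# only the layer counts and the 52k vocabulary are stated in this card.
from transformers import RobertaConfig, RobertaModel

configs = {
    "base": RobertaConfig(vocab_size=52_000, num_hidden_layers=12,
                          hidden_size=768, num_attention_heads=12,
                          intermediate_size=3072),
    "large": RobertaConfig(vocab_size=52_000, num_hidden_layers=24,
                           hidden_size=1024, num_attention_heads=16,
                           intermediate_size=4096),
}

for name, config in configs.items():
    model = RobertaModel(config)  # randomly initialized, only used to count parameters
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```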
Tokenizer
- Type: GPT-2 Byte-Pair Encoding (BPE)
- Vocabulary Size: 52k subword tokens
- Trained on: a 40GB subsample of the unfiltered German OSCAR corpus (a tokenizer-training sketch follows below).
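As a rough sketch, a byte-level (GPT-2 style) BPE vocabulary of this size can be trained with the Hugging Face `tokenizers` library. The corpus path and special tokens below are placeholders; the original vocabulary was built with the GPT-2 BPE tooling of the fairseq pipeline rather than this exact code:

```python
# Sketch: train a 52k byte-level BPE vocabulary in the GPT-2 style.
# The corpus file and special tokens are placeholders; GottBERT's actual
# vocabulary was produced with the GPT-2 BPE tooling used by fairseq.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar_de_subsample.txt"],  # placeholder path to the 40GB subsample
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("gottbert-tokenizer")  # writes vocab.json and merges.txt
print(tokenizer.encode("Ein reines deutsches Sprachmodell.").tokens)
```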
Limitations
- Filtered vs. Unfiltered Data: Filtering yields only minor improvements, which are not significant enough to justify it in all cases.
- Computation Limitations: Fixed memory allocation on TPUs requires processing the data as a single stream, unlike GPU training, which preserves document boundaries. Due to framework limitations, training was performed in 32-bit mode, which increased memory usage.
License
The model is released under the MIT license.
Fairseq Checkpoints
Get the fairseq checkpoints here.
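Once downloaded, a checkpoint can be loaded through fairseq's RoBERTa wrapper; the directory and file names below are placeholders for wherever you stored the files:

```python
# Sketch: load a downloaded GottBERT fairseq checkpoint for feature extraction.
# The checkpoint directory and file name are placeholders.
from fairseq.models.roberta import RobertaModel

gottbert = RobertaModel.from_pretrained(
    "checkpoints/gottbert-base",   # placeholder directory containing the checkpoint
    checkpoint_file="model.pt",    # placeholder checkpoint file name
)
gottbert.eval()  # disable dropout for inference

tokens = gottbert.encode("Ein reines deutsches Sprachmodell.")
features = gottbert.extract_features(tokens)
print(features.shape)  # (1, sequence_length, hidden_size)
```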
Citations
If you use GottBERT in your research, please cite the following paper:
@inproceedings{scheible-etal-2024-gottbert,
title = "{G}ott{BERT}: a pure {G}erman Language Model",
author = "Scheible, Raphael and
Frei, Johann and
Thomczyk, Fabian and
He, Henry and
Tippmann, Patric and
Knaus, Jochen and
Jaravine, Victor and
Kramer, Frank and
Boeker, Martin",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.1183",
pages = "21237--21250",
}