🚀 SpanMarker for GermEval 2014 NER
This is a SpanMarker model fine-tuned on the GermEval 2014 NER Dataset. It performs named entity recognition (NER) on German text, identifying the named entity classes annotated in the dataset.
✨ Features
- Fine-tuned on GermEval 2014: The model is specifically fine-tuned on the GermEval 2014 NER Dataset, which is based on German Wikipedia and News Corpora, covering over 31,000 sentences and 590,000 tokens.
- 12 Named Entity Classes: It recognizes 12 classes of named entities: the four main classes (`PER`, `LOC`, `ORG`, `OTH`) and their sub-classes with fine-grained labels.
- Accurate Evaluation: Evaluation is performed using both SpanMarker's internal evaluation code with `seqeval` and the official GermEval 2014 Evaluation Script.
📦 Installation
The model is used through the `span_marker` Python package, which can be installed from PyPI with `pip install span_marker` (the standard installation route for the SpanMarker library).
💻 Usage Examples
Basic Usage
```python
from span_marker import SpanMarkerModel

# Load the fine-tuned model from the Hugging Face Hub
model = SpanMarkerModel.from_pretrained("stefan-it/span-marker-gelectra-large-germeval14")

# Run inference on a German sentence
entities = model.predict("Jürgen Schmidhuber studierte ab 1983 Informatik und Mathematik an der TU München .")
```
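`predict` returns one dictionary per detected entity. A minimal sketch of inspecting the results, assuming the standard SpanMarker output fields `span`, `label`, and `score`:

```python
# Each entity is a dict holding the matched text, its label, and a confidence score
for entity in entities:
    print(entity["span"], entity["label"], round(entity["score"], 3))
```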
📚 Documentation
Dataset Introduction
The GermEval 2014 NER Shared Task uses a dataset with German named entity annotation. The data was sampled from German Wikipedia and News Corpora as a collection of citations. The NER annotation follows the NoSta-D guidelines, which extend the Tübingen Treebank guidelines. It uses four main NER categories with sub-structure and annotates nested entities (embeddings among NEs), such as [ORG FC Kickers [LOC Darmstadt]].
Named Entity Classes
12 classes of named entities are annotated and recognized. There are four main classes: PERson, LOCation, ORGanisation, and OTHer. Sub-classes are introduced with two fine-grained labels: `-deriv` marks derivations from NEs (e.g., "englisch"), and `-part` marks compounds that include a NE as a subsequence (e.g., "deutschlandweit").
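Combining the four main classes with the plain, `-deriv`, and `-part` variants yields the 12 entity classes. A small sketch of the resulting label set (the exact label spellings, e.g. `PERderiv`, follow the GermEval 2014 convention and are an assumption here):

```python
MAIN_CLASSES = ["PER", "LOC", "ORG", "OTH"]

# 4 main classes x {plain, -deriv, -part} = 12 entity classes
LABELS = [cls + suffix for cls in MAIN_CLASSES for suffix in ("", "deriv", "part")]
print(LABELS)
# ['PER', 'PERderiv', 'PERpart', 'LOC', 'LOCderiv', 'LOCpart', ...]
```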
Fine-Tuning
We use the same hyper-parameters as in the "German's Next Language Model" paper, with the GELECTRA Large model as the backbone.
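A minimal sketch of what such a fine-tuning setup could look like with the SpanMarker API. Only the GELECTRA Large backbone and the 5e-05 learning rate come from this document; the dataset identifier, batch size, and epoch count are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer

# Assumption: GermEval 2014 is available on the Hugging Face Hub as "germeval_14"
dataset = load_dataset("germeval_14")
labels = dataset["train"].features["ner_tags"].feature.names

# Initialize SpanMarker with GELECTRA Large as the backbone encoder
model = SpanMarkerModel.from_pretrained("deepset/gelectra-large", labels=labels)

args = TrainingArguments(
    output_dir="span-marker-gelectra-large-germeval14",
    learning_rate=5e-05,             # from the results table below
    per_device_train_batch_size=16,  # assumption, not from this document
    num_train_epochs=3,              # assumption, not from this document
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```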
Evaluation is carried out using SpanMarker's internal evaluation code with `seqeval` and the official GermEval 2014 Evaluation Script. A backup of the `nereval.py` script can be found here.
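For illustration, a minimal sketch of a `seqeval`-based span evaluation on BIO-tagged sequences (the tag sequences are invented examples; the actual evaluation code ships with SpanMarker):

```python
from seqeval.metrics import classification_report

# Gold and predicted tag sequences in BIO format (illustrative only)
y_true = [["B-PER", "I-PER", "O", "O", "B-ORG", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-ORG", "O"]]

# Reports per-class and overall precision, recall, and F1 over exact span matches
print(classification_report(y_true, y_pred, digits=4))
```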
We fine-tune 5 models and upload the model with the best F1-score on the development set. The results are as follows, with each cell reporting (development F1) / test F1:
| Model | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Avg. |
|-------|-------|-------|-------|-------|-------|------|
| GELECTRA Large (5e-05) | (89.99) / 89.08 | (89.55) / 89.23 | (89.60) / 89.10 | (89.34) / 89.02 | (89.68) / 88.80 | (89.63) / 89.05 |
The best model achieves a final test score of 89.08%:
1. Strict, Combined Evaluation (official):
   Accuracy: 99.26%; Precision: 89.01%; Recall: 89.16%; FB1: 89.08
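As a sanity check, FB1 is the harmonic mean of precision and recall: 2 × 89.01 × 89.16 / (89.01 + 89.16) ≈ 89.08.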
Scripts for training and evaluation are also available.
🔧 Technical Details
The model uses the SpanMarker architecture with GELECTRA Large as its backbone. The hyper-parameters follow the "German's Next Language Model" paper. Evaluation is based on `seqeval` and the official GermEval 2014 Evaluation Script.
📄 License
This project is licensed under the MIT license.
📊 Model Information