🚀 Bilingual English + German SQuAD2.0
We created a bilingual English and German training dataset for question answering by merging German SQuAD 2.0 (deQuAD 2.0) with SQuAD2.0. We then fine-tuned the bert-base-multilingual-cased model on this dataset for a bilingual QA downstream task.
🚀 Quick Start
The bilingual dataset combines German SQuAD 2.0 (deQuAD 2.0) with SQuAD2.0 and is used to fine-tune the bert-base-multilingual-cased model on a bilingual question-answering downstream task.
✨ Features
- Bilingual Dataset: German deQuAD 2.0 merged with English SQuAD 2.0 for bilingual question-answering training.
- Quality Assurance: Professional editors proofread the translated German texts, ensuring high-quality annotations.
- Model Fine-Tuning: The bert-base-multilingual-cased model is fine-tuned on the bilingual dataset.
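The README does not describe how the two datasets were merged. As a rough sketch only, assuming both deQuAD 2.0 and SQuAD2.0 are available in the SQuAD v2.0 JSON layout (a top-level `data` list of articles, each with `paragraphs` and `qas`), a merge could be as simple as concatenating the article lists. The function names and the toy records below are illustrative, not taken from the original project.

```python
import copy


def merge_squad_datasets(*datasets, version="v2.0"):
    """Concatenate the article lists of several SQuAD-format datasets."""
    merged = {"version": version, "data": []}
    for dataset in datasets:
        # Deep-copy so later edits to the merged set do not touch the inputs.
        merged["data"].extend(copy.deepcopy(dataset["data"]))
    return merged


def count_questions(dataset):
    """Count all questions, answerable or not, in a SQuAD-format dataset."""
    return sum(
        len(paragraph["qas"])
        for article in dataset["data"]
        for paragraph in article["paragraphs"]
    )


# Toy English record (assumed layout, not real SQuAD2.0 data).
english = {"version": "v2.0", "data": [{
    "title": "Harvard",
    "paragraphs": [{
        "context": "The Harvard Library comprises 79 individual libraries.",
        "qas": [{"id": "en-1", "question": "What comprises 79 libraries?",
                 "is_impossible": False,
                 "answers": [{"text": "Harvard Library", "answer_start": 4}]}],
    }],
}]}

# Toy German record with one answerable and one unanswerable question.
german = {"version": "v2.0", "data": [{
    "title": "Allianz Arena",
    "paragraphs": [{
        "context": "Die Allianz Arena ist ein Fußballstadion im Norden von München.",
        "qas": [
            {"id": "de-1", "question": "Wo befindet sich die Allianz Arena?",
             "is_impossible": False,
             "answers": [{"text": "Norden von München", "answer_start": 44}]},
            {"id": "de-2", "question": "Wer hat die Allianz Arena entworfen?",
             "is_impossible": True, "answers": []},
        ],
    }],
}]}

bilingual = merge_squad_datasets(english, german)
print(len(bilingual["data"]), count_questions(bilingual))
```

Shuffling and any language balancing are left to the training framework here; whether the original training run did either is not stated.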
📦 Installation
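The original README lists no installation steps. The usage example below depends on the Hugging Face transformers library and a deep learning backend such as PyTorch, which can typically be installed with pip install transformers torch.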
💻 Usage Examples
Basic Usage
from transformers import pipeline

# Load the fine-tuned bilingual model and its tokenizer from the Hugging Face Hub.
qa_pipeline = pipeline(
    "question-answering",
    model="deutsche-telekom/bert-multi-english-german-squad2",
    tokenizer="deutsche-telekom/bert-multi-english-german-squad2"
)

# One German and one English context, paired by position with the questions below.
contexts = ["Die Allianz Arena ist ein Fußballstadion im Norden von München und bietet bei Bundesligaspielen 75.021 Plätze, zusammengesetzt aus 57.343 Sitzplätzen, 13.794 Stehplätzen, 1.374 Logenplätzen, 2.152 Business Seats und 966 Sponsorenplätzen. In der Allianz Arena bestreitet der FC Bayern München seit der Saison 2005/06 seine Heimspiele. Bis zum Saisonende 2017 war die Allianz Arena auch Spielstätte des TSV 1860 München.",
            "Harvard is a large, highly residential research university. It operates several arts, cultural, and scientific museums, alongside the Harvard Library, which is the world's largest academic and private library system, comprising 79 individual libraries with over 18 million volumes. "]
questions = ["Wo befindet sich die Allianz Arena?",
             "What is the worlds largest academic and private library system?"]

# Returns one result dict per (question, context) pair.
qa_pipeline(context=contexts, question=questions)
Output
[{'score': 0.7290093898773193,
  'start': 44,
  'end': 62,
  'answer': 'Norden von München'},
 {'score': 0.7979822754859924,
  'start': 134,
  'end': 149,
  'answer': 'Harvard Library'}]
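The `start` and `end` fields in the output above are character offsets into the corresponding context, so slicing the context with them should reproduce the `answer` string. The short check below illustrates this with the two predictions shown above. The helper name `span_matches` is illustrative, and the contexts are shortened after the answer span to keep the example compact, which does not change the offsets.

```python
def span_matches(context, prediction):
    """Check that context[start:end] equals the predicted answer text."""
    return context[prediction["start"]:prediction["end"]] == prediction["answer"]


# Contexts shortened after the answer span; the offsets are unchanged.
german_context = (
    "Die Allianz Arena ist ein Fußballstadion im Norden von München "
    "und bietet bei Bundesligaspielen 75.021 Plätze."
)
english_context = (
    "Harvard is a large, highly residential research university. "
    "It operates several arts, cultural, and scientific museums, "
    "alongside the Harvard Library."
)

# Predictions copied from the output above.
predictions = [
    {"score": 0.7290093898773193, "start": 44, "end": 62,
     "answer": "Norden von München"},
    {"score": 0.7979822754859924, "start": 134, "end": 149,
     "answer": "Harvard Library"},
]

for context, prediction in zip([german_context, english_context], predictions):
    print(prediction["answer"], span_matches(context, prediction))
```

Offsets like these are handy for highlighting the answer inside the original passage in a user interface.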
📚 Documentation
Details of deQuAD 2.0
We auto-translated SQuAD2.0 into German. Professional editors were then hired to proofread the translated texts, correct mistakes, and double-check the answers, which polished the text and improved the annotation quality. The final German deQuAD dataset contains 130k training and 11k test samples.
Overview
| Property | Details |
|---|---|
| Model Type | bert-base-multilingual-cased |
| Language | German, English |
| Training Data | deQuAD2.0 + SQuAD2.0 training set |
| Evaluation Data | SQuAD2.0 test set; deQuAD2.0 test set |
| Infrastructure | 8x V100 GPU |
| Published | July 9th, 2021 |
Evaluation on English SQuAD2.0
HasAns_exact = 85.79622132253711
HasAns_f1 = 90.92004586077663
HasAns_total = 5928
NoAns_exact = 94.76871320437343
NoAns_f1 = 94.76871320437343
NoAns_total = 5945
exact = 90.28889076054915
f1 = 92.84713483219753
total = 11873
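The overall `exact` and `f1` values are consistent with a count-weighted average of the HasAns and NoAns subsets. The sketch below reproduces that arithmetic from the English numbers reported above; the function name `weighted_score` is illustrative and not part of any evaluation script.

```python
def weighted_score(has_ans_score, has_ans_total, no_ans_score, no_ans_total):
    """Combine two subset percentages into one, weighted by subset size."""
    total = has_ans_total + no_ans_total
    return (has_ans_score * has_ans_total + no_ans_score * no_ans_total) / total


# English SQuAD2.0 numbers copied from the report above.
overall_exact = weighted_score(85.79622132253711, 5928, 94.76871320437343, 5945)
overall_f1 = weighted_score(90.92004586077663, 5928, 94.76871320437343, 5945)

print(round(overall_exact, 2))  # 90.29
print(round(overall_f1, 2))     # 92.85
```

For unanswerable questions, exact match and F1 coincide (a prediction is either correctly empty or not), which is why `NoAns_exact` and `NoAns_f1` are identical.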
Evaluation on German deQuAD2.0
HasAns_exact = 63.80526406330638
HasAns_f1 = 72.47269140789888
HasAns_total = 5813
NoAns_exact = 82.0291893792861
NoAns_f1 = 82.0291893792861
NoAns_total = 5687
exact = 72.81739130434782
f1 = 77.19858740470603
total = 11500
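The NoAns rows cover SQuAD2.0-style questions that cannot be answered from the context. By default, the transformers question-answering pipeline always returns some span; passing `handle_impossible_answer=True` allows it to return an empty answer instead. The sketch below wraps that call in a small helper. Both `answer_or_none` and the offline stand-in `fake_qa` (including its made-up scores) are hypothetical; in practice `qa` would be a real pipeline object such as the `qa_pipeline` from the usage example.

```python
def answer_or_none(qa, question, context):
    """Return the predicted answer text, or None if the model abstains."""
    result = qa(question=question, context=context, handle_impossible_answer=True)
    answer = result["answer"].strip()
    return answer or None


def fake_qa(question, context, handle_impossible_answer=False):
    """Hypothetical stand-in for a real pipeline so the sketch runs offline."""
    target = "Harvard Library"
    if target in context and "library system" in question:
        start = context.index(target)
        return {"score": 0.8, "start": start, "end": start + len(target),
                "answer": target}
    if handle_impossible_answer:
        # An abstention is represented by an empty answer string.
        return {"score": 0.9, "start": 0, "end": 0, "answer": ""}
    # Without the flag, a span is always returned, even a poor one.
    return {"score": 0.01, "start": 0, "end": 7, "answer": context[:7]}


context = ("Harvard operates several museums, alongside the Harvard Library, "
           "which is the world's largest academic and private library system.")

print(answer_or_none(fake_qa, "What is the worlds largest academic and private library system?", context))
print(answer_or_none(fake_qa, "Wo befindet sich die Allianz Arena?", context))
```

Returning `None` for an abstention keeps downstream code from mistaking an empty string for a real answer.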
📄 License
The project is licensed under The MIT License. Copyright (c) 2021 Fang Xu, Deutsche Telekom AG