# Model Card for ehri-ner/xlm-roberta-large-ehri-ner-all
The xlm-roberta-large-ehri-ner-all model is fine-tuned for Holocaust-related Named Entity Recognition (NER) using a multilingual dataset, aiming to support Holocaust research and enhance the discoverability of relevant materials.
## Documentation

### Model Description
- Developed by: Dermentzi, M. & Scheithauer, H.
- Funded by: European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
- Language(s) (NLP): The model was fine-tuned on cs, de, en, fr, hu, nl, pl, sk, yi data but it may work for more languages due to the use of a multilingual base model (XLM-R) with cross-lingual transfer capabilities.
- License: EUPL-1.2
- Finetuned from model: FacebookAI/xlm-roberta-large
### Model Information

| Property | Details |
| --- | --- |
| Model Type | Token Classification |
| Training Data | ehri-ner/ehri-ner-all |
| Metrics (F1 Score) | 81.5% |
## Quick Start
The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable.
The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) using EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. The EHRI-NER dataset was built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. The results of our experiments show that, despite the relatively small dataset, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations in a multilingual experiment setup is 81.5%.
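A minimal loading sketch using the Hugging Face `transformers` token-classification pipeline (an assumption on our part: this card does not ship its own loading code, and `transformers` plus a PyTorch backend must be installed; the example sentence is illustrative):

```python
def build_ner_pipeline(model_id: str = "ehri-ner/xlm-roberta-large-ehri-ner-all"):
    """Load the fine-tuned model as a token-classification pipeline.

    aggregation_strategy="simple" merges B-/I- word pieces into whole
    entity spans, so callers receive (entity_group, word, score) records
    instead of raw sub-word predictions.
    """
    # Local import: requires the `transformers` library and a torch backend.
    from transformers import pipeline
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )

# Example usage (downloads the model weights on first run):
# ner = build_ner_pipeline()
# for ent in ner("In 1944, thousands were deported from Budapest to Auschwitz."):
#     print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```

The first call fetches roughly 2 GB of weights from the Hugging Face Hub, so the pipeline is worth constructing once and reusing across documents.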
## Features

- Multilingual Capability: Trained on multiple languages (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish), and may work for more due to the cross-lingual transfer capabilities of XLM-R.
- High F1 Score: Achieved an overall F1 score of 81.5% in a multilingual experiment setup, indicating good performance in named entity recognition.
## Technical Details

The dataset used to fine-tune this model comes from the EHRI Online Editions, a series of manually annotated digital scholarly editions. Although these editions were not originally intended for training NER models, they are considered a high-quality resource for this purpose.

The fine-tuned model has some limitations. It occasionally misclassifies entities as non-entity tokens, with I-GHETTO being the most frequently confused tag. It also struggles to extract multi-token entities such as I-CAMP, I-LOC, and I-ORG, which are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, likely due to their semantic similarity.
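The B-/I- labels above refer to the BIO tagging scheme, in which a multi-token entity is emitted as one B-* tag followed by I-* continuations. A small, self-contained sketch (plain Python, not code from the model or the EHRI pipeline) of how such token-level tags are typically collapsed into entity spans, and why a stray I- tag that does not continue an open entity breaks a span:

```python
def bio_to_spans(tokens, tags):
    """Collapse token-level BIO tags (e.g. B-CAMP, I-CAMP, O) into entity spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a fresh entity, closing any open one.
            if current:
                spans.append(current)
            current = [tag[2:], [tok]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag extends the open entity only if the labels match.
            current[1].append(tok)
        else:
            # An "O" token, or an I- tag that does not continue the open
            # entity (the kind of inconsistency the analysis above describes).
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

tokens = ["Deported", "to", "the", "Warsaw", "ghetto", "."]
tags   = ["O", "O", "O", "B-GHETTO", "I-GHETTO", "O"]
print(bio_to_spans(tokens, tags))  # [('GHETTO', 'Warsaw ghetto')]
```

Under this scheme, confusing B-GHETTO with B-LOC changes the label of the whole span, while confusing an I- tag with a B- tag splits one multi-token entity into two, which matches the error patterns reported above.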
## License

This model is released under the EUPL-1.2 license.
## Uses

This model was developed for research purposes in the context of the EHRI-3 project. The aim was to determine whether a single model can recognize entities across different document types and languages in Holocaust-related texts.
The results show that the model's F1 score (81.5%) is high enough to consider further deployment. Once a stable model is achieved and approved by EHRI stakeholders, it is intended for use in an EHRI editorial pipeline: when text is fed into a tool backed by this model, potential named entities are automatically pre-annotated. This helps researchers and professional archivists detect entities faster and link them to relevant entries in the custom EHRI controlled vocabularies and authority sets. It can facilitate metadata enrichment of descriptions in the EHRI Portal, enhance their discoverability, and make it easier for EHRI to develop new Online Editions and organize research data.
## Limitations

- Dataset Limitation: The fine-tuning dataset comes from the EHRI Online Editions, which were not originally designed for training NER models.
- Model Performance: The model misclassifies certain entities and has difficulty with multi-token entities.
- Usage Scope: It is mainly envisioned for EHRI-related editorial and publishing pipelines and may not be suitable for other users or organizations.
## Recommendations

**Usage tip:** For more information, we encourage potential users to read the paper accompanying this model: Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. [https://hal.science/hal-04547222](https://hal.science/hal-04547222)
## Citation

**BibTeX:**

```bibtex
@inproceedings{dermentzi_repurposing_2024,
  address = {Torino, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  url = {https://hal.science/hal-04547222},
  abstract = {The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5\%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.},
  urldate = {2024-04-29},
  booktitle = {{LREC}-{COLING} 2024 - {Joint} {International} {Conference} on {Computational} {Linguistics}, {Language} {Resources} and {Evaluation}},
  publisher = {ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics (ICCL)},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  keywords = {Digital Editions, Holocaust Testimonies, Multilingual, Named Entity Recognition, Transfer Learning, Transformers},
}
```
**APA:**

Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. [https://hal.science/hal-04547222](https://hal.science/hal-04547222)