# Model Card for ehri-ner/xlm-roberta-large-ehri-ner-all
The xlm-roberta-large-ehri-ner-all model is fine-tuned for Holocaust-related Named Entity Recognition (NER) using a multilingual dataset, aiming to support Holocaust research and enhance the discoverability of relevant materials.
## Documentation

### Model Description
- Developed by: Dermentzi, M. & Scheithauer, H.
- Funded by: European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
- Language(s) (NLP): The model was fine-tuned on cs, de, en, fr, hu, nl, pl, sk, yi data but it may work for more languages due to the use of a multilingual base model (XLM-R) with cross-lingual transfer capabilities.
- License: EUPL-1.2
- Finetuned from model: FacebookAI/xlm-roberta-large
### Model Information

| Property | Details |
| --- | --- |
| Model Type | Token Classification |
| Training Data | ehri-ner/ehri-ner-all |
| Metrics (F1 Score) | 81.5% |
## Quick Start
The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable.
The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) using EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. The EHRI-NER dataset was built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. The results of our experiments show that, despite the relatively small dataset, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations in a multilingual experiment setup is 81.5%.
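A minimal loading sketch using the Hugging Face `transformers` token-classification pipeline (an assumption on our part: this card does not ship its own loading code, and `transformers` plus a PyTorch backend must be installed; the example sentence is illustrative):

```python
def build_ner_pipeline(model_id: str = "ehri-ner/xlm-roberta-large-ehri-ner-all"):
    """Load the fine-tuned model as a token-classification pipeline.

    aggregation_strategy="simple" merges B-/I- word pieces into whole
    entity spans, so callers receive (entity_group, word, score) records
    instead of raw sub-word predictions.
    """
    # Local import: requires the `transformers` library and a torch backend.
    from transformers import pipeline
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )

# Example usage (downloads the model weights on first run):
# ner = build_ner_pipeline()
# for ent in ner("In 1944, thousands were deported from Budapest to Auschwitz."):
#     print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```

The first call fetches roughly 2 GB of weights from the Hugging Face Hub, so the pipeline is worth constructing once and reusing across documents.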
## Features

- Multilingual Capability: Trained on multiple languages (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish), and may work for more due to the cross-lingual transfer capabilities of XLM-R.
- High F1 Score: Achieved an overall F1 score of 81.5% in a multilingual experiment setup, indicating good performance in named entity recognition.
## Technical Details

The dataset used to fine-tune this model comes from the EHRI Online Editions, a series of manually annotated digital scholarly editions. Although these editions were not originally intended for training NER models, they are considered a high-quality resource for this purpose.

The fine-tuned model has some limitations. It occasionally misclassifies entities as non-entity tokens, with I-GHETTO being the most frequently confused tag. It also struggles to extract multi-token entities such as I-CAMP, I-LOC, and I-ORG, which are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, likely due to their semantic similarity.
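The B-/I- labels above refer to the BIO tagging scheme, in which a multi-token entity is emitted as one B-* tag followed by I-* continuations. A small, self-contained sketch (plain Python, not code from the model or the EHRI pipeline) of how such token-level tags are typically collapsed into entity spans, and why a stray I- tag that does not continue an open entity breaks a span:

```python
def bio_to_spans(tokens, tags):
    """Collapse token-level BIO tags (e.g. B-CAMP, I-CAMP, O) into entity spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a fresh entity, closing any open one.
            if current:
                spans.append(current)
            current = [tag[2:], [tok]]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            # An I- tag extends the open entity only if the labels match.
            current[1].append(tok)
        else:
            # An "O" token, or an I- tag that does not continue the open
            # entity (the kind of inconsistency the analysis above describes).
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

tokens = ["Deported", "to", "the", "Warsaw", "ghetto", "."]
tags   = ["O", "O", "O", "B-GHETTO", "I-GHETTO", "O"]
print(bio_to_spans(tokens, tags))  # [('GHETTO', 'Warsaw ghetto')]
```

Under this scheme, confusing B-GHETTO with B-LOC changes the label of the whole span, while confusing an I- tag with a B- tag splits one multi-token entity into two, which matches the error patterns reported above.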
## License

This model is released under the EUPL-1.2 license.
## Uses

This model was developed for research purposes in the context of the EHRI-3 project. The aim was to determine whether a single model can recognize entities across different document types and languages in Holocaust-related texts.
The results show that the model's F1 score (81.5%) is high enough to consider further deployment. Once a stable model is achieved and approved by EHRI stakeholders, it is intended for use in an EHRI editorial pipeline: when text is fed into a tool backed by this model, potential named entities are automatically pre-annotated. This helps researchers and professional archivists detect entities faster and link them to relevant entries in the custom EHRI controlled vocabularies and authority sets. It can facilitate metadata enrichment of descriptions in the EHRI Portal, enhance their discoverability, and make it easier for EHRI to develop new Online Editions and organize research data.
## Limitations

- Dataset Limitation: The fine-tuning dataset comes from the EHRI Online Editions, which were not originally designed for training NER models.
- Model Performance: The model misclassifies certain entities and has difficulty with multi-token entities.
- Usage Scope: It is mainly envisioned for EHRI-related editorial and publishing pipelines and may not be suitable for other users or organizations.
## Recommendations

**Usage tip:** For more information, we encourage potential users to read the paper accompanying this model: Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. [https://hal.science/hal-04547222](https://hal.science/hal-04547222)
## Citation

**BibTeX:**

```bibtex
@inproceedings{dermentzi_repurposing_2024,
  address = {Torino, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  url = {https://hal.science/hal-04547222},
  abstract = {The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5\%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.},
  urldate = {2024-04-29},
  booktitle = {{LREC}-{COLING} 2024 - {Joint} {International} {Conference} on {Computational} {Linguistics}, {Language} {Resources} and {Evaluation}},
  publisher = {ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics (ICCL)},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  keywords = {Digital Editions, Holocaust Testimonies, Multilingual, Named Entity Recognition, Transfer Learning, Transformers},
}
```
**APA:**

Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. [https://hal.science/hal-04547222](https://hal.science/hal-04547222)