🚀 Named Entity Recognition for Ancient Greek
A pre-trained NER tagging model tailored for Ancient Greek, addressing the need for accurate named-entity recognition in ancient texts.
📦 Installation
The model is loaded through the Hugging Face `transformers` library:

```bash
pip install transformers
```
✨ Features
- Pretrained for NER tagging in Ancient Greek.
- Trained on multiple available annotated corpora in Ancient Greek.
📚 Documentation
Data
We trained the models on the available annotated corpora in Ancient Greek. There are only two sizeable annotated datasets in Ancient Greek, both recently released:
- The first, by Berti (2023), is a fully annotated text of Athenaeus' Deipnosophists, developed in the context of the Digital Athenaeus project.
- The second, by Foka et al. (2020), is a fully annotated text of Pausanias' Periegesis Hellados, developed in the context of the Digital Periegesis project.
In addition, we used smaller corpora annotated by students and scholars on Recogito:
- The Odyssey, annotated by Kemp (2021).
- A mixed corpus including excerpts from the Library attributed to Apollodorus and from Strabo’s Geography, annotated by Chiara Palladino.
- Book 1 of Xenophon’s Anabasis, created by Thomas Visser.
- Demosthenes’ Against Neaira, created by Rachel Milio.
Training Dataset

| Dataset | Person | Location | NORP | MISC |
|---------|--------|----------|------|------|
| Odyssey | 2,469 | 698 | 0 | 0 |
| Deipnosophists | 14,921 | 2,699 | 5,110 | 3,060 |
| Pausanias | 10,205 | 8,670 | 4,972 | 0 |
| Other Datasets | 3,283 | 2,040 | 1,089 | 0 |
| **Total** | 30,878 | 14,107 | 11,171 | 3,060 |
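As a quick sanity check, the Total row can be recomputed from the per-dataset rows (a minimal sketch; the counts are copied from the table above):

```python
# Entity counts per dataset, copied from the training table:
# (Person, Location, NORP, MISC)
counts = {
    "Odyssey": (2469, 698, 0, 0),
    "Deipnosophists": (14921, 2699, 5110, 3060),
    "Pausanias": (10205, 8670, 4972, 0),
    "Other Datasets": (3283, 2040, 1089, 0),
}

# Sum each column across datasets to reproduce the Total row.
totals = tuple(sum(col) for col in zip(*counts.values()))
print(totals)  # (30878, 14107, 11171, 3060)
```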
Validation Dataset

| Dataset | Person | Location | NORP | MISC |
|---------|--------|----------|------|------|
| Xenophon | 1,190 | 796 | 857 | 0 |
Results

| Class | Metric | Test | Validation |
|-------|-----------|--------|------------|
| LOC | precision | 83.33% | 88.66% |
| | recall | 81.27% | 88.94% |
| | f1 | 82.29% | 88.80% |
| MISC | precision | 83.25% | 0 |
| | recall | 81.21% | 0 |
| | f1 | 82.22% | 0 |
| NORP | precision | 88.71% | 94.76% |
| | recall | 90.76% | 94.50% |
| | f1 | 89.73% | 94.63% |
| PER | precision | 91.72% | 94.22% |
| | recall | 94.42% | 96.06% |
| | f1 | 93.05% | 95.13% |
| Overall | precision | 88.83% | 92.91% |
| | recall | 89.99% | 93.72% |
| | f1 | 89.41% | 93.32% |
| | accuracy | 97.50% | 98.87% |

(MISC scores on the validation set are 0 because the Xenophon validation text contains no MISC entities.)
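The reported f1 scores are the harmonic mean of precision and recall; as a minimal check with the rounded percentages above, the overall test-set values reproduce the tabulated f1:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Overall test-set scores from the results table.
print(round(f1(88.83, 89.99), 2))  # 89.41
```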
💻 Usage Examples
Basic Usage
This Colab notebook contains the necessary code to use the model.

```python
from transformers import pipeline

# Load the pretrained NER pipeline; aggregation_strategy="first" merges
# subword tokens into whole-word entity spans.
ner = pipeline("ner", model="UGARIT/grc-ner-xlmr", aggregation_strategy="first")
ner("ταῦτα εἴπας ὁ Ἀλέξανδρος παρίζει Πέρσῃ ἀνδρὶ ἄνδρα Μακεδόνα ὡς γυναῖκα τῷ λόγῳ · οἳ δέ , ἐπείτε σφέων οἱ Πέρσαι ψαύειν ἐπειρῶντο , διεργάζοντο αὐτούς .")
```
Output

```python
[{'entity_group': 'PER',
  'score': 0.9999428,
  'word': '',
  'start': 13,
  'end': 14},
 {'entity_group': 'PER',
  'score': 0.99994195,
  'word': 'Ἀλέξανδρος',
  'start': 14,
  'end': 24},
 {'entity_group': 'NORP',
  'score': 0.9087087,
  'word': 'Πέρσῃ',
  'start': 32,
  'end': 38},
 {'entity_group': 'NORP',
  'score': 0.97572577,
  'word': 'Μακεδόνα',
  'start': 50,
  'end': 59},
 {'entity_group': 'NORP',
  'score': 0.9993412,
  'word': 'Πέρσαι',
  'start': 104,
  'end': 111}]
```
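The pipeline output can be post-processed into a simple per-class index. The sketch below groups predicted spans by entity type, keeping only confident predictions (the sample `predictions` list and the 0.9 threshold are illustrative assumptions, not part of the model card):

```python
from collections import defaultdict

# Example predictions in the format returned by the pipeline above
# (shortened to the named spans for illustration).
predictions = [
    {"entity_group": "PER", "score": 0.9999, "word": "Ἀλέξανδρος"},
    {"entity_group": "NORP", "score": 0.9087, "word": "Πέρσῃ"},
    {"entity_group": "NORP", "score": 0.9757, "word": "Μακεδόνα"},
    {"entity_group": "NORP", "score": 0.9993, "word": "Πέρσαι"},
]

# Group entity surface forms by predicted class, discarding
# low-confidence spans below an arbitrary 0.9 threshold.
by_class = defaultdict(list)
for ent in predictions:
    if ent["score"] >= 0.9:
        by_class[ent["entity_group"]].append(ent["word"])

print(dict(by_class))
# {'PER': ['Ἀλέξανδρος'], 'NORP': ['Πέρσῃ', 'Μακεδόνα', 'Πέρσαι']}
```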
📄 License
The project is licensed under the MIT license.
📚 Citation

```bibtex
@inproceedings{palladino-yousef-2024-development,
    title = "Development of Robust {NER} Models and Named Entity Tagsets for {A}ncient {G}reek",
    author = "Palladino, Chiara and Yousef, Tariq",
    editor = "Sprugnoli, Rachele and Passarotti, Marco",
    booktitle = "Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lt4hala-1.11",
    pages = "89--97",
    abstract = "This contribution presents a novel approach to the development and evaluation of transformer-based models for Named Entity Recognition and Classification in Ancient Greek texts. We trained two models with annotated datasets by consolidating potentially ambiguous entity types under a harmonized set of classes. Then, we tested their performance with out-of-domain texts, reproducing a real-world use case. Both models performed very well under these conditions, with the multilingual model being slightly superior on the monolingual one. In the conclusion, we emphasize current limitations due to the scarcity of high-quality annotated corpora and to the lack of cohesive annotation strategies for ancient languages.",
}
```