Roberta Multilingual Medieval Ner
Model Overview
Model Features
Model Capabilities
Use Cases
đ roberta-multilingual-medieval-ner
This is a fine - tuned multilingual Roberta model for named entity recognition in medieval texts, capable of identifying locations and persons in flat and nested ways.
đ Quick Start
The model is intended to be used in a simple way:
import torch
from transformers import pipeline
pipe = pipeline("token-classification", model="magistermilitum/roberta-multilingual-medieval-ner")
results = list(map(pipe, list_of_sentences))
results =[[[y["entity"],y["word"], y["start"], y["end"]] for y in x] for x in results]
print(results)
⨠Features
- Recognize locations and persons in medieval texts in both flat and nested manners.
- Support multiple historical languages including medieval Latin, Old French, and Old Spanish.
đĻ Installation
No specific installation steps are provided in the original document, so this section is skipped.
đģ Usage Examples
Basic Usage
The following code shows how to use the model for named entity recognition:
import torch
from transformers import pipeline
pipe = pipeline("token-classification", model="magistermilitum/roberta-multilingual-medieval-ner")
results = list(map(pipe, list_of_sentences))
results =[[[y["entity"],y["word"], y["start"], y["end"]] for y in x] for x in results]
print(results)
Advanced Usage
The following snippet can transform model inferences to CONLL format using the BIO format:
class TextProcessor:
def __init__(self, filename):
self.filename = filename
self.sent_detector = nltk.data.load("tokenizers/punkt/english.pickle") #sentence tokenizer
self.sentences = []
self.new_sentences = []
self.results = []
self.new_sentences_token_info = []
self.new_sentences_bio = []
self.BIO_TAGS = []
self.stripped_BIO_TAGS = []
def read_file(self):
#Reading a txt file with one document per line.
with open(self.filename, 'r') as f:
text = f.read()
self.sentences = self.sent_detector.tokenize(text.strip())
def process_sentences(self): #We split long sentences as encoder has a 256 max-lenght. Sentences with les of 40 words will be merged.
for sentence in self.sentences:
if len(sentence.split()) < 40 and self.new_sentences:
self.new_sentences[-1] += " " + sentence
else:
self.new_sentences.append(sentence)
def apply_model(self, pipe):
self.results = list(map(pipe, self.new_sentences))
self.results=[[[y["entity"],y["word"], y["start"], y["end"]] for y in x] for x in self.results]
def tokenize_sentences(self):
for n_s in self.new_sentences:
tokens=n_s.split() # Basic tokenization
token_info = []
# Initialize a variable to keep track of character index
char_index = 0
# Iterate through the tokens and record start and end info
for token in tokens:
start = char_index
end = char_index + len(token) # Subtract 1 for the last character of the token
token_info.append((token, start, end))
char_index += len(token) + 1 # Add 1 for the whitespace
self.new_sentences_token_info.append(token_info)
def process_results(self): #merge subwords and BIO tags
for result in self.results:
merged_bio_result = []
current_word = ""
current_label = None
current_start = None
current_end = None
for entity, subword, start, end in result:
if subword.startswith("â"):
subword = subword[1:]
merged_bio_result.append([current_word, current_label, current_start, current_end])
current_word = "" ; current_label = None ; current_start = None ; current_end = None
if current_start is None:
current_word = subword ; current_label = entity ; current_start = start+1 ; current_end= end
else:
current_word += subword ; current_end = end
if current_word:
merged_bio_result.append([current_word, current_label, current_start, current_end])
self.new_sentences_bio.append(merged_bio_result[1:])
def match_tokens_with_entities(self): #match BIO tags with tokens
for i,ss in enumerate(self.new_sentences_token_info):
for word in ss:
for ent in self.new_sentences_bio[i]:
if word[1]==ent[2]:
if ent[1]=="L-PERS":
self.BIO_TAGS.append([word[0], "I-PERS", "B-LOC"])
break
else:
if "LOC" in ent[1]:
self.BIO_TAGS.append([word[0], "O", ent[1]])
else:
self.BIO_TAGS.append([word[0], ent[1], "O"])
break
else:
self.BIO_TAGS.append([word[0], "O", "O"])
def separate_dots_and_comma(self): #optional
signs=[",", ";", ":", "."]
for bio in self.BIO_TAGS:
if any(bio[0][-1]==sign for sign in signs) and len(bio[0])>1:
self.stripped_BIO_TAGS.append([bio[0][:-1], bio[1], bio[2]]);
self.stripped_BIO_TAGS.append([bio[0][-1], "O", "O"])
else:
self.stripped_BIO_TAGS.append(bio)
def save_BIO(self):
with open('output_BIO_a.txt', 'w', encoding='utf-8') as output_file:
output_file.write("TOKEN\tPERS\tLOCS\n"+"\n".join(["\t".join(x) for x in self.stripped_BIO_TAGS]))
# Usage:
processor = TextProcessor('my_docs_file.txt')
processor.read_file()
processor.process_sentences()
processor.apply_model(pipe)
processor.tokenize_sentences()
processor.process_results()
processor.match_tokens_with_entities()
processor.separate_dots_and_comma()
processor.save_BIO()
Direct Use Example
A sentence as: "Ego Radulfus de Francorvilla miles, notum facio tam presentibus cum futuris quod, cum Guillelmo Bateste militi de Miliaco"
Will be annotated in BIO format as:
('Ego', 'O', 'O')
('Radulfus', 'B-PERS')
('de', 'I-PERS', 'O')
('Francorvilla', 'I-PERS', 'B-LOC')
('miles', 'O')
(',', 'O', 'O')
('notum', 'O', 'O')
('facio', 'O', 'O')
('tam', 'O', 'O')
('presentibus', 'O', 'O')
('quam', 'O', 'O')
('futuris', 'O', 'O')
('quod', 'O', 'O')
(',', 'O', 'O')
('cum', 'O', 'O')
('Guillelmo', 'B-PERS', 'O')
('Bateste', 'I-PERS', 'O')
('militi', 'O', 'O')
('de', 'O', 'O')
('Miliaco', 'O', 'B-LOC')
đ Documentation
Model Details
This is a fine - tuned version of the multilingual Roberta model on medieval charters. The model is intended to recognize Locations and persons in medieval texts in a Flat and nested manner. The train dataset entails 8k annotated texts on medieval latin, Old french and Old Spanish from a period ranging from 11th to 15th centuries.
Model Information
Property | Details |
---|---|
Developed by | [Sergio Torres Aguilar] |
Model Type | [XLM - Roberta] |
Language(s) (NLP) | [Medieval Latin, Spanish, French] |
Finetuned from model [optional] | [Named Entity Recognition] |
Training Procedure
The model was fine - tuned during 5 epoch on the XML - Roberta - Large using a 5e - 5 Lr and a batch size of 16.
BibTeX
@inproceedings{aguilar2022multilingual,
title={Multilingual Named Entity Recognition for Medieval Charters Using Stacked Embeddings and Bert-based Models.},
author={Aguilar, Sergio Torres},
booktitle={Proceedings of the second workshop on language technologies for historical and ancient languages},
pages={119--128},
year={2022}
}
đ License
The model is released under the cc - by - nc - 4.0 license.
đĻ Model Index
- Name: roberta - multilingual - medieval - ner
- Results:
- Task: Named Entity Recognition
- Metrics:
- Precision: 98.01
- Recall: 97.08
đŠ Model Card Contact
[sergio.torres@uni.lu]






