roberta-multilingual-medieval-ner: a free, open-source model for multilingual named entity recognition in medieval documents

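RoBERTa Multilingual Medieval NER

Developed by magistermilitum

A named entity recognition model for medieval texts, fine-tuned from multilingual RoBERTa. It supports the analysis of historical documents in Latin, French, and Spanish.

Sequence labeling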

Transformers

#Multilingual support #Medieval-text NER #Multilingual historical documents #High-precision entity recognition

Downloads: 38

Published: 4/24/2022

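Model Overview

The model is designed to identify place and person names in medieval charters. It supports both flat and nested entity recognition and is suited to research on historical documents from the 11th to the 15th century.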

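Model Features

Multilingual historical text support

Optimized specifically for medieval Latin, Old French, and Old Spanish texts.

High-precision entity recognition

Reaches 98.01% precision and 97.08% recall on the test set.

Nested entity handling

Recognizes nested named-entity structures in text.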

Model Capabilities

Historical text entity recognition

Multilingual text processing

Nested entity detection

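Use Cases

Historical research

Medieval charter analysis

Automatically extract person and place names from historical documents

Build structured networks of relationships between historical persons

Digital humanities research

Help historians carry out large-scale digital analysis of documents

Make the processing of historical documents more efficient

Archive management

Digitization of old books and manuscripts

Automatically tag key entity information in old documents

Build searchable databases of historical archives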

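🚀 Multilingual Medieval Named Entity Recognition Model (roberta-multilingual-medieval-ner)

This model is a multilingual RoBERTa model fine-tuned on medieval charters. It identifies places and persons in medieval texts in both flat and nested form. The training dataset contains 8,000 annotated texts in medieval Latin, Old French, and Old Spanish, covering the 11th to the 15th century.

🚀 Quick Start

Using the model is straightforward: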

import torch
from transformers import pipeline

pipe = pipeline("token-classification", model="magistermilitum/roberta-multilingual-medieval-ner")

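# Example input: replace with your own list of sentences.
list_of_sentences = ["Ego Radulfus de Francorvilla miles, notum facio tam presentibus quam futuris quod, cum Guillelmo Bateste militi de Miliaco"]

# Run the model on each sentence, then keep only the label, subword, and character offsets.
results = list(map(pipe, list_of_sentences))
results = [[[y["entity"], y["word"], y["start"], y["end"]] for y in x] for x in results]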
print(results)
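
The reshaping step in the snippet above can be tried without downloading the model. The following is a minimal sketch that applies the same list comprehension to hand-written predictions. The helper name `to_rows` and the sample dictionaries (labels, scores, offsets) are hypothetical and only imitate the shape of the pipeline's output; real predictions will differ.

```python
from typing import Any, Dict, List


def to_rows(predictions: List[Dict[str, Any]]) -> List[List[Any]]:
    """Keep only label, subword, and character offsets for each prediction."""
    return [[p["entity"], p["word"], p["start"], p["end"]] for p in predictions]


# Hypothetical predictions for one sentence. The values are invented for
# illustration and were NOT produced by the model.
fake_output = [
    {"entity": "B-PERS", "score": 0.99, "index": 2, "word": "▁Radulf", "start": 3, "end": 10},
    {"entity": "I-PERS", "score": 0.98, "index": 3, "word": "us", "start": 10, "end": 12},
]

rows = to_rows(fake_output)
print(rows)
```

Each row holds one subword, so a word split into several subwords yields several rows that have to be merged before the labels can be aligned with whitespace tokens.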

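✨ Key Features

Multilingual support: covers medieval Latin, Old French, and Old Spanish, making it suitable for named entity recognition in multilingual medieval texts.
High accuracy: reaches 98.01% precision and 97.08% recall when identifying places and persons in medieval texts.
Flexible output format: inference results can be converted to CoNLL format for downstream processing.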
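
The precision and recall figures above can be combined into an F1 score, which is their harmonic mean. The source does not report F1, so the value computed below is derived here from the two published numbers and should be read as an approximation, not as a reported result.

```python
# Published figures from the model description, as fractions.
precision = 0.9801
recall = 0.9708

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9754"
```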

💻 Usage Examples

Advanced usage

The following code shows how to convert the model's inference results to CoNLL format:

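import nltk  # Needs the NLTK Punkt data (e.g. nltk.download("punkt")); the resource name may differ across NLTK versions.

class TextProcessor:
    def __init__(self, filename):
        self.filename = filename
        self.sent_detector = nltk.data.load("tokenizers/punkt/english.pickle")  # sentence tokenizer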
        self.sentences = []
        self.new_sentences = []
        self.results = []
        self.new_sentences_token_info = []
        self.new_sentences_bio = []
        self.BIO_TAGS = []
        self.stripped_BIO_TAGS = []

    def read_file(self):
        # Read a txt file with one document per line.
        with open(self.filename, 'r', encoding='utf-8') as f:
            text = f.read()
        self.sentences = self.sent_detector.tokenize(text.strip())

    def process_sentences(self):
        # The encoder has a maximum length of 256 tokens.
        # Sentences with fewer than 40 words are merged into the previous chunk.
        for sentence in self.sentences:
            if len(sentence.split()) < 40 and self.new_sentences:
                self.new_sentences[-1] += " " + sentence
            else:
                self.new_sentences.append(sentence)

    def apply_model(self, pipe):
        self.results = list(map(pipe, self.new_sentences))
        self.results=[[[y["entity"],y["word"], y["start"], y["end"]] for y in x] for x in self.results]

    def tokenize_sentences(self):
        for n_s in self.new_sentences:
            tokens=n_s.split() # Basic tokenization
            token_info = []

            # Initialize a variable to keep track of character index
            char_index = 0
            # Iterate through the tokens and record start and end info
            for token in tokens:
                start = char_index
                end = char_index + len(token)  # Exclusive end offset of the token
                token_info.append((token, start, end))

                char_index += len(token) + 1  # Add 1 for the whitespace
            self.new_sentences_token_info.append(token_info)

    def process_results(self): #merge subwords and BIO tags
        for result in self.results:
            merged_bio_result = []
            current_word = ""
            current_label = None
            current_start = None
            current_end = None
            for entity, subword, start, end in result:
                if subword.startswith("▁"):
                    subword = subword[1:]
                    merged_bio_result.append([current_word, current_label, current_start, current_end])
                    current_word = "" ; current_label = None ; current_start = None ; current_end = None
                if current_start is None:
                    current_word = subword ; current_label = entity ; current_start = start+1 ; current_end= end
                else:
                    current_word += subword ; current_end = end
            if current_word:
                merged_bio_result.append([current_word, current_label, current_start, current_end])
            self.new_sentences_bio.append(merged_bio_result[1:])

    def match_tokens_with_entities(self): #match BIO tags with tokens
        for i,ss in enumerate(self.new_sentences_token_info):
            for word in ss:
                for ent in self.new_sentences_bio[i]:
                    if word[1]==ent[2]:
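                        # "L-PERS" is written as I-PERS in the person column and B-LOC in the location column (nested entity).
                        if ent[1]=="L-PERS":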
                            self.BIO_TAGS.append([word[0], "I-PERS", "B-LOC"])
                            break
                        else:
                            if "LOC" in ent[1]:
                                self.BIO_TAGS.append([word[0], "O", ent[1]])
                            else:
                                self.BIO_TAGS.append([word[0], ent[1], "O"])
                            break
                else:
                    self.BIO_TAGS.append([word[0], "O", "O"])

    def separate_dots_and_comma(self): #optional
        signs=[",", ";", ":", "."]
        for bio in self.BIO_TAGS:
            if bio[0][-1] in signs and len(bio[0])>1:
                self.stripped_BIO_TAGS.append([bio[0][:-1], bio[1], bio[2]])
                self.stripped_BIO_TAGS.append([bio[0][-1], "O", "O"])
            else:
                self.stripped_BIO_TAGS.append(bio)

    def save_BIO(self):
        with open('output_BIO_a.txt', 'w', encoding='utf-8') as output_file:
            output_file.write("TOKEN\tPERS\tLOCS\n"+"\n".join(["\t".join(x) for x in self.stripped_BIO_TAGS]))

# Usage:
processor = TextProcessor('my_docs_file.txt')
processor.read_file()
processor.process_sentences()
processor.apply_model(pipe)
processor.tokenize_sentences()
processor.process_results()
processor.match_tokens_with_entities()
processor.separate_dots_and_comma()
processor.save_BIO()
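
The merging rule in `process_sentences` is easier to see in isolation. The following is a minimal sketch of the same logic as a standalone function; the name `merge_short_sentences`, the `min_words` parameter, and the demo sentences are assumptions made for illustration.

```python
from typing import List


def merge_short_sentences(sentences: List[str], min_words: int = 40) -> List[str]:
    """Append every sentence shorter than min_words to the previous chunk."""
    merged: List[str] = []
    for sentence in sentences:
        if len(sentence.split()) < min_words and merged:
            merged[-1] += " " + sentence
        else:
            # Long sentences, and the very first sentence, start a new chunk.
            merged.append(sentence)
    return merged


# Demo input: one 45-word filler sentence surrounded by short ones.
long_sentence = " ".join(["verbum"] * 45)
demo = ["Ego Radulfus.", long_sentence, "Notum facio.", "Actum anno Domini."]

chunks = merge_short_sentences(demo)
print(len(chunks))               # prints 2
print(len(chunks[1].split()))    # prints 50
```

Note that this rule only merges: a chunk keeps growing as long as short sentences follow it, and long sentences are never split. Inputs that must stay under the encoder's length limit may need an extra splitting step.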

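Direct use example

For the sentence "Ego Radulfus de Francorvilla miles, notum facio tam presentibus quam futuris quod, cum Guillelmo Bateste militi de Miliaco", the model produces the following BIO annotation, with one column for persons and one for locations: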

('Ego', 'O', 'O')
('Radulfus', 'B-PERS', 'O')
('de', 'I-PERS', 'O')
('Francorvilla', 'I-PERS', 'B-LOC')
('miles', 'O', 'O')
(',', 'O', 'O')
('notum', 'O', 'O')
('facio', 'O', 'O')
('tam', 'O', 'O')
('presentibus', 'O', 'O')
('quam', 'O', 'O')
('futuris', 'O', 'O')
('quod', 'O', 'O')
(',', 'O', 'O')
('cum', 'O', 'O')
('Guillelmo', 'B-PERS', 'O')
('Bateste', 'I-PERS', 'O')
('militi', 'O', 'O')
('de', 'O', 'O')
('Miliaco', 'O', 'B-LOC')
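
Because persons and locations sit in separate columns, nested entities can be recovered by reading each column on its own. The following is a minimal sketch, assuming rows shaped like the tuples above; the function name `bio_to_spans` is hypothetical and the rows are a shortened copy of the example.

```python
from typing import List, Sequence


def bio_to_spans(rows: Sequence[Sequence[str]], column: int) -> List[str]:
    """Join the tokens of each B-/I- run found in the given tag column."""
    spans: List[str] = []
    current: List[str] = []
    for row in rows:
        tag = row[column]
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current:
                spans.append(" ".join(current))
            current = [row[0]]
        elif tag.startswith("I-") and current:
            current.append(row[0])
        else:
            # An "O" tag (or a stray I- tag) ends any open entity.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Shortened copy of the annotated example: (token, person tag, location tag).
rows = [
    ("Ego", "O", "O"),
    ("Radulfus", "B-PERS", "O"),
    ("de", "I-PERS", "O"),
    ("Francorvilla", "I-PERS", "B-LOC"),
    ("miles", "O", "O"),
    ("cum", "O", "O"),
    ("Guillelmo", "B-PERS", "O"),
    ("Bateste", "I-PERS", "O"),
    ("militi", "O", "O"),
    ("de", "O", "O"),
    ("Miliaco", "O", "B-LOC"),
]

persons = bio_to_spans(rows, column=1)
locations = bio_to_spans(rows, column=2)
print(persons)    # ['Radulfus de Francorvilla', 'Guillelmo Bateste']
print(locations)  # ['Francorvilla', 'Miliaco']
```

Here "Francorvilla" shows up both inside the person name and as a location of its own, which is the nested case the two-column layout is meant to capture.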

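📚 Detailed Documentation

Model information

Property	Details
Model developer	Sergio Torres Aguilar
Model type	XLM-RoBERTa
Supported languages (NLP)	Medieval Latin, Spanish, French
Fine-tuning task	Named entity recognition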

Training procedure

The model was fine-tuned from XLM-RoBERTa-Large for 5 epochs with a learning rate of 5e-5 and a batch size of 16.

BibTeX citation

@inproceedings{aguilar2022multilingual,
  title={Multilingual Named Entity Recognition for Medieval Charters Using Stacked Embeddings and Bert-based Models.},
  author={Aguilar, Sergio Torres},
  booktitle={Proceedings of the second workshop on language technologies for historical and ancient languages},
  pages={119--128},
  year={2022}
}