roberta-multilingual-medieval-ner: a free, open-source model for named entity extraction from multilingual medieval documents

Roberta Multilingual Medieval Ner

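Developed by magistermilitum

A multilingual RoBERTa model fine-tuned for named entity recognition in medieval texts. It supports the analysis of historical documents in Latin, French, and Spanish.

Sequence labeling

Transformers

Multilingual #Medieval text NER #Multilingual historical documents #High-precision entity recognition

Downloads: 38

Release date: 4/24/2022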

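Model overview

This model is designed specifically to identify place names and person names in medieval charter texts. It recognizes both flat and nested entities and is suited to research on historical documents from the 11th to the 15th century.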

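Model features

Multilingual historical text support

Optimized specifically for Medieval Latin, Old French, and Old Spanish texts

High-precision entity recognition

Achieves 98.01% precision and 97.08% recall on the test dataset

Nested entity handling

Identifies nested named entity structures in text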

Model capabilities

Historical text entity recognition

Multilingual text processing

Nested entity detection

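Use cases

Historical research

Medieval charter analysis

Automatically extract the names of people, places, and institutions from historical documents

Build structured networks of relationships between historical figures

Digital humanities research

Support historians in large-scale digital analysis of documents

Improve the efficiency of historical document processing

Archive management

Digitization of old books and manuscripts

Automatically tag key entity information in old documents

Build searchable historical archive databases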

🚀 roberta-multilingual-medieval-ner

This model is a multilingual RoBERTa model fine-tuned to recognize places and persons in medieval documents. It supports texts in Medieval Latin, Old French, and Old Spanish.

🚀 Quick start

How to use the model

The model can be loaded through the transformers pipeline API, as in the following code.

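from transformers import pipeline

pipe = pipeline("token-classification", model="magistermilitum/roberta-multilingual-medieval-ner")

# Input sentences; this one is the Latin charter example annotated later on this page.
list_of_sentences = ["Ego Radulfus de Francorvilla miles, notum facio tam presentibus quam futuris quod, cum Guillelmo Bateste militi de Miliaco"]

# Run the model on each sentence, then keep the label, subword, and character offsets of each tagged token.
results = list(map(pipe, list_of_sentences))
results = [[[y["entity"], y["word"], y["start"], y["end"]] for y in x] for x in results]
print(results)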

✨ Key features

Recognizes places and persons in medieval texts, in both flat and nested form.
Trained on a dataset of 8,000 annotated texts in Medieval Latin, Old French, and Old Spanish, dating from the 11th to the 15th century.

📦 Installation

The model requires the transformers library with a PyTorch backend. Install both with the following command.

pip install transformers torch

💻 Usage examples

Basic usage

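from transformers import pipeline

pipe = pipeline("token-classification", model="magistermilitum/roberta-multilingual-medieval-ner")

# Input sentences; this one is the Latin charter example annotated later on this page.
list_of_sentences = ["Ego Radulfus de Francorvilla miles, notum facio tam presentibus quam futuris quod, cum Guillelmo Bateste militi de Miliaco"]

# Run the model on each sentence, then keep the label, subword, and character offsets of each tagged token.
results = list(map(pipe, list_of_sentences))
results = [[[y["entity"], y["word"], y["start"], y["end"]] for y in x] for x in results]
print(results)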

Advanced usage

The following code converts the model's inference output into CoNLL-style output with BIO tags.

import nltk  # used for sentence splitting; the punkt tokenizer data must be downloaded beforehand

class TextProcessor:
    def __init__(self, filename):
        self.filename = filename
        self.sent_detector = nltk.data.load("tokenizers/punkt/english.pickle") #sentence tokenizer
        self.sentences = []
        self.new_sentences = []
        self.results = []
        self.new_sentences_token_info = []
        self.new_sentences_bio = []
        self.BIO_TAGS = []
        self.stripped_BIO_TAGS = []

    def read_file(self):
        # Read a text file with one document per line.
        with open(self.filename, 'r', encoding='utf-8') as f:
            text = f.read()
        self.sentences = self.sent_detector.tokenize(text.strip())

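    def process_sentences(self):
        # The encoder has a maximum length of 256. Sentences with fewer than 40 words
        # are appended to the previous chunk; longer sentences start a new chunk.
        # Note that this method does not split sentences that exceed the limit.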
        for sentence in self.sentences:
            if len(sentence.split()) < 40 and self.new_sentences:
                self.new_sentences[-1] += " " + sentence
            else:
                self.new_sentences.append(sentence)

    def apply_model(self, pipe):
        self.results = list(map(pipe, self.new_sentences))
        self.results=[[[y["entity"],y["word"], y["start"], y["end"]] for y in x] for x in self.results]

    def tokenize_sentences(self):
        for n_s in self.new_sentences:
            tokens=n_s.split() # Basic tokenization
            token_info = []

            # Initialize a variable to keep track of character index
            char_index = 0
            # Iterate through the tokens and record start and end info
            for token in tokens:
                start = char_index
                end = char_index + len(token)  # Exclusive end offset (one past the token's last character)
                token_info.append((token, start, end))

                char_index += len(token) + 1  # Add 1 for the whitespace
            self.new_sentences_token_info.append(token_info)

    def process_results(self): #merge subwords and BIO tags
        for result in self.results:
            merged_bio_result = []
            current_word = ""
            current_label = None
            current_start = None
            current_end = None
            for entity, subword, start, end in result:
                if subword.startswith("▁"):
                    subword = subword[1:]
                    merged_bio_result.append([current_word, current_label, current_start, current_end])
                    current_word = "" ; current_label = None ; current_start = None ; current_end = None
                if current_start is None:
                    current_word = subword ; current_label = entity ; current_start = start+1 ; current_end= end
                else:
                    current_word += subword ; current_end = end
            if current_word:
                merged_bio_result.append([current_word, current_label, current_start, current_end])
            self.new_sentences_bio.append(merged_bio_result[1:])

    def match_tokens_with_entities(self): #match BIO tags with tokens
        for i,ss in enumerate(self.new_sentences_token_info):
            for word in ss:
                for ent in self.new_sentences_bio[i]:
                    if word[1]==ent[2]:
                        if ent[1]=="L-PERS":
                            self.BIO_TAGS.append([word[0], "I-PERS", "B-LOC"])
                            break
                        else:
                            if "LOC" in ent[1]:
                                self.BIO_TAGS.append([word[0], "O", ent[1]])
                            else:
                                self.BIO_TAGS.append([word[0], ent[1], "O"])
                            break
                else:
                    self.BIO_TAGS.append([word[0], "O", "O"])

    def separate_dots_and_comma(self): #optional
        signs=[",", ";", ":", "."]
        for bio in self.BIO_TAGS:
            if any(bio[0][-1]==sign for sign in signs) and len(bio[0])>1:
                self.stripped_BIO_TAGS.append([bio[0][:-1], bio[1], bio[2]])
                self.stripped_BIO_TAGS.append([bio[0][-1], "O", "O"])
            else:
                self.stripped_BIO_TAGS.append(bio)

    def save_BIO(self):
        with open('output_BIO_a.txt', 'w', encoding='utf-8') as output_file:
            output_file.write("TOKEN\tPERS\tLOCS\n"+"\n".join(["\t".join(x) for x in self.stripped_BIO_TAGS]))

# Usage:
processor = TextProcessor('my_docs_file.txt')
processor.read_file()
processor.process_sentences()
processor.apply_model(pipe)
processor.tokenize_sentences()
processor.process_results()
processor.match_tokens_with_entities()
processor.separate_dots_and_comma()
processor.save_BIO()
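
The two core steps of the class above, merging subword pieces into words and matching them to whitespace tokens by start offset, can be tried without downloading the model. The following is a minimal sketch on mock data: the helper names (`merge_subwords`, `whitespace_spans`) and the piece offsets are hypothetical, and real offsets depend on the tokenizer, which may be why the class above shifts the start by one.

```python
def merge_subwords(pieces):
    """Merge SentencePiece pieces into whole words.

    `pieces` holds (label, piece, start, end) tuples. A piece starting with
    "▁" opens a new word; the word keeps the label of its first piece.
    """
    words = []
    for label, piece, start, end in pieces:
        if piece.startswith("▁") or not words:
            words.append([piece.lstrip("▁"), label, start, end])
        else:
            words[-1][0] += piece   # extend the current word
            words[-1][3] = end      # and move its end offset
    return words


def whitespace_spans(sentence):
    """Return (token, start, end) for each whitespace-separated token."""
    spans, pos = [], 0
    for token in sentence.split():
        start = sentence.index(token, pos)
        end = start + len(token)
        spans.append((token, start, end))
        pos = end
    return spans


sentence = "Ego Radulfus miles"

# Mock pipeline output (assumption: offsets point at the first character of each word).
mock_pieces = [
    ("O", "▁Ego", 0, 3),
    ("B-PERS", "▁Radu", 4, 8),
    ("B-PERS", "lfus", 8, 12),
    ("O", "▁miles", 13, 18),
]

merged = merge_subwords(mock_pieces)
spans = whitespace_spans(sentence)

# Align by start offset; tokens without a matching word get the tag "O".
label_by_start = {start: label for _, label, start, _ in merged}
tagged = [(token, label_by_start.get(start, "O")) for token, start, _ in spans]
print(tagged)
```

Unlike the class above, `whitespace_spans` looks up each token's position in the original string, so it does not assume that tokens are separated by exactly one space.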

Direct example

The sentence below is annotated in BIO format. Each row gives the token, its PERS tag, and its LOC tag.

('Ego', 'O', 'O')
('Radulfus', 'B-PERS', 'O')
('de', 'I-PERS', 'O')
('Francorvilla', 'I-PERS', 'B-LOC')
('miles', 'O', 'O')
(',', 'O', 'O')
('notum', 'O', 'O')
('facio', 'O', 'O')
('tam', 'O', 'O')
('presentibus', 'O', 'O')
('quam', 'O', 'O')
('futuris', 'O', 'O')
('quod', 'O', 'O')
(',', 'O', 'O')
('cum', 'O', 'O')
('Guillelmo', 'B-PERS', 'O')
('Bateste', 'I-PERS', 'O')
('militi', 'O', 'O')
('de', 'O', 'O')
('Miliaco', 'O', 'B-LOC')
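
Because person and place tags sit in separate columns, nested entities can be read off one column at a time. The sketch below is one possible way to do that, not part of the model's own tooling: `extract_entities` is a hypothetical helper, and `rows` is a shortened copy of the example, with each row written as (token, PERS tag, LOC tag).

```python
rows = [
    ("Ego", "O", "O"),
    ("Radulfus", "B-PERS", "O"),
    ("de", "I-PERS", "O"),
    ("Francorvilla", "I-PERS", "B-LOC"),
    ("miles", "O", "O"),
    ("cum", "O", "O"),
    ("Guillelmo", "B-PERS", "O"),
    ("Bateste", "I-PERS", "O"),
    ("militi", "O", "O"),
    ("de", "O", "O"),
    ("Miliaco", "O", "B-LOC"),
]


def extract_entities(rows, column):
    """Collect entity strings from one BIO column (1 = PERS, 2 = LOC)."""
    entities, current = [], []
    for row in rows:
        tag = row[column]
        if tag.startswith("B-"):
            # A new entity begins; close the previous one if it is still open.
            if current:
                entities.append(" ".join(current))
            current = [row[0]]
        elif tag.startswith("I-") and current:
            current.append(row[0])
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


persons = extract_entities(rows, 1)
places = extract_entities(rows, 2)
print(persons)  # ['Radulfus de Francorvilla', 'Guillelmo Bateste']
print(places)   # ['Francorvilla', 'Miliaco']
```

Here the place name Francorvilla is nested inside the person name Radulfus de Francorvilla, which is why the token carries a tag in both columns.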

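📚 Documentation

Model details

This model is a multilingual RoBERTa model fine-tuned to recognize places and persons in medieval documents. The training dataset contains 8,000 annotated texts in Medieval Latin, Old French, and Old Spanish, dating from the 11th to the 15th century.

Training procedure

The model was fine-tuned from XLM-RoBERTa-Large for 5 epochs, with a learning rate of 5e-5 and a batch size of 16.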
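
For anyone attempting a similar fine-tuning run, the stated hyperparameters could be collected as below. This is only a sketch: the key names follow the transformers `TrainingArguments` parameters, reading the batch size as a per-device value is an assumption, and the original training script is not shown here.

```python
# Hyperparameters stated in the training procedure.
hyperparameters = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,  # assumption: 16 is the per-device batch size
    "num_train_epochs": 5,
}

# With transformers installed, these could be passed on as, for example:
#   TrainingArguments(output_dir="medieval-ner", **hyperparameters)
# where "medieval-ner" is a hypothetical output directory.
for name, value in hyperparameters.items():
    print(f"{name} = {value}")
```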

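Model information

Attribute	Details
Model type	XLM-RoBERTa
Developer	Sergio Torres Aguilar
Languages	Medieval Latin, Spanish, French
Fine-tuned from	XLM-RoBERTa-Large (for named entity recognition)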

Citation

@inproceedings{aguilar2022multilingual,
  title={Multilingual Named Entity Recognition for Medieval Charters Using Stacked Embeddings and Bert-based Models.},
  author={Aguilar, Sergio Torres},
  booktitle={Proceedings of the second workshop on language technologies for historical and ancient languages},
  pages={119--128},
  year={2022}
}