grc-ner-xlmr: a free, open-source model for recognizing persons, places, and other entities in Ancient Greek

Grc Ner Xlmr

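Developed by UGARIT

A pretrained NER tagging model for Ancient Greek that recognizes persons, places, ethnic/religious groups, and other entities

Sequence labeling

Transformers

License: MIT #AncientGreekNER #HistoricalTextAnalysis #MultiClassEntityRecognition

Downloads: 22

Released: 3/31/2024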

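Model overview

A Transformer-based named entity recognition and classification model for Ancient Greek, built specifically for tagging entities in Ancient Greek texts.

Model features

Multi-class entity recognition

Recognizes several entity types in Ancient Greek text: persons, locations, nationalities or religious or political groups (NORP), and an Other class.

High tagging accuracy

On the test set, F1 reaches 93.05% for persons and 89.41% overall; on the validation set the figures are 95.13% and 93.32%.

Diverse training data

Trained on annotated data from several Ancient Greek works, including Athenaeus' Deipnosophists, Pausanias' Periegesis Hellados, and the Odyssey.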

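Model capabilities

Ancient Greek text analysis

Named entity recognition

Entity classification

Use cases

Classical literature research

Entity annotation of classical texts

Automatically tags persons, places, and other entities in Ancient Greek documents

Helps researchers quickly analyze how entities are distributed in a text and how they relate to one another

Digital humanities projects

Provides automatic annotation support for projects such as the Digital Athenaeus and the Digital Periegesis

Speeds up the digitization of classical texts

Language teaching

Ancient Greek teaching aid

Helps students identify the key entities in a text

Makes language learning more efficient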
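
The entity-distribution analysis mentioned under the research use case can be sketched in a few lines of Python. This is a minimal sketch and not part of the model card: it assumes the list-of-dicts format that an aggregated `transformers` NER pipeline returns (keys `entity_group` and `word`). The helper name `entity_distribution` is hypothetical, and the sample records reuse labels and words from the output example shown later on this page.

```python
from collections import Counter


def entity_distribution(entities):
    """Count entities per label and per (label, surface form) pair.

    `entities` is assumed to be a list of dicts with 'entity_group'
    and 'word' keys, as returned by an aggregated NER pipeline.
    """
    by_label = Counter()
    by_mention = Counter()
    for ent in entities:
        word = ent["word"].strip()
        if not word:  # ignore spans that carry no text
            continue
        by_label[ent["entity_group"]] += 1
        by_mention[(ent["entity_group"], word)] += 1
    return by_label, by_mention


# Illustrative input: labels and words taken from this page's output example.
sample = [
    {"entity_group": "PER", "word": "Ἀλέξανδρος"},
    {"entity_group": "NORP", "word": "Πέρσῃ"},
    {"entity_group": "NORP", "word": "Μακεδόνα"},
    {"entity_group": "NORP", "word": "Πέρσαι"},
]

by_label, by_mention = entity_distribution(sample)
print(by_label.most_common())
```

Inflected forms such as Πέρσῃ and Πέρσαι are counted as separate mentions here; grouping them under one name would need lemmatization, which this sketch does not attempt.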

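🚀 Ancient Greek named entity recognition

This project provides a pretrained named entity recognition (NER) tagging model for Ancient Greek that identifies persons, places, and other entities in Ancient Greek text.

🚀 Quick start

To try the model:

Open the project's Colab notebook, which contains the code needed to use the model.
Run the following code example: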

from transformers import pipeline

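# Create a token-classification pipeline for NER. With 'first', subword tokens
# are merged into words and each word takes the label of its first token.
ner = pipeline('ner', model="UGARIT/grc-ner-xlmr", aggregation_strategy='first')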
ner("ταῦτα εἴπας ὁ Ἀλέξανδρος παρίζει Πέρσῃ ἀνδρὶ ἄνδρα Μακεδόνα ὡς γυναῖκα τῷ λόγῳ · οἳ δέ , ἐπείτε σφέων οἱ Πέρσαι ψαύειν ἐπειρῶντο , διεργάζοντο αὐτούς .")

Example output

[{'entity_group': 'PER',
  'score': 0.9999428,
  'word': '',
  'start': 13,
  'end': 14},
 {'entity_group': 'PER',
  'score': 0.99994195,
  'word': 'Ἀλέξανδρος',
  'start': 14,
  'end': 24},
 {'entity_group': 'NORP',
  'score': 0.9087087,
  'word': 'Πέρσῃ',
  'start': 32,
  'end': 38},
 {'entity_group': 'NORP',
  'score': 0.97572577,
  'word': 'Μακεδόνα',
  'start': 50,
  'end': 59},
 {'entity_group': 'NORP',
  'score': 0.9993412,
  'word': 'Πέρσαι',
  'start': 104,
  'end': 111}]

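✨ Key features

A pretrained NER model designed specifically for Ancient Greek.
Trained on the available annotated Ancient Greek corpora; it reaches 97.50% accuracy and 89.41% overall F1 on the test set.

📦 Installation

The model card gives no installation steps. The example code imports the Hugging Face transformers library, so that package and a backend such as PyTorch must be installed; the Colab notebook contains working code.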

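📚 Documentation

Data

We trained the models on the available annotated corpora in Ancient Greek. Only two sizeable annotated datasets exist for Ancient Greek, and both are currently being released:

The first, developed by Berti 2023, is a fully annotated text of Athenaeus' Deipnosophists, produced in the context of the Digital Athenaeus project.
The second, developed by Foka et al. 2020, is a fully annotated text of Pausanias' Periegesis Hellados, produced in the context of the Digital Periegesis project.

In addition, we used smaller corpora annotated by students and scholars on Recogito:

The Odyssey, annotated by Kemp 2021.
A mixed corpus of excerpts from the Library attributed to Apollodorus and from Strabo's Geography, annotated by Chiara Palladino.
Book 1 of Xenophon's Anabasis, created by Thomas Visser.
Demosthenes' Against Neaira, created by Rachel Milio.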

Training dataset

Dataset	Person	Location	NORP	Other
Odyssey	2469	698	0	0
Deipnosophists	14921	2699	5110	3060
Periegesis Hellados	10205	8670	4972	0
Other datasets	3283	2040	1089	0
Total	30878	14107	11171	3060
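
The column totals can be re-derived from the per-dataset rows, and the same numbers show how unevenly the classes are represented. This is an illustrative sketch that uses only the counts in the table above; the variable names are not from the source.

```python
# Entity counts per dataset, copied from the training table above.
# Column order: Person, Location, NORP, Other.
train_counts = {
    "Odyssey": (2469, 698, 0, 0),
    "Deipnosophists": (14921, 2699, 5110, 3060),
    "Periegesis Hellados": (10205, 8670, 4972, 0),
    "Other datasets": (3283, 2040, 1089, 0),
}
classes = ("Person", "Location", "NORP", "Other")

# Sum each column across datasets, then compute each class's share.
totals = tuple(sum(column) for column in zip(*train_counts.values()))
grand_total = sum(totals)

for name, count in zip(classes, totals):
    print(f"{name}: {count} ({count / grand_total:.1%})")
```

The sums match the Total row. Person entities make up slightly more than half of all training annotations, and the Other class comes only from the Deipnosophists.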

Validation dataset

Dataset	Person	Location	NORP	Other
Anabasis	1190	796	857	0

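Results

Class	Metric	Test	Validation
Location	Precision	83.33%	88.66%
	Recall	81.27%	88.94%
	F1	82.29%	88.80%
Other	Precision	83.25%	0
	Recall	81.21%	0
	F1	82.22%	0
NORP	Precision	88.71%	94.76%
	Recall	90.76%	94.50%
	F1	89.73%	94.63%
Person	Precision	91.72%	94.22%
	Recall	94.42%	96.06%
	F1	93.05%	95.13%
Overall	Precision	88.83%	92.91%
	Recall	89.99%	93.72%
	F1	89.41%	93.32%
	Accuracy	97.50%	98.87%

The validation set contains no entities of class Other, which is why that class scores 0 in the Validation column.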
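
F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so the F1 rows can be checked against the precision and recall rows. The sketch below does this for the test-set column. It is an illustrative consistency check, not the authors' evaluation code, and the function name `f1_score` is chosen here for clarity.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, test-set column of the table above.
test_rows = {
    "Location": (83.33, 81.27, 82.29),
    "Other": (83.25, 81.21, 82.22),
    "NORP": (88.71, 90.76, 89.73),
    "Person": (91.72, 94.42, 93.05),
    "Overall": (88.83, 89.99, 89.41),
}

for label, (p, r, reported) in test_rows.items():
    computed = f1_score(p, r)
    print(f"{label}: computed {computed:.2f}, reported {reported:.2f}")
```

Because the table's precision and recall are already rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit.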

Citation

@inproceedings{palladino-yousef-2024-development,
    title = "Development of Robust {NER} Models and Named Entity Tagsets for {A}ncient {G}reek",
    author = "Palladino, Chiara  and
      Yousef, Tariq",
    editor = "Sprugnoli, Rachele  and
      Passarotti, Marco",
    booktitle = "Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lt4hala-1.11",
    pages = "89--97",
    abstract = "This contribution presents a novel approach to the development and evaluation of transformer-based models for Named Entity Recognition and Classification in Ancient Greek texts. We trained two models with annotated datasets by consolidating potentially ambiguous entity types under a harmonized set of classes. Then, we tested their performance with out-of-domain texts, reproducing a real-world use case. Both models performed very well under these conditions, with the multilingual model being slightly superior on the monolingual one. In the conclusion, we emphasize current limitations due to the scarcity of high-quality annotated corpora and to the lack of cohesive annotation strategies for ancient languages.",
}