🚀 bert-portuguese-ner
This model is a fine-tuned version of neuralmind/bert-base-portuguese-cased for named entity recognition (NER) in Portuguese archival documents, and it performs strongly on the evaluation set.
🚀 Quick Start
The model achieves the following results on the evaluation set:
- Loss: 0.1140
- Precision: 0.9147
- Recall: 0.9483
- F1: 0.9312
- Accuracy: 0.9700
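As a sanity check, the reported F1 is consistent with the precision/recall pair via F1 = 2PR/(P + R):

```python
# Verify that the reported F1 matches the reported precision and recall.
precision, recall = 0.9147, 0.9483
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9312, matching the reported F1
```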
✨ Key Features
- Fine-tuned from a pretrained Portuguese BERT model for named entity recognition in Portuguese archival documents.
- Annotated labels: Date, Profession, Person, Place, Organization, covering a broad range of information-extraction needs.
📚 Documentation
Model description
This model was fine-tuned on a token-classification (NER) task over Portuguese archival documents. The annotated labels are: Date, Profession, Person, Place, Organization.
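Token-classification models usually emit per-token B-/I-/O tags that must be merged into entity spans. The sketch below shows one such post-processing step; it assumes this model uses standard B-/I- prefixes over the five labels listed above (the tag scheme is not stated in this card), and the `preds` input mimics the dict format returned by a Hugging Face token-classification pipeline without aggregation.

```python
# Sketch: merge token-level BIO predictions into entity spans.
# Assumption: tags are "B-<Label>" / "I-<Label>" / "O" over the labels
# Date, Profession, Person, Place, Organization.

def merge_bio(tokens):
    """tokens: list of dicts with 'word' and 'entity' (e.g. 'B-Person')."""
    entities = []
    current = None
    for tok in tokens:
        tag = tok["entity"]
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "text": tok["word"]}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["text"] += " " + tok["word"]
        else:  # "O" or an inconsistent I- tag closes the current entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities

preds = [
    {"word": "José", "entity": "B-Person"},
    {"word": "Carlos", "entity": "I-Person"},
    {"word": "nasceu", "entity": "O"},
    {"word": "em", "entity": "O"},
    {"word": "Braga", "entity": "B-Place"},
]
print(merge_bio(preds))
# [{'type': 'Person', 'text': 'José Carlos'}, {'type': 'Place', 'text': 'Braga'}]
```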
Dataset
All training and evaluation data are available at: http://ner.epl.di.uminho.pt/
Training hyperparameters
The following hyperparameters were used during training:
- learning rate: 2e-05
- train batch size: 16
- eval batch size: 16
- seed: 42
- optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- learning rate scheduler type: linear
- number of epochs: 4
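With 192 steps per epoch (see the results table below) and 4 epochs, training runs 768 optimizer steps, over which a linear scheduler decays the learning rate from 2e-05 toward 0. A minimal sketch, assuming no warmup steps (the card does not mention warmup):

```python
# Sketch of the linear learning-rate schedule implied by the hyperparameters.
base_lr = 2e-05
total_steps = 768  # 192 steps/epoch x 4 epochs, matching the results table

def linear_lr(step, base_lr=base_lr, total_steps=total_steps):
    # Linearly decay from base_lr at step 0 to 0 at total_steps.
    return base_lr * max(0.0, 1.0 - step / total_steps)

print(linear_lr(0))    # 2e-05
print(linear_lr(384))  # 1e-05, halfway through training
```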
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| No log | 1.0 | 192 | 0.1438 | 0.8917 | 0.9392 | 0.9148 | 0.9633 |
| 0.2454 | 2.0 | 384 | 0.1222 | 0.8985 | 0.9417 | 0.9196 | 0.9671 |
| 0.0526 | 3.0 | 576 | 0.1098 | 0.9150 | 0.9481 | 0.9312 | 0.9698 |
| 0.0372 | 4.0 | 768 | 0.1140 | 0.9147 | 0.9483 | 0.9312 | 0.9700 |
Framework versions
- Transformers 4.10.0.dev0
- Pytorch 1.9.0+cu111
- Datasets 1.10.2
- Tokenizers 0.10.3
Citation
@Article{make4010003,
AUTHOR = {Cunha, Luís Filipe and Ramalho, José Carlos},
TITLE = {NER in Archival Finding Aids: Extended},
JOURNAL = {Machine Learning and Knowledge Extraction},
VOLUME = {4},
YEAR = {2022},
NUMBER = {1},
PAGES = {42--65},
URL = {https://www.mdpi.com/2504-4990/4/1/3},
ISSN = {2504-4990},
ABSTRACT = {The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format, however, these data do not have any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high confidence results, they can be used for several purposes, for example, the creation of smart browsing tools by using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this context domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public with a developed web platform, NER@DI.},
DOI = {10.3390/make4010003}
}
📄 License
This project is released under the MIT License.