🚀 bert-portuguese-ner
This model is a fine-tuned version of neuralmind/bert-base-portuguese-cased for Named Entity Recognition (NER) on Portuguese archival documents, and it performs strongly on the evaluation set.
🚀 Quick Start
The model achieves the following results on the evaluation set:
- Loss: 0.1140
- Precision: 0.9147
- Recall: 0.9483
- F1: 0.9312
- Accuracy: 0.9700
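A minimal inference sketch using the 🤗 Transformers pipeline API; the model id below is a placeholder (replace it with the actual Hub repo id or a local checkpoint path), and the example sentence is invented for illustration:

```python
from transformers import pipeline

# Placeholder model id -- replace with the actual Hub repo id or a local
# checkpoint path.
model_id = "bert-portuguese-ner"

# Token-classification pipeline; aggregation_strategy="simple" merges
# subword tokens back into whole entity spans.
ner = pipeline("token-classification", model=model_id, aggregation_strategy="simple")

# Invented example sentence, for illustration only.
text = "Manuel de Sousa nasceu em Braga em 1876 e trabalhou como notário."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```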
✨ Key Features
- Fine-tuned from a pretrained Portuguese BERT model for NER on Portuguese archival documents.
- Annotated labels: Date, Profession, Person, Place, Organization, covering a broad range of information-extraction needs.
📚 Documentation
Model Description
The model was fine-tuned for token classification (NER) on Portuguese archival documents. The annotated labels are: Date, Profession, Person, Place, Organization.
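The card lists the entity categories but not the exact tag strings or tagging scheme the checkpoint emits; a minimal sketch for inspecting them from the model config, again with a placeholder model id:

```python
from transformers import AutoConfig

# Placeholder id -- replace with the actual checkpoint.
config = AutoConfig.from_pretrained("bert-portuguese-ner")

# id2label maps class indices to tag strings; the exact tag names (and
# whether they follow a BIO scheme) are not stated in the card, so inspect
# them before post-processing predictions.
print(config.id2label)
```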
Dataset
All training and evaluation data is available at: http://ner.epl.di.uminho.pt/
Training Hyperparameters
The following hyperparameters were used during training (reproduced as a `TrainingArguments` sketch after this list):
- Learning rate: 2e-05
- Train batch size: 16
- Eval batch size: 16
- Seed: 42
- Optimizer: Adam (β1=0.9, β2=0.999, ε=1e-08)
- LR scheduler type: linear
- Number of epochs: 4
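A hedged reproduction of these settings as 🤗 Transformers `TrainingArguments`; the output directory is hypothetical, and dataset preparation plus the `Trainer` call are omitted:

```python
from transformers import TrainingArguments

# The hyperparameters listed above, expressed as 🤗 Trainer arguments.
training_args = TrainingArguments(
    output_dir="bert-portuguese-ner",  # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=4,
)
```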
Training Results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| No log | 1.0 | 192 | 0.1438 | 0.8917 | 0.9392 | 0.9148 | 0.9633 |
| 0.2454 | 2.0 | 384 | 0.1222 | 0.8985 | 0.9417 | 0.9196 | 0.9671 |
| 0.0526 | 3.0 | 576 | 0.1098 | 0.9150 | 0.9481 | 0.9312 | 0.9698 |
| 0.0372 | 4.0 | 768 | 0.1140 | 0.9147 | 0.9483 | 0.9312 | 0.9700 |
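The precision, recall, and F1 columns are entity-level scores of the kind typically computed with `seqeval` in standard token-classification setups, while accuracy is token-level; a toy sketch with invented tag names, for illustration only:

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy gold/predicted tag sequences in BIO format; the Portuguese tag names
# are invented for illustration.
references = [["B-Pessoa", "I-Pessoa", "O", "B-Local", "O"]]
predictions = [["B-Pessoa", "I-Pessoa", "O", "O", "O"]]

print("precision:", precision_score(references, predictions))  # entity-level
print("recall:   ", recall_score(references, predictions))
print("f1:       ", f1_score(references, predictions))
print("accuracy: ", accuracy_score(references, predictions))   # token-level
```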
Framework Versions
- Transformers 4.10.0.dev0
- Pytorch 1.9.0+cu111
- Datasets 1.10.2
- Tokenizers 0.10.3
Citation
@Article{make4010003,
AUTHOR = {Cunha, Luís Filipe and Ramalho, José Carlos},
TITLE = {NER in Archival Finding Aids: Extended},
JOURNAL = {Machine Learning and Knowledge Extraction},
VOLUME = {4},
YEAR = {2022},
NUMBER = {1},
PAGES = {42--65},
URL = {https://www.mdpi.com/2504-4990/4/1/3},
ISSN = {2504-4990},
ABSTRACT = {The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format, however, these data do not have any annotation, so it is not always easy to analyze their content. In this work, Named Entity Recognition solutions were created that allow the identification and classification of several named entities from the archival finding aids. These named entities translate into crucial information about their context and, with high confidence results, they can be used for several purposes, for example, the creation of smart browsing tools by using entity linking and record linking techniques. In order to achieve high result scores, we annotated several corpora to train our own Machine Learning algorithms in this context domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made available to the public with a developed web platform, NER@DI.},
DOI = {10.3390/make4010003}
}
📄 License
This project is licensed under the MIT License.