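🚀 Portuguese Part-of-Speech Tagger
This project fine-tunes the BERTimbau model on the MacMorpho corpus for part-of-speech (POS) tagging. After 10 epochs of training, it reaches an overall F1 score of 0.9826.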
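🚀 Quick start
Fine-tuning BERTimbau yields a high-accuracy POS tagger for Portuguese. The evaluation metrics and parameter settings are listed in the sections that follow.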
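BERT-style models such as BERTimbau tag WordPiece sub-tokens rather than whole words, so raw predictions usually need to be merged back into words. The following is a minimal sketch of that step, not the project's own code. It assumes predictions arrive as a list of dicts with `word` and `entity` keys (the shape returned by Hugging Face token-classification pipelines without aggregation); the function name `merge_wordpieces` and the sample predictions are hypothetical.

```python
def merge_wordpieces(predictions):
    """Merge WordPiece sub-tokens into words, keeping the first piece's tag."""
    words = []
    for pred in predictions:
        piece, tag = pred["word"], pred["entity"]
        if piece.startswith("##") and words:
            # Continuation piece: glue it onto the previous word, keep that word's tag.
            prev_word, prev_tag = words[-1]
            words[-1] = (prev_word + piece[2:], prev_tag)
        else:
            # First piece of a new word.
            words.append((piece, tag))
    return words


# Hypothetical predictions for illustration only (not real model output).
sample = [
    {"word": "O", "entity": "ART"},
    {"word": "paciente", "entity": "N"},
    {"word": "apresent", "entity": "V"},
    {"word": "##ou", "entity": "V"},
    {"word": "febre", "entity": "N"},
]

tagged = merge_wordpieces(sample)
print(tagged)
```

Keeping the tag of the first sub-token is one common convention; other schemes (for example, a majority vote over the pieces) are equally possible.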
📚 Detailed documentation
Evaluation metrics
| | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| accuracy | | | 0.98 | 33729 |
| macro avg | 0.96 | 0.95 | 0.95 | 33729 |
| weighted avg | 0.98 | 0.98 | 0.98 | 33729 |
F1: 0.9826 Accuracy: 0.9826
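The macro-averaged F1 (0.95) is lower than the weighted average (0.98). One plausible reading is that some infrequent tags score lower than the frequent ones, since the macro average weights every tag equally while the weighted average weights each tag by its support. The sketch below illustrates the two formulas. The per-tag scores are hypothetical toy values, because per-tag results are not listed here.

```python
def macro_and_weighted_f1(per_class):
    """per_class maps tag -> (f1, support). Returns (macro_f1, weighted_f1)."""
    f1_values = [f1 for f1, _ in per_class.values()]
    total_support = sum(support for _, support in per_class.values())
    # Macro: plain mean over tags, regardless of how often each tag occurs.
    macro = sum(f1_values) / len(f1_values)
    # Weighted: mean over tags, weighted by the number of tokens per tag.
    weighted = sum(f1 * support for f1, support in per_class.values()) / total_support
    return macro, weighted


# Hypothetical toy values, NOT the model's real per-tag results.
toy_scores = {"N": (0.99, 9000), "V": (0.98, 5000), "IN": (0.80, 50)}

macro, weighted = macro_and_weighted_f1(toy_scores)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

In the toy data, the rare tag pulls the macro average down noticeably while barely affecting the weighted average.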
Parameter settings
nclasses = 27
nepochs = 30
batch_size = 32
batch_status = 32
learning_rate = 1e-5
early_stop = 3
max_length = 200
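The settings allow up to 30 epochs (`nepochs`), while `early_stop = 3` suggests training can end sooner. The sketch below assumes `early_stop` is a patience counter, meaning training stops after that many consecutive epochs without improvement on a validation score. That interpretation is an assumption, and the function `epochs_run` and the validation scores are hypothetical.

```python
def epochs_run(val_scores, max_epochs=30, patience=3):
    """Return (epochs actually run, best score) under patience-based early stopping."""
    best = float("-inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                # Patience exhausted: stop before reaching max_epochs.
                return epoch, best
    return min(len(val_scores), max_epochs), best


# Hypothetical validation scores per epoch, for illustration only.
scores = [0.90, 0.93, 0.95, 0.96, 0.96, 0.955, 0.958]
stopped_at, best_score = epochs_run(scores, max_epochs=30, patience=3)
print(stopped_at, best_score)
```

With these toy scores, the best value appears at epoch 4 and the loop stops three epochs later, well before the 30-epoch limit.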
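POS tag descriptions
| Tag | Meaning |
|---|---|
| ADJ | Adjective |
| ADV | Adverb |
| ADV-KS | Subordinating connective adverb |
| ADV-KS-REL | Subordinating relative adverb |
| ART | Article |
| CUR | Currency |
| IN | Interjection |
| KC | Coordinating conjunction |
| KS | Subordinating conjunction |
| N | Noun |
| NPROP | Proper noun |
| NUM | Number |
| PCP | Participle |
| PDEN | Denotative word |
| PREP | Preposition |
| PROADJ | Adjectival pronoun |
| PRO-KS | Subordinating connective pronoun |
| PRO-KS-REL | Subordinating connective relative pronoun |
| PROPESS | Personal pronoun |
| PROSUB | Nominal pronoun |
| V | Verb |
| VAUX | Auxiliary verb |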
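For downstream use, the tag table can be kept as a lookup so that tagger output is readable. This is a minimal sketch: the names `TAG_DESCRIPTIONS` and `describe` are hypothetical, and the sample input is illustrative rather than real model output. Because `nclasses = 27` while the table lists 22 tags, the lookup falls back to a placeholder for any tag not in the table.

```python
# Tag -> description, mirroring the tag table in this section.
TAG_DESCRIPTIONS = {
    "ADJ": "Adjective",
    "ADV": "Adverb",
    "ADV-KS": "Subordinating connective adverb",
    "ADV-KS-REL": "Subordinating relative adverb",
    "ART": "Article",
    "CUR": "Currency",
    "IN": "Interjection",
    "KC": "Coordinating conjunction",
    "KS": "Subordinating conjunction",
    "N": "Noun",
    "NPROP": "Proper noun",
    "NUM": "Number",
    "PCP": "Participle",
    "PDEN": "Denotative word",
    "PREP": "Preposition",
    "PROADJ": "Adjectival pronoun",
    "PRO-KS": "Subordinating connective pronoun",
    "PRO-KS-REL": "Subordinating connective relative pronoun",
    "PROPESS": "Personal pronoun",
    "PROSUB": "Nominal pronoun",
    "V": "Verb",
    "VAUX": "Auxiliary verb",
}


def describe(tagged_words):
    """Attach a human-readable description to each (word, tag) pair."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
        for word, tag in tagged_words
    ]


# Illustrative input, not real model output.
for word, tag, meaning in describe([("O", "ART"), ("paciente", "N"), ("dorme", "V")]):
    print(f"{word}\t{tag}\t{meaning}")
```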
📖 Citation
@article{Schneider_postagger_2023,
place={Brasil},
title={Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese},
volume={15},
url={https://jhi.sbis.org.br/index.php/jhi-sbis/article/view/1086},
DOI={10.59681/2175-4411.v15.iEspecial.2023.1086},
abstractNote={Electronic Health Records are a valuable source of information to be extracted by means of natural language processing (NLP) tasks, such as morphosyntactic word tagging. Although there have been significant advances in health NLP, such as the Transformer architecture, languages such as Portuguese are still underrepresented. This paper presents taggers developed for Portuguese texts, fine-tuned using BioBERtpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and others in the literature, using authentic clinical narratives. Our clinical model achieved 0.8145 in accuracy compared to 0.7656 for the generic model. It also showed competitive results compared to models trained specifically with clinical texts, evidencing domain impact on the base model in NLP tasks.},
number={Especial},
journal={Journal of Health Informatics},
author={Schneider, Elisa Terumi Rubel and Gumiel, Yohan Bonescki and Oliveira, Lucas Ferro Antunes de and Montenegro, Carolina de Oliveira and Barzotto, Laura Rubel and Moro, Claudia and Pagano, Adriana and Paraiso, Emerson Cabrera},
year={2023},
month={jul.} }
❓ FAQ
If you have any questions, please open a GitHub issue in the NLP Portuguese POS-Tagger project.