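🚀 (BERT base) Named Entity Recognition Model for the Portuguese Legal Domain
ner-legal-bert-base-cased-ptbr is a named entity recognition (NER) model, that is, a token classification model, for the Portuguese legal domain. It was obtained by fine-tuning the dominguesm/legal-bert-base-cased-ptbr model with an NER objective. The model is intended to support NLP research in the legal domain, computational law, and legal technology applications.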
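Model Information
| Property | Details |
| --- | --- |
| Model type | BERT base NER model for the Portuguese legal domain |
| Training data | Legal documents provided by the Brazilian Supreme Federal Court: 971932 training examples, 53996 validation examples, and 53997 test examples |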
Labels
The model uses the following labels, which were inspired by the LeNER_br dataset:
- PESSOA (person)
- ORGANIZACAO (organization)
- LOCAL (location)
- TEMPO (time)
- LEGISLACAO (legislation)
- JURISPRUDENCIA (case law)
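Token classification models usually expand entity types like these into per-token tags. The sketch below assumes a BIO (IOB2) scheme, with one `B-` and one `I-` tag per type plus `O`. Both the scheme and the id ordering are assumptions made for illustration; the authoritative mapping is the one stored in `model.config.id2label`.

```python
# Entity types listed in this README.
ENTITY_TYPES = ["PESSOA", "ORGANIZACAO", "LOCAL", "TEMPO", "LEGISLACAO", "JURISPRUDENCIA"]


def build_bio_labels(entity_types):
    """Return a BIO tag list: 'O' plus B-/I- variants for each entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # any following token of the same entity
    return labels


# Hypothetical id assignment; the real model may order its labels differently.
labels = build_bio_labels(ENTITY_TYPES)
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}
print(len(labels))
```

Under this assumption six entity types yield 13 tags in total.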
🚀 Quick Start
Inference
Basic usage
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "dominguesm/ner-legal-bert-base-cased-ptbr"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "Acrescento que não há de se falar em violação do artigo 114, § 3º, da Constituição Federal, posto que referido dispositivo revela-se impertinente, tratando da possibilidade de ajuizamento de dissídio coletivo pelo Ministério Público do Trabalho nos casos de greve em atividade essencial."

# Tokenize, truncating to the 512-token limit
inputs = tokenizer(input_text, max_length=512, truncation=True, return_tensors="pt")
tokens = inputs.tokens()

# Forward pass without gradient tracking; logits have shape (batch, seq_len, num_labels)
with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=2)

# Print each token with its predicted label
for token, prediction in zip(tokens, predictions[0].tolist()):
    print((token, model.config.id2label[prediction]))
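The loop above prints one label per WordPiece token. A minimal sketch of merging such token-level predictions into entity spans follows. It assumes BIO-style labels (`B-X`, `I-X`, `O`) and BERT's `##` continuation prefix. The helper `group_entities` is hypothetical and not part of transformers, and the tokens and labels in the example are hand-written for illustration, not real model output.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def group_entities(tokens, labels):
    """Merge (token, label) pairs into (entity_type, text) spans."""
    entities = []        # finished (entity_type, text) spans
    current_type = None  # type of the span being built, if any
    current_words = []   # words collected for that span

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##"):
            # Continuation piece: glue it to the last word of the open span.
            if current_words:
                current_words[-1] += token[2:]
            continue
        entity_type = label[2:] if label != "O" else None
        if label.startswith("B-") or entity_type != current_type:
            # Close the open span (if any) before starting a new one.
            if current_type is not None:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = entity_type, []
        if entity_type is not None:
            current_words.append(token)

    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Hand-written example; the split of "Ministério" into pieces is hypothetical.
example_tokens = ["[CLS]", "Minist", "##ério", "Público", "do", "Trabalho", "nos", "casos", "[SEP]"]
example_labels = ["O", "B-ORGANIZACAO", "I-ORGANIZACAO", "I-ORGANIZACAO",
                  "I-ORGANIZACAO", "I-ORGANIZACAO", "O", "O", "O"]
spans = group_entities(example_tokens, example_labels)
print(spans)
```

The label of the first word piece decides the label of the whole word; labels predicted for `##` continuation pieces are ignored, which is one common convention among several.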
Advanced usage
You can also run inference with a pipeline, but it appears to have problems handling the maximum length of the input sequence.
from transformers import pipeline

model_name = "dominguesm/ner-legal-bert-base-cased-ptbr"
ner = pipeline(
    "ner",
    model=model_name
)

# input_text is the string defined in the basic usage example
ner(input_text, aggregation_strategy="average")
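One possible workaround for the sequence-length limitation mentioned above is to split a long document into overlapping windows and run the pipeline on each window separately. The sketch below uses whitespace-separated words as a rough proxy for tokens. The helper `chunk_words` and its default sizes are hypothetical choices for illustration, not part of transformers.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 450-word text, used only to show the windowing.
long_text = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(long_text)
print(len(chunks))
```

Each chunk could then be passed to `ner(chunk, aggregation_strategy="average")`. Because one word can expand into several word pieces, `max_words` should stay well below 512, and entities that fall inside an overlap may be reported twice and need deduplication.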
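📦 Installation
The original documentation does not describe installation steps. Refer to the official installation guide of the transformers library.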
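The inference examples in this README import `transformers` and `torch`, both of which are typically installed with pip. A small sketch for checking whether they are importable in the current environment follows; the helper `missing_packages` is hypothetical and uses only the standard library.

```python
import importlib.util


def missing_packages(names):
    """Return the packages from names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by the inference examples in this README.
missing = missing_packages(["transformers", "torch"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```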
🔧 Technical Details
Hyperparameters
Batch size, learning rate, etc.
- per_device_batch_size = 64
- gradient_accumulation_steps = 2
- learning_rate = 2e-5
- num_train_epochs = 3
- weight_decay = 0.01
- optimizer = torch.optim.AdamW
- epsilon = 1e-08
- lr_scheduler_type = linear
Saving and loading the model
- save_total_limit = 3
- logging_steps = 1000
- eval_steps = logging_steps
- evaluation_strategy = 'steps'
- logging_strategy = 'steps'
- save_strategy = 'steps'
- save_steps = logging_steps
- load_best_model_at_end = True
- fp16 = True
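The settings above map naturally onto keyword arguments of `transformers.TrainingArguments`. The sketch below collects them in a plain dictionary so that it runs without a GPU. Mapping `per_device_batch_size` to `per_device_train_batch_size` and `epsilon` to `adam_epsilon` is an assumption, and `output_dir` is a hypothetical path. The optimizer is not listed because AdamW is the Trainer default.

```python
logging_steps = 1000

# Keyword arguments named after transformers.TrainingArguments parameters.
training_kwargs = {
    "output_dir": "ner-legal-bert-checkpoints",  # hypothetical path
    "per_device_train_batch_size": 64,           # assumed mapping of per_device_batch_size
    "gradient_accumulation_steps": 2,
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "adam_epsilon": 1e-8,                        # assumed mapping of epsilon
    "lr_scheduler_type": "linear",
    "save_total_limit": 3,
    "logging_strategy": "steps",
    "logging_steps": logging_steps,
    "evaluation_strategy": "steps",              # called eval_strategy in recent releases
    "eval_steps": logging_steps,
    "save_strategy": "steps",
    "save_steps": logging_steps,
    "load_best_model_at_end": True,
    "fp16": True,                                # mixed precision; needs a GPU
}


def best_model_settings_are_consistent(kwargs):
    """Check the constraints that load_best_model_at_end places on saving."""
    same_strategy = kwargs["save_strategy"] == kwargs["evaluation_strategy"]
    aligned_steps = kwargs["save_steps"] % kwargs["eval_steps"] == 0
    return same_strategy and aligned_steps


print(best_model_settings_are_consistent(training_kwargs))
```

With `load_best_model_at_end` enabled, checkpoints must be saved on the same schedule as evaluation, which is why `save_steps` and `eval_steps` are both tied to `logging_steps`.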
Training results
Num examples = 971932
Num Epochs = 3
Instantaneous batch size per device = 64
Total train batch size (w. parallel, distributed & accumulation) = 128
Gradient Accumulation steps = 2
Total optimization steps = 22779
Evaluation Infos:
Num examples = 53996
Batch size = 128
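The figures in this log can be cross-checked with a little arithmetic. The sketch below assumes a single training device (inferred from 64 × 2 = 128) and assumes that optimizer updates per epoch are the number of batches divided by the accumulation steps, rounded down. These are assumptions about how the log was produced, chosen because they reproduce the reported numbers.

```python
import math

num_examples = 971932
per_device_batch_size = 64
gradient_accumulation_steps = 2
num_epochs = 3
num_devices = 1  # assumption: inferred from 64 * 2 = 128 in the log

# Samples consumed per optimizer update.
total_batch_size = per_device_batch_size * num_devices * gradient_accumulation_steps

# Batches per epoch; the last partial batch still counts as one batch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))

# Optimizer updates per epoch (assumed floor division), then over all epochs.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = updates_per_epoch * num_epochs

print(total_batch_size)
print(total_steps)
```

Under these assumptions the result is 128 samples per update and 22779 optimization steps, matching the log.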
| Step | Training Loss | Validation Loss | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| 1000 | 0.113900 | 0.057008 | 0.898600 | 0.938444 | 0.918090 |
| 2000 | 0.052800 | 0.048254 | 0.917243 | 0.941188 | 0.929062 |
| 3000 | 0.046200 | 0.043833 | 0.919706 | 0.948411 | 0.933838 |
| 4000 | 0.043500 | 0.039796 | 0.928439 | 0.947058 | 0.937656 |
| 5000 | 0.041400 | 0.039421 | 0.926103 | 0.952857 | 0.939290 |
| 6000 | 0.039700 | 0.038599 | 0.922376 | 0.956257 | 0.939011 |
| 7000 | 0.037800 | 0.036463 | 0.935125 | 0.950937 | 0.942964 |
| 8000 | 0.035900 | 0.035706 | 0.934638 | 0.954147 | 0.944292 |
| 9000 | 0.033800 | 0.034518 | 0.940354 | 0.951991 | 0.946136 |
| 10000 | 0.033600 | 0.033454 | 0.938170 | 0.956097 | 0.947049 |
| 11000 | 0.032700 | 0.032899 | 0.934130 | 0.959491 | 0.946641 |
| 12000 | 0.032200 | 0.032477 | 0.937400 | 0.959150 | 0.948151 |
| 13000 | 0.031200 | 0.033207 | 0.937058 | 0.960506 | 0.948637 |
| 14000 | 0.031400 | 0.031711 | 0.938765 | 0.959711 | 0.949123 |
| 15000 | 0.030600 | 0.031519 | 0.940488 | 0.959413 | 0.949856 |
| 16000 | 0.028500 | 0.031618 | 0.943643 | 0.957693 | 0.950616 |
| 17000 | 0.028000 | 0.031106 | 0.941109 | 0.960687 | 0.950797 |
| 18000 | 0.027800 | 0.030712 | 0.942821 | 0.960528 | 0.951592 |
| 19000 | 0.027500 | 0.030523 | 0.942950 | 0.960947 | 0.951864 |
| 20000 | 0.027400 | 0.030577 | 0.942462 | 0.961754 | 0.952010 |
| 21000 | 0.027000 | 0.030025 | 0.944483 | 0.960497 | 0.952422 |
| 22000 | 0.026800 | 0.030162 | 0.943868 | 0.961418 | 0.952562 |
Evaluation metrics (test set)
- Num examples = 53997
- overall_precision: 0.9432396865925381
- overall_recall: 0.9614334116769161
- overall_f1: 0.9522496545298874
- overall_accuracy: 0.9894741602608071
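The reported F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch that recomputes it from the overall precision and recall listed above follows; the helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


overall_precision = 0.9432396865925381
overall_recall = 0.9614334116769161
overall_f1 = f1_score(overall_precision, overall_recall)
print(overall_f1)
```

The result agrees with the reported overall_f1 to within floating-point precision.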
| Label | Precision | Recall | F1 | Number of entity examples |
| --- | --- | --- | --- | --- |
| JURISPRUDENCIA | 0.8795197115548148 | 0.9037275221501844 | 0.8914593047810311 | 57223 |
| LEGISLACAO | 0.9405395935529082 | 0.9514071028567378 | 0.9459421362370934 | 84642 |
| LOCAL | 0.9011495452253004 | 0.9132358124779697 | 0.9071524233856495 | 56740 |
| ORGANIZACAO | 0.9239028155165304 | 0.954964947845235 | 0.9391771163875446 | 183013 |
| PESSOA | 0.9651685220572037 | 0.9738545198908279 | 0.9694920661875761 | 193456 |
| TEMPO | 0.973704616066295 | 0.9918808401799004 | 0.9827086882453152 | 186103 |
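The per-label rows and the overall figures are linked by micro-averaging: weighting each label's recall by its number of entity examples recovers the overall recall. The sketch below assumes the overall metrics are micro-averaged, an assumption that the arithmetic supports; the variable names are chosen for illustration.

```python
# (recall, number of entity examples) per label, copied from the table above.
per_label = {
    "JURISPRUDENCIA": (0.9037275221501844, 57223),
    "LEGISLACAO": (0.9514071028567378, 84642),
    "LOCAL": (0.9132358124779697, 56740),
    "ORGANIZACAO": (0.954964947845235, 183013),
    "PESSOA": (0.9738545198908279, 193456),
    "TEMPO": (0.9918808401799004, 186103),
}

# Total gold entities across all labels.
total_support = sum(support for _, support in per_label.values())

# recall * support gives the correctly recovered entities for each label.
true_positives = sum(recall * support for recall, support in per_label.values())

micro_recall = true_positives / total_support
print(total_support)
print(micro_recall)
```

The support-weighted value matches the reported overall_recall of 0.9614334116769161, so the high-frequency labels (PESSOA, TEMPO, ORGANIZACAO) dominate the overall score.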
📄 License
This model is released under the cc-by-4.0 license.
Note
This README is based on the README written by Pierre Guillou, and parts of it are quoted directly.