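🚀 Uzbek Named Entity Recognition Model
This model is a fine-tuned version of [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) on an Uzbek named entity recognition (NER) dataset. It achieves the following results on the evaluation set:
- Loss: 0.1754
- Precision: 0.5848
- Recall: 0.6313
- F1: 0.6071
- Accuracy: 0.9386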
📦 Installation
The documentation provides no installation steps. Follow the official installation guide of the transformers library.
💻 Usage Example
Basic usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Label mapping used to decode the model's class ids into BIO tags.
custom_id2label = {
    0: "O", 1: "B-CARDINAL", 2: "I-CARDINAL", 3: "B-DATE", 4: "I-DATE",
    5: "B-EVENT", 6: "I-EVENT", 7: "B-GPE", 8: "I-GPE", 9: "B-LOC", 10: "I-LOC",
    11: "B-MONEY", 12: "I-MONEY", 13: "B-ORDINAL", 14: "B-ORG", 15: "I-ORG",
    16: "B-PERCENT", 17: "I-PERCENT", 18: "B-PERSON", 19: "I-PERSON",
    20: "B-TIME", 21: "I-TIME"
}
custom_label2id = {v: k for k, v in custom_id2label.items()}

model_name = "mustafoyev202/roberta-uz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: num_labels is 23 while the mapping above lists 22 labels;
# any class id missing from the mapping is decoded as "O" below.
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=23)
model.config.id2label = custom_id2label
model.config.label2id = custom_label2id

text = "Tesla kompaniyasi AQSHda joylashgan."
words = text.split()
tokens = tokenizer(words, return_tensors="pt", is_split_into_words=True)

with torch.no_grad():
    logits = model(**tokens).logits

# One predicted class id per subword token (the batch holds a single sentence).
predicted_token_class_ids = logits.argmax(-1)[0].tolist()

# Assign each word the label predicted for its first subword token.
word_ids = tokens.word_ids()
previous_word_id = None
word_predictions = {}
for i, word_id in enumerate(word_ids):
    if word_id is not None and word_id != previous_word_id:
        word_predictions[word_id] = custom_id2label.get(predicted_token_class_ids[i], "O")
    previous_word_id = word_id

final_predictions = [(word, word_predictions.get(i, "O")) for i, word in enumerate(words)]
print("Predictions:")
for word, label in final_predictions:
    print(f"{word}: {label}")

# Demonstration only: the loss is computed against the model's own predictions,
# not against gold labels, so it is not an evaluation metric.
labels = torch.tensor([predicted_token_class_ids])  # shape: (1, sequence_length)
with torch.no_grad():
    loss = model(**tokens, labels=labels).loss
print("\nLoss:", round(loss.item(), 2))
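The usage example yields one BIO tag per word. A common follow-up step is to group consecutive `B-`/`I-` tags into entity spans. The sketch below is one possible way to do that in plain Python; the function name `merge_bio_tags` and the sample tags are hypothetical illustrations, not part of the model card and not actual model output.

```python
def merge_bio_tags(tagged_words):
    """Group (word, BIO-tag) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged_words:
        prefix, _, entity_type = tag.partition("-")
        # A span starts on "B-X", or on an "I-X" that does not continue the open span.
        starts_new = prefix == "B" or (prefix == "I" and entity_type != current_type)
        if tag == "O" or starts_new:
            # Close the span that was open, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if tag != "O":
            current_words.append(word)
            current_type = entity_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hypothetical tags for illustration only (not real model predictions).
example = [
    ("Tesla", "B-ORG"),
    ("kompaniyasi", "O"),
    ("AQSHda", "B-GPE"),
    ("joylashgan.", "O"),
]
for entity_text, entity_type in merge_bio_tags(example):
    print(f"{entity_type}: {entity_text}")
```

An `I-` tag that appears without a matching open span is treated as the start of a new span here. That is a lenient design choice; a stricter decoder could discard such tags instead.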
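🔧 Technical Details
Training hyperparameters
The following hyperparameters were used during training:
- Learning rate: 1e-05
- Training batch size: 8
- Evaluation batch size: 8
- Random seed: 42
- Gradient accumulation steps: 8
- Total training batch size: 64
- Optimizer: OptimizerNames.ADAMW_TORCH with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- Learning rate scheduler type: cosine_with_restarts
- Learning rate scheduler warmup ratio: 0.08
- Number of epochs: 3
- Mixed precision training: Native AMP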
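To make the batch-size and scheduler settings concrete, the following sketch recomputes the total training batch size (8 per step × 8 accumulation steps = 64, assuming a single device) and approximates a learning rate curve with linear warmup followed by cosine decay with restarts. The total step count of 645 is an assumption, roughly extrapolated from the results table (step 600 at epoch 2.7925), and the single restart cycle is also an assumption; neither value is stated in the model card. The formula is a plain-Python approximation, not the training code itself.

```python
import math

PEAK_LR = 1e-05
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 8
WARMUP_RATIO = 0.08
TOTAL_STEPS = 645  # assumption: estimated, not stated in the model card
NUM_CYCLES = 1     # assumption: number of cosine cycles is not stated


def effective_batch_size(per_device, accum_steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device * accum_steps * num_devices


def lr_at(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR,
          warmup_ratio=WARMUP_RATIO, num_cycles=NUM_CYCLES):
    """Linear warmup, then cosine decay that restarts num_cycles times."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0)))
    return peak_lr * max(0.0, cosine)


print("effective batch size:", effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))
for step in (0, 26, 52, 300, 600, 645):
    print(f"step {step:>3}: lr = {lr_at(step):.3e}")
```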
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.2474 | 0.4662 | 100 | 0.2283 | 0.4911 | 0.5164 | 0.5035 | 0.9284 |
| 0.2039 | 0.9324 | 200 | 0.1942 | 0.5495 | 0.5836 | 0.5661 | 0.9345 |
| 0.1949 | 1.3963 | 300 | 0.1855 | 0.5591 | 0.6348 | 0.5945 | 0.9359 |
| 0.19 | 1.8625 | 400 | 0.1800 | 0.5604 | 0.6279 | 0.5922 | 0.9361 |
| 0.1769 | 2.3263 | 500 | 0.1761 | 0.5806 | 0.6262 | 0.6025 | 0.9381 |
| 0.1765 | 2.7925 | 600 | 0.1754 | 0.5849 | 0.6311 | 0.6071 | 0.9386 |
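F1 is the harmonic mean of precision and recall, so the F1 column can be cross-checked against the other two. The sketch below recomputes it for each evaluation step in the training results. The numbers are copied from those results, and the helper name `f1_score` is only illustrative. Small differences in the last digit are expected because the reported precision and recall are already rounded.

```python
# (step, precision, recall, reported F1), copied from the training results.
rows = [
    (100, 0.4911, 0.5164, 0.5035),
    (200, 0.5495, 0.5836, 0.5661),
    (300, 0.5591, 0.6348, 0.5945),
    (400, 0.5604, 0.6279, 0.5922),
    (500, 0.5806, 0.6262, 0.6025),
    (600, 0.5849, 0.6311, 0.6071),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for step, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"step {step}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```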
Framework versions
- Transformers 4.49.0
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0
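To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch using only the standard library. The PyPI distribution names (`transformers`, `torch`, `datasets`, `tokenizers`) are assumptions about how these frameworks are installed, and the comparison ignores local build suffixes such as `+cu124`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the model card, keyed by assumed PyPI distribution name.
EXPECTED = {
    "transformers": "4.49.0",
    "torch": "2.5.1+cu124",
    "datasets": "3.3.2",
    "tokenizers": "0.21.0",
}


def strip_local(version_string):
    """Drop a local build suffix such as '+cu124'."""
    return version_string.split("+", 1)[0]


def check_versions(expected=EXPECTED):
    """Map each package to (installed version or None, expected version, match flag)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        matches = installed is not None and strip_local(installed) == strip_local(wanted)
        report[package] = (installed, wanted, matches)
    return report


for package, (installed, wanted, matches) in check_versions().items():
    status = "matches" if matches else "differs"
    print(f"{package}: installed={installed}, expected={wanted} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the listed versions are simply the ones recorded at training time.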
📄 License
This model is released under the MIT license.
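📊 Model Information
| Property | Details |
|---|---|
| Model name | Uzbek Named Entity Recognition Model |
| Base model | [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Dataset | risqaliyevds/uzbek_ner |
| Evaluation metrics | Precision, recall, F1, accuracy |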