🚀 Uzbek Named Entity Recognition Model
This model is a fine-tuned version of [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) on an Uzbek named entity recognition (NER) dataset. It achieves the following results on the evaluation set:
- Loss: 0.1754
- Precision: 0.5848
- Recall: 0.6313
- F1: 0.6071
- Accuracy: 0.9386
📦 Installation
The original documentation does not provide installation steps; you can follow the official installation guide for the `transformers` library.
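A typical setup (an assumption, not part of the original card) installs the library together with PyTorch via pip:

```bash
pip install transformers torch
```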
💻 Usage Examples
Basic usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Label mapping used by this model (BIO scheme over the supported entity types).
custom_id2label = {
    0: "O", 1: "B-CARDINAL", 2: "I-CARDINAL", 3: "B-DATE", 4: "I-DATE",
    5: "B-EVENT", 6: "I-EVENT", 7: "B-GPE", 8: "I-GPE", 9: "B-LOC", 10: "I-LOC",
    11: "B-MONEY", 12: "I-MONEY", 13: "B-ORDINAL", 14: "B-ORG", 15: "I-ORG",
    16: "B-PERCENT", 17: "I-PERCENT", 18: "B-PERSON", 19: "I-PERSON",
    20: "B-TIME", 21: "I-TIME"
}
custom_label2id = {v: k for k, v in custom_id2label.items()}

model_name = "mustafoyev202/roberta-uz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=23)

# Override the config's label mappings with the human-readable NER tags.
model.config.id2label = custom_id2label
model.config.label2id = custom_label2id

text = "Tesla kompaniyasi AQSHda joylashgan."

# Tokenize pre-split words so subword predictions can be mapped back to words.
tokens = tokenizer(text.split(), return_tensors="pt", is_split_into_words=True)

with torch.no_grad():
    logits = model(**tokens).logits

predicted_token_class_ids = logits.argmax(-1).squeeze().tolist()

# Keep only the prediction for the first subword token of each word.
word_ids = tokens.word_ids()
previous_word_id = None
word_predictions = {}
for i, word_id in enumerate(word_ids):
    if word_id is not None:
        label = custom_id2label[predicted_token_class_ids[i]]
        if word_id != previous_word_id:
            word_predictions[word_id] = label
        previous_word_id = word_id

words = text.split()
final_predictions = [(word, word_predictions.get(i, "O")) for i, word in enumerate(words)]

print("Predictions:")
for word, label in final_predictions:
    print(f"{word}: {label}")

# Feeding the predicted ids back as labels yields a loss; labels must have
# shape (batch_size, sequence_length).
labels = torch.tensor([predicted_token_class_ids])
loss = model(**tokens, labels=labels).loss
print("\nLoss:", round(loss.item(), 2))
```
🔧 Technical Details
Training hyperparameters
The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
- Learning rate: 1e-05
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Gradient accumulation steps: 8
- Total train batch size: 64
- Optimizer: OptimizerNames.ADAMW_TORCH with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- LR scheduler type: cosine_with_restarts
- LR scheduler warmup ratio: 0.08
- Number of epochs: 3
- Mixed precision training: Native AMP
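For reference, the settings above map onto `transformers.TrainingArguments` roughly as follows. This is a hedged sketch rather than the authors' actual training script; `output_dir` is an assumed placeholder.

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters; output_dir is an assumed placeholder.
training_args = TrainingArguments(
    output_dir="roberta-uz-ner",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,   # effective total train batch size: 64
    num_train_epochs=3,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.08,
    seed=42,
    fp16=True,                       # native AMP mixed precision
)
```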
Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.2474 | 0.4662 | 100 | 0.2283 | 0.4911 | 0.5164 | 0.5035 | 0.9284 |
| 0.2039 | 0.9324 | 200 | 0.1942 | 0.5495 | 0.5836 | 0.5661 | 0.9345 |
| 0.1949 | 1.3963 | 300 | 0.1855 | 0.5591 | 0.6348 | 0.5945 | 0.9359 |
| 0.19 | 1.8625 | 400 | 0.1800 | 0.5604 | 0.6279 | 0.5922 | 0.9361 |
| 0.1769 | 2.3263 | 500 | 0.1761 | 0.5806 | 0.6262 | 0.6025 | 0.9381 |
| 0.1765 | 2.7925 | 600 | 0.1754 | 0.5849 | 0.6311 | 0.6071 | 0.9386 |
Framework versions
- Transformers 4.49.0
- Pytorch 2.5.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.0
📄 License
This model is released under the MIT license.
📊 Model Information

| Property | Details |
|:---|:---|
| Model name | Uzbek Named Entity Recognition Model |
| Base model | [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) |
| Dataset | risqaliyevds/uzbek_ner |
| Evaluation metrics | Precision, Recall, F1, Accuracy |