tf-xlm-r-ner-40-lang open-source model: multilingual named entity recognition for 40 languages

Tf Xlm R Ner 40 Lang

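Developed by jplu

A multilingual named entity recognition (NER) model based on XLM-Roberta-base that recognizes entities in 40 languages

Sequence labeling

Transformers

Multilingual #Multilingual NER #Cross-lingual entity recognition #XLM-R fine-tuning

Downloads: 969

Released: 3/2/2022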

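Model overview

This model is XLM-Roberta-base fine-tuned for named entity recognition on 40 languages. It recognizes three entity types: location (LOC), organization (ORG), and person (PER).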

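Model features

Multilingual support

Named entity recognition for 40 languages, including major European, Asian, and African languages

Performance

Averaged over the 40 languages, the F1 score on the evaluation set is 0.87; person (PER) recognition reaches an F1 of 0.91

Based on XLM-Roberta

Fine-tuned from XLM-Roberta-base, whose pretrained representations transfer well across languages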

Model capabilities

Multilingual text processing

Named entity recognition

Cross-lingual entity recognition

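Use cases

Information extraction

Multilingual news analysis

Extract people, organizations, and locations from news text in different languages

Identifies the key entities in text across languages

Cross-lingual document processing

Process documents that mix several languages and extract their named entities in one pass

A single model covers all 40 languages, so processing is uniform

Knowledge graph construction

Multilingual knowledge graphs

Extract entities from data sources in different languages to build a cross-lingual knowledge graph

The shared label set (LOC, ORG, PER) keeps entity recognition consistent across languages, which supports multilingual knowledge fusion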

🚀 XLM-R + NER

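This model is XLM-Roberta-base fine-tuned on the Wikiann dataset for the 40 languages proposed in XTREME. It is still a work in progress, and the results are updated whenever an improvement is made.

The covered labels are: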

LOC
ORG
PER
O

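🚀 Quick start

Reproducing the results

Download and prepare the dataset from the XTREME repository. Then, from the root of the transformers repository, run the following commands: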

cd examples/ner
python run_tf_ner.py \
--data_dir . \
--labels ./labels.txt \
--model_name_or_path jplu/tf-xlm-roberta-base \
--output_dir model \
--max_seq_length 128 \
--num_train_epochs 2 \
--per_gpu_train_batch_size 16 \
--per_gpu_eval_batch_size 32 \
--do_train \
--do_eval \
--logging_dir logs \
--mode token-classification \
--evaluate_during_training \
--optimizer_name adamw

Inference with a pipeline

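from transformers import pipeline

# Load the fine-tuned checkpoint as a TensorFlow NER pipeline.
nlp_ner = pipeline(
    "ner",
    model="jplu/tf-xlm-r-ner-40-lang",
    # (name, kwargs) tuple: the kwargs are passed on when the tokenizer is loaded.
    tokenizer=(
        'jplu/tf-xlm-r-ner-40-lang',
        {"use_fast": True}),
    framework="tf"
)

# The same sentence in five languages.
text_fr = "Barack Obama est né à Hawaï."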
text_en = "Barack Obama was born in Hawaii."
text_es = "Barack Obama nació en Hawai."
text_zh = "巴拉克·奧巴馬（Barack Obama）出生於夏威夷。"
text_ar = "ولد باراك أوباما في هاواي."

nlp_ner(text_fr)
#Output: [{'word': '▁Barack', 'score': 0.9894659519195557, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9888848662376404, 'entity': 'PER'}, {'word': '▁Hawa', 'score': 0.998701810836792, 'entity': 'LOC'}, {'word': 'ï', 'score': 0.9987035989761353, 'entity': 'LOC'}]
nlp_ner(text_en)
#Output: [{'word': '▁Barack', 'score': 0.9929141998291016, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9930834174156189, 'entity': 'PER'}, {'word': '▁Hawaii', 'score': 0.9986202120780945, 'entity': 'LOC'}]
nlp_ner(text_es)
#Output: [{'word': '▁Barack', 'score': 0.9944776296615601, 'entity': 'PER'}, {'word': '▁Obama', 'score': 0.9949177503585815, 'entity': 'PER'}, {'word': '▁Hawa', 'score': 0.9987911581993103, 'entity': 'LOC'}, {'word': 'i', 'score': 0.9984861612319946, 'entity': 'LOC'}]
nlp_ner(text_zh)
#Output: [{'word': '夏威夷', 'score': 0.9988449215888977, 'entity': 'LOC'}]
nlp_ner(text_ar)
#Output: [{'word': '▁با', 'score': 0.9903655648231506, 'entity': 'PER'}, {'word': 'راك', 'score': 0.9850614666938782, 'entity': 'PER'}, {'word': '▁أوباما', 'score': 0.9850308299064636, 'entity': 'PER'}, {'word': '▁ها', 'score': 0.9477543234825134, 'entity': 'LOC'}, {'word': 'وا', 'score': 0.9428229928016663, 'entity': 'LOC'}, {'word': 'ي', 'score': 0.9319471716880798, 'entity': 'LOC'}]
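
The pipeline output above lists SentencePiece pieces rather than whole entities (for example `▁Hawa` followed by `ï`). A minimal sketch of stitching the pieces back into entity spans follows. The helper name `merge_pieces` and the averaging of scores are illustrative choices, not part of the Transformers API. The sketch also assumes that neighbouring pieces in the list are adjacent in the text, which this output format cannot confirm because it carries no token positions.

```python
from statistics import mean

WORD_START = "\u2581"  # SentencePiece marker that prefixes the first piece of a word


def merge_pieces(tokens):
    """Group consecutive pieces that share a label into entity spans.

    Assumption: pieces that are neighbours in the list are also neighbours
    in the original text.
    """
    spans = []
    for tok in tokens:
        piece, label = tok["word"], tok["entity"]
        starts_word = piece.startswith(WORD_START)
        text = piece[len(WORD_START):] if starts_word else piece
        if spans and spans[-1]["entity"] == label:
            # Same label as the previous piece: extend the current span,
            # inserting a space only when a new word begins.
            spans[-1]["text"] += (" " if starts_word else "") + text
            spans[-1]["scores"].append(tok["score"])
        else:
            spans.append({"entity": label, "text": text, "scores": [tok["score"]]})
    # Report the mean piece score per span (an illustrative choice).
    return [
        {"entity": s["entity"], "text": s["text"], "score": mean(s["scores"])}
        for s in spans
    ]


# Pieces copied from the French output above.
french = [
    {"word": WORD_START + "Barack", "score": 0.9894659519195557, "entity": "PER"},
    {"word": WORD_START + "Obama", "score": 0.9888848662376404, "entity": "PER"},
    {"word": WORD_START + "Hawa", "score": 0.998701810836792, "entity": "LOC"},
    {"word": "ï", "score": 0.9987035989761353, "entity": "LOC"},
]

for span in merge_pieces(french):
    print(span["entity"], span["text"], round(span["score"], 4))
```

Run on the French pieces, this yields one PER span, Barack Obama, and one LOC span, Hawaï. Two distinct entities of the same type that sit next to each other in the list would be merged into one span, which is the cost of the adjacency assumption.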

📚 Detailed documentation

Metrics on the evaluation set

Average over the 40 languages

Number of documents: 262300

           precision    recall  f1-score   support

      ORG       0.81      0.81      0.81    102452
      PER       0.90      0.91      0.91    108978
      LOC       0.86      0.89      0.87    121868

micro avg       0.86      0.87      0.87    333298
macro avg       0.86      0.87      0.87    333298
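
For readers who want to check the table, the F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rows above and also forms a support-weighted average of the per-class rows. The function names are illustrative, and the weighting is only an arithmetic cross-check, not necessarily how the report was produced. Because the table values are already rounded to two decimals, recomputed figures can differ from the reported ones in the last digit.

```python
# Rows copied from the 40-language average table: (precision, recall, support).
ROWS = {
    "ORG": (0.81, 0.81, 102452),
    "PER": (0.90, 0.91, 108978),
    "LOC": (0.86, 0.89, 121868),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted(values, weights):
    """Average of values, each weighted by its support count."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


precisions = [p for p, _, _ in ROWS.values()]
recalls = [r for _, r, _ in ROWS.values()]
supports = [s for _, _, s in ROWS.values()]

for label, (p, r, _) in ROWS.items():
    print(label, "recomputed F1:", round(f1(p, r), 3))

print("total support:", sum(supports))
print("support-weighted precision:", round(weighted(precisions, supports), 2))
print("support-weighted recall:", round(weighted(recalls, supports), 2))
```

The supports add up to the 333298 shown in the average rows, and the weighted precision and recall round to the 0.86 and 0.87 reported there.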

Afrikaans

Number of documents: 1000

           precision    recall  f1-score   support

      ORG       0.89      0.88      0.88       582
      PER       0.89      0.97      0.93       369
      LOC       0.84      0.90      0.86       518

micro avg       0.87      0.91      0.89      1469
macro avg       0.87      0.91      0.89      1469

Arabic

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.83      0.84      0.84      3507
      PER       0.90      0.91      0.91      3643
      LOC       0.88      0.89      0.88      3604

micro avg       0.87      0.88      0.88     10754
macro avg       0.87      0.88      0.88     10754

Basque

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.88      0.93      0.91      5228
      ORG       0.86      0.81      0.83      3654
      PER       0.91      0.91      0.91      4072

micro avg       0.89      0.89      0.89     12954
macro avg       0.89      0.89      0.89     12954

Bengali

Number of documents: 1000

           precision    recall  f1-score   support

      ORG       0.86      0.89      0.87       325
      LOC       0.91      0.91      0.91       406
      PER       0.96      0.95      0.95       364

micro avg       0.91      0.92      0.91      1095
macro avg       0.91      0.92      0.91      1095

Bulgarian

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.86      0.83      0.84      3661
      PER       0.92      0.95      0.94      4006
      LOC       0.92      0.95      0.94      6449

micro avg       0.91      0.92      0.91     14116
macro avg       0.91      0.92      0.91     14116

Burmese

Number of documents: 100

           precision    recall  f1-score   support

      LOC       0.60      0.86      0.71        37
      ORG       0.68      0.63      0.66        30
      PER       0.44      0.44      0.44        36

micro avg       0.57      0.65      0.61       103
macro avg       0.57      0.65      0.60       103

Chinese

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.70      0.69      0.70      4022
      LOC       0.76      0.81      0.78      3830
      PER       0.84      0.84      0.84      3706

micro avg       0.76      0.78      0.77     11558
macro avg       0.76      0.78      0.77     11558

Dutch

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.87      0.87      0.87      3930
      PER       0.95      0.95      0.95      4377
      LOC       0.91      0.92      0.91      4813

micro avg       0.91      0.92      0.91     13120
macro avg       0.91      0.92      0.91     13120

English

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.83      0.84      0.84      4781
      PER       0.89      0.90      0.89      4559
      ORG       0.75      0.75      0.75      4633

micro avg       0.82      0.83      0.83     13973
macro avg       0.82      0.83      0.83     13973

Estonian

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.89      0.92      0.91      5654
      ORG       0.85      0.85      0.85      3878
      PER       0.94      0.94      0.94      4026

micro avg       0.90      0.91      0.90     13558
macro avg       0.90      0.91      0.90     13558

Finnish

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.84      0.83      0.84      4104
      LOC       0.88      0.90      0.89      5307
      PER       0.95      0.94      0.94      4519

micro avg       0.89      0.89      0.89     13930
macro avg       0.89      0.89      0.89     13930

French

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.90      0.89      0.89      4808
      ORG       0.84      0.87      0.85      3876
      PER       0.94      0.93      0.94      4249

micro avg       0.89      0.90      0.90     12933
macro avg       0.89      0.90      0.90     12933

Georgian

Number of documents: 10000

           precision    recall  f1-score   support

      PER       0.90      0.91      0.90      3964
      ORG       0.83      0.77      0.80      3757
      LOC       0.82      0.88      0.85      4894

micro avg       0.84      0.86      0.85     12615
macro avg       0.84      0.86      0.85     12615

German

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.85      0.90      0.87      4939
      PER       0.94      0.91      0.92      4452
      ORG       0.79      0.78      0.79      4247

micro avg       0.86      0.86      0.86     13638
macro avg       0.86      0.86      0.86     13638

Greek

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.86      0.85      0.85      3771
      LOC       0.88      0.91      0.90      4436
      PER       0.91      0.93      0.92      3894

micro avg       0.88      0.90      0.89     12101
macro avg       0.88      0.90      0.89     12101

Hebrew

Number of documents: 10000

           precision    recall  f1-score   support

      PER       0.87      0.88      0.87      4206
      ORG       0.76      0.75      0.76      4190
      LOC       0.85      0.85      0.85      4538

micro avg       0.83      0.83      0.83     12934
macro avg       0.82      0.83      0.83     12934

Hindi

Number of documents: 1000

           precision    recall  f1-score   support

      ORG       0.78      0.81      0.79       362
      LOC       0.83      0.85      0.84       422
      PER       0.90      0.95      0.92       427

micro avg       0.84      0.87      0.85      1211
macro avg       0.84      0.87      0.85      1211

Hungarian

Number of documents: 10000

           precision    recall  f1-score   support

      PER       0.95      0.95      0.95      4347
      ORG       0.87      0.88      0.87      3988
      LOC       0.90      0.92      0.91      5544

micro avg       0.91      0.92      0.91     13879
macro avg       0.91      0.92      0.91     13879

Indonesian

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.88      0.89      0.88      3735
      LOC       0.93      0.95      0.94      3694
      PER       0.93      0.93      0.93      3947

micro avg       0.91      0.92      0.92     11376
macro avg       0.91      0.92      0.92     11376

Italian

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.88      0.88      0.88      4592
      ORG       0.86      0.86      0.86      4088
      PER       0.96      0.96      0.96      4732

micro avg       0.90      0.90      0.90     13412
macro avg       0.90      0.90      0.90     13412

Japanese

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.62      0.61      0.62      4184
      PER       0.76      0.81      0.78      3812
      LOC       0.68      0.74      0.71      4281

micro avg       0.69      0.72      0.70     12277
macro avg       0.69      0.72      0.70     12277

Javanese

Number of documents: 100

           precision    recall  f1-score   support

      ORG       0.79      0.80      0.80        46
      PER       0.81      0.96      0.88        26
      LOC       0.75      0.75      0.75        40

micro avg       0.78      0.82      0.80       112
macro avg       0.78      0.82      0.80       112

Kazakh

Number of documents: 1000

           precision    recall  f1-score   support

      ORG       0.76      0.61      0.68       307
      LOC       0.78      0.90      0.84       461
      PER       0.87      0.91      0.89       367

micro avg       0.81      0.83      0.82      1135
macro avg       0.81      0.83      0.81      1135

Korean

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.86      0.89      0.88      5097
      ORG       0.79      0.74      0.77      4218
      PER       0.83      0.86      0.84      4014

micro avg       0.83      0.83      0.83     13329
macro avg       0.83      0.83      0.83     13329

Malay

Number of documents: 1000

           precision    recall  f1-score   support

      ORG       0.87      0.89      0.88       368
      PER       0.92      0.91      0.91       366
      LOC       0.94      0.95      0.95       354

micro avg       0.91      0.92      0.91      1088
macro avg       0.91      0.92      0.91      1088

Malayalam

Number of documents: 1000

           precision    recall  f1-score   support

      ORG       0.75      0.74      0.75       347
      PER       0.84      0.89      0.86       417
      LOC       0.74      0.75      0.75       391

micro avg       0.78      0.80      0.79      1155
macro avg       0.78      0.80      0.79      1155

Marathi

Number of documents: 1000

           precision    recall  f1-score   support

      PER       0.89      0.94      0.92       394
      LOC       0.82      0.84      0.83       457
      ORG       0.84      0.78      0.81       339

micro avg       0.85      0.86      0.85      1190
macro avg       0.85      0.86      0.85      1190

Persian

Number of documents: 10000

           precision    recall  f1-score   support

      PER       0.93      0.92      0.93      3540
      LOC       0.93      0.93      0.93      3584
      ORG       0.89      0.92      0.90      3370

micro avg       0.92      0.92      0.92     10494
macro avg       0.92      0.92      0.92     10494

Portuguese

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.90      0.91      0.91      4819
      PER       0.94      0.92      0.93      4184
      ORG       0.84      0.88      0.86      3670

micro avg       0.89      0.91      0.90     12673
macro avg       0.90      0.91      0.90     12673

Russian

Number of documents: 10000

           precision    recall  f1-score   support

      PER       0.93      0.96      0.95      3574
      LOC       0.87      0.89      0.88      4619
      ORG       0.82      0.80      0.81      3858

micro avg       0.87      0.88      0.88     12051
macro avg       0.87      0.88      0.88     12051

Spanish

Number of documents: 10000

           precision    recall  f1-score   support

      PER       0.95      0.93      0.94      3891
      ORG       0.86      0.88      0.87      3709
      LOC       0.89      0.91      0.90      4553

micro avg       0.90      0.91      0.90     12153
macro avg       0.90      0.91      0.90     12153

Swahili

Number of documents: 1000

           precision    recall  f1-score   support

      ORG       0.82      0.85      0.83       349
      PER       0.95      0.92      0.94       403
      LOC       0.86      0.89      0.88       450

micro avg       0.88      0.89      0.88      1202
macro avg       0.88      0.89      0.88      1202

Tagalog

Number of documents: 1000

           precision    recall  f1-score   support

      LOC       0.90      0.91      0.90       338
      ORG       0.83      0.91      0.87       339
      PER       0.96      0.93      0.95       350

micro avg       0.90      0.92      0.91      1027
macro avg       0.90      0.92      0.91      1027

Tamil

Number of documents: 1000

           precision    recall  f1-score   support

      PER       0.90      0.92      0.91       392
      ORG       0.77      0.76      0.76       370
      LOC       0.78      0.81      0.79       421

micro avg       0.82      0.83      0.82      1183
macro avg       0.82      0.83      0.82      1183

Telugu

Number of documents: 1000

           precision    recall  f1-score   support

      ORG       0.67      0.55      0.61       347
      LOC       0.78      0.87      0.82       453
      PER       0.73      0.86      0.79       393

micro avg       0.74      0.77      0.76      1193
macro avg       0.73      0.77      0.75      1193

Thai

Number of documents: 10000

           precision    recall  f1-score   support

      LOC       0.63      0.76      0.69      3928
      PER       0.78      0.83      0.80      6537
      ORG       0.59      0.59      0.59      4257

micro avg       0.68      0.74      0.71     14722
macro avg       0.68      0.74      0.71     14722

Turkish

Number of documents: 10000

           precision    recall  f1-score   support

      PER       0.94      0.94      0.94      4337
      ORG       0.88      0.89      0.88      4094
      LOC       0.90      0.92      0.91      4929

micro avg       0.90      0.92      0.91     13360
macro avg       0.91      0.92      0.91     13360

Urdu

Number of documents: 1000

           precision    recall  f1-score   support

      LOC       0.90      0.95      0.93       352
      PER       0.96      0.96      0.96       333
      ORG       0.91      0.90      0.90       326

micro avg       0.92      0.94      0.93      1011
macro avg       0.92      0.94      0.93      1011

Vietnamese

Number of documents: 10000

           precision    recall  f1-score   support

      ORG       0.86      0.87      0.86      3579
      LOC       0.88      0.91      0.90      3811
      PER       0.92      0.93      0.93      3717

micro avg       0.89      0.90      0.90     11107
macro avg       0.89      0.90      0.90     11107

Yoruba

Number of documents: 100

           precision    recall  f1-score   support

      LOC       0.54      0.72      0.62        36
      ORG       0.58      0.31      0.41        35
      PER       0.77      1.00      0.87        36

micro avg       0.64      0.68      0.66       107
macro avg       0.63      0.68      0.63       107
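
The per-language tables share one plain-text layout, so they can be read into a dictionary for comparison. The sketch below is one way to do that; `parse_report` is a hypothetical helper written for this card's layout, not a library function. It is shown on the Yoruba table.

```python
def parse_report(text):
    """Parse a plain-text classification report into {row_name: metrics}."""
    rows = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # Data rows end with three floats and an integer support count,
        # preceded by a one- or two-word row name.
        if len(parts) < 5:
            continue  # blank line or the column header
        name = " ".join(parts[:-4])
        try:
            precision, recall, f1 = (float(x) for x in parts[-4:-1])
            support = int(parts[-1])
        except ValueError:
            continue  # not a data row
        rows[name] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support,
        }
    return rows


# Table copied from the Yoruba section above.
YORUBA = """
           precision    recall  f1-score   support

      LOC       0.54      0.72      0.62        36
      ORG       0.58      0.31      0.41        35
      PER       0.77      1.00      0.87        36

micro avg       0.64      0.68      0.66       107
macro avg       0.63      0.68      0.63       107
"""

report = parse_report(YORUBA)
# Keep only the per-class rows, then find the class with the lowest F1.
classes = {name: m for name, m in report.items() if not name.endswith("avg")}
weakest = min(classes, key=lambda name: classes[name]["f1"])
print(weakest, classes[weakest])
```

On this table the weakest class is ORG, with an F1 of 0.41 on a support of 35. The same helper applied to each language block would allow ranking languages by any column.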