xlm-roberta-large-finetuned-conll03-english open-source model: free English named entity recognition

Xlm Roberta Large Finetuned Conll03 English

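Developed by FacebookAI

A named entity recognition model obtained by fine-tuning XLM-RoBERTa-large on the English conll2003 dataset

Sequence labeling, multilingual support #MultilingualNER #HighAccuracyEntityRecognition #CoNLL2003FineTuning

Downloads: 84.75k

Released: 3/2/2022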

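Model Overview

This model is a fine-tuned version of XLM-RoBERTa-large, specialized for named entity recognition in English text. It identifies entities such as person names and place names.

Model Features

Multilingual pre-training

Built on XLM-RoBERTa-large, which supports 100 languages

Task-specific fine-tuning

Fine-tuned specifically on the standard NER dataset conll2003

High accuracy

Performs well on English NER tasks

Model Capabilities

Named entity recognition

Token classification

English text processing

Use Cases

Information extraction

News entity extraction

Extracts key information such as person names and place names from news text

Can accurately identify the various kinds of named entities in text

Automated document processing

Automatically processes entity information in legal or medical documents

Improves document-processing efficiency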

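🚀 XLM-RoBERTa-large fine-tuned on the CoNLL03 English dataset

This project fine-tunes XLM-RoBERTa-large on the CoNLL03 English dataset. The resulting model can be used for English named entity recognition and related natural language processing tasks.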

🚀 Quick Start

Use the code below to get started with the model. You can use it directly within a pipeline for named entity recognition (NER).

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Hello I'm Omar and I live in Zürich.")

[{'end': 14,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.9999175,
  'start': 10,
  'word': '▁Omar'},
 {'end': 35,
  'entity': 'I-LOC',
  'index': 10,
  'score': 0.9999906,
  'start': 29,
  'word': '▁Zürich'}]
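
The `start` and `end` fields in each prediction are character offsets into the input string, so the surface form of an entity can be recovered by slicing the input rather than by cleaning up the `word` field. The following is a minimal sketch that groups the predictions shown above by entity type. The helper name `group_by_type` and the `min_score` threshold are illustrative assumptions, not part of the model or the `transformers` API, and the predictions are copied in as plain data so the snippet runs without downloading the model.

```python
from collections import defaultdict

text = "Hello I'm Omar and I live in Zürich."

# Predictions copied from the pipeline output shown above.
predictions = [
    {"end": 14, "entity": "I-PER", "index": 5, "score": 0.9999175,
     "start": 10, "word": "▁Omar"},
    {"end": 35, "entity": "I-LOC", "index": 10, "score": 0.9999906,
     "start": 29, "word": "▁Zürich"},
]


def group_by_type(text, predictions, min_score=0.5):
    """Group entity strings by type (hypothetical helper, not a library call)."""
    grouped = defaultdict(list)
    for pred in predictions:
        # min_score is an illustrative cut-off; tune or drop it as needed.
        if pred["score"] < min_score:
            continue
        # "I-PER" -> "PER": drop the position prefix of the tag.
        label = pred["entity"].split("-", 1)[-1]
        # Slice the original text using the character offsets.
        grouped[label].append(text[pred["start"]:pred["end"]])
    return dict(grouped)


entities_by_type = group_by_type(text, predictions)
print(entities_by_type)  # {'PER': ['Omar'], 'LOC': ['Zürich']}
```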

✨ Key Features

Model Description

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model, released in 2019, and is a large multilingual language model trained on 2.5TB of filtered CommonCrawl data. This model is XLM-RoBERTa-large fine-tuned on the conll2003 English dataset.

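Property	Details
Developed by	See the associated paper
Model type	Multilingual language model
Language(s)	XLM-RoBERTa is a multilingual model trained on 100 different languages; see the GitHub repository for the full list; this model is fine-tuned on an English dataset
License	More information needed
Related models	RoBERTa, XLM
Parent model	XLM-RoBERTa-large
Resources for more information	GitHub repository; associated paper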

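Uses

Direct Use

The model is a language model that can be used for token classification, a natural language understanding task in which labels are assigned to some tokens in a text.

Downstream Use

Potential downstream use cases include named entity recognition (NER) and part-of-speech (PoS) tagging. To learn more about token classification and other potential downstream use cases, see the Hugging Face token classification documentation.

Out-of-Scope Use

The model should not be used to intentionally create hostile or alienating environments for people.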

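🔧 Technical Details

Bias, Risks, and Limitations

⚠️ Important note

Readers should be aware that language generated by this model may be disturbing or offensive to some and may propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). In the context of tasks relevant to this model, Mishra et al. (2020) explore social biases in English NER systems and find a systematic bias in existing NER systems: they fail to identify named entities from different demographic groups (though that paper did not look at BERT). For example, using a sample sentence from Mishra et al. (2020):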

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
>>> classifier = pipeline("ner", model=model, tokenizer=tokenizer)
>>> classifier("Alya told Jasmine that Andrew could pay with cash..")
[{'end': 2,
  'entity': 'I-PER',
  'index': 1,
  'score': 0.9997861,
  'start': 0,
  'word': '▁Al'},
 {'end': 4,
  'entity': 'I-PER',
  'index': 2,
  'score': 0.9998591,
  'start': 2,
  'word': 'ya'},
 {'end': 16,
  'entity': 'I-PER',
  'index': 4,
  'score': 0.99995816,
  'start': 10,
  'word': '▁Jasmin'},
 {'end': 17,
  'entity': 'I-PER',
  'index': 5,
  'score': 0.9999584,
  'start': 16,
  'word': 'e'},
 {'end': 29,
  'entity': 'I-PER',
  'index': 7,
  'score': 0.99998057,
  'start': 23,
  'word': '▁Andrew'}]
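
In the output above, "Alya" and "Jasmine" are each split into two subword pieces that are tagged separately. When checking whether a name was recognized as a whole, it helps to merge the pieces back into entity spans first. The sketch below does this with the character offsets: consecutive tokens of the same type are joined when only whitespace (or nothing) separates them. `merge_entities` is a hypothetical helper written for this illustration, the merge rule is an assumption rather than the pipeline's own grouping logic, and the merged score is simply the minimum of the piece scores.

```python
text = "Alya told Jasmine that Andrew could pay with cash.."

# Token-level predictions copied from the output above
# (the index and word fields are omitted for brevity).
tokens = [
    {"entity": "I-PER", "score": 0.9997861, "start": 0, "end": 2},
    {"entity": "I-PER", "score": 0.9998591, "start": 2, "end": 4},
    {"entity": "I-PER", "score": 0.99995816, "start": 10, "end": 16},
    {"entity": "I-PER", "score": 0.9999584, "start": 16, "end": 17},
    {"entity": "I-PER", "score": 0.99998057, "start": 23, "end": 29},
]


def merge_entities(text, tokens):
    """Merge token-level predictions into entity spans (illustrative only)."""
    spans = []
    for tok in sorted(tokens, key=lambda t: t["start"]):
        prefix, _, label = tok["entity"].partition("-")
        prev = spans[-1] if spans else None
        joinable = (
            prev is not None
            and prefix != "B"            # a B- tag always opens a new entity
            and prev["label"] == label   # same entity type
            # only whitespace (or nothing) between the two tokens
            and text[prev["end"]:tok["start"]].strip() == ""
        )
        if joinable:
            prev["end"] = tok["end"]
            prev["score"] = min(prev["score"], tok["score"])
        else:
            spans.append({"label": label, "start": tok["start"],
                          "end": tok["end"], "score": tok["score"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


merged = merge_entities(text, tokens)
print([(s["text"], s["label"]) for s in merged])
# [('Alya', 'PER'), ('Jasmine', 'PER'), ('Andrew', 'PER')]
```

The `transformers` NER pipeline also accepts an `aggregation_strategy` argument that groups tokens for you; the manual version here is only meant to make the offsets in the raw output easier to follow.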

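Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Training

See the following resources for details on the training data and training procedure:

Evaluation

See the associated paper for evaluation details.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).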

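Property	Details
Hardware type	500 32GB Nvidia V100 GPUs (from the associated paper)
Hours used	More information needed
Cloud provider	More information needed
Compute region	More information needed
Carbon emitted	More information needed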
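
Because the hours used, provider and region are not reported, no emissions figure can be derived for this model. The sketch below only illustrates the general shape of such an estimate: energy consumed multiplied by the carbon intensity of the grid. It is not the calculator's actual implementation. The function name is hypothetical, and apart from the GPU count taken from the table, every input in the example call (per-GPU power draw, hours, carbon intensity) is a placeholder chosen for illustration, not a measurement.

```python
def estimate_emissions_kg(num_gpus, gpu_power_watts, hours,
                          carbon_intensity_kg_per_kwh, pue=1.0):
    """Rough CO2-equivalent estimate in kg (illustrative formula only).

    pue is the data-centre power usage effectiveness; 1.0 means no overhead.
    """
    # Energy in kWh: devices x power per device (kW) x time x overhead.
    energy_kwh = num_gpus * gpu_power_watts / 1000 * hours * pue
    return energy_kwh * carbon_intensity_kg_per_kwh


# Placeholder inputs: only num_gpus=500 comes from the table above.
example_kg = estimate_emissions_kg(
    num_gpus=500,
    gpu_power_watts=300,              # assumed power draw per GPU
    hours=1.0,                        # assumed; real usage is not reported
    carbon_intensity_kg_per_kwh=0.5,  # assumed grid intensity
)
print(example_kg)
```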

Technical Specifications

See the associated paper for further details.

📄 License

More information needed.

📚 Detailed Documentation

Citation

BibTeX

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

APA

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., ... & Stoyanov, V. (2019). Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.