wikineural-multilingual-ner open-source model: free deployment of named entity recognition for 9 languages

Wikineural Multilingual Ner

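Developed by Babelscape

A multilingual named entity recognition model that combines neural and knowledge-based methods and supports 9 languages

Sequence labeling

Transformers

Multilingual support #Multilingual NER #Wikipedia adaptation #Knowledge-base enhancement

Downloads: 258.08k

Released: 3/2/2022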

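Model overview

The model is trained on a multilingual NER dataset built automatically from Wikipedia by combining neural and knowledge-based methods. It is designed to recognize named entities in text.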

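Model features

Multilingual support

Recognizes named entities in 9 languages, including the major European languages

Knowledge-base enhancement

Uses Wikipedia knowledge-base information to improve recognition accuracy

Joint training

Trained jointly on all 9 languages to improve generalization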

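Model capabilities

Recognizes person names in text

Recognizes location names in text

Recognizes organization names in text

Processes multilingual text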

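Use cases

Information extraction

Wikipedia text analysis

Extract named entities from Wikipedia articles

Works well on entities in Wikipedia-style text

Multilingual document processing

Tag named entities in documents that contain several languages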
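
For multilingual document processing, long inputs usually have to be split before tagging, because mBERT-style models accept a limited number of tokens per input. The sketch below is one possible approach, not part of the original model card: `tag_long_text`, `fake_tagger`, and the `max_chars` budget are hypothetical names and values. It tags a document paragraph by paragraph and shifts each span back to document-level character offsets. The stand-in tagger only exists so the example runs without downloading the model.

```python
def tag_long_text(tagger, text, max_chars=1000):
    """Tag `text` paragraph by paragraph and shift spans to document offsets.

    `tagger` is any callable mapping a string to a list of dicts that carry
    `start`/`end` character offsets, e.g. an aggregated Transformers NER pipeline.
    `max_chars` is an assumed character budget, not a value from the model card.
    """
    entities = []
    offset = 0
    for paragraph in text.split("\n"):
        # Hard-split very long paragraphs; an entity crossing a cut may be missed.
        for i in range(0, len(paragraph), max_chars):
            chunk = paragraph[i:i + max_chars]
            for ent in tagger(chunk):
                shifted = dict(ent)
                shifted["start"] = ent["start"] + offset + i
                shifted["end"] = ent["end"] + offset + i
                entities.append(shifted)
        offset += len(paragraph) + 1  # +1 for the newline removed by split()
    return entities


def fake_tagger(chunk):
    # Stand-in for a real pipeline: marks every occurrence of "Berlin" as LOC.
    out, pos = [], chunk.find("Berlin")
    while pos != -1:
        out.append({"entity_group": "LOC", "word": "Berlin",
                    "start": pos, "end": pos + len("Berlin")})
        pos = chunk.find("Berlin", pos + 1)
    return out


doc = "Ich wohne in Berlin.\nJ'habite à Berlin."
for ent in tag_long_text(fake_tagger, doc):
    print(ent["entity_group"], doc[ent["start"]:ent["end"]])
```

In practice `fake_tagger` would be replaced by the pipeline object built in the usage example. Splitting on sentence boundaries rather than fixed character windows would reduce the chance of cutting an entity in half.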

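🚀 WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER

WikiNEuRal is a resource for multilingual named entity recognition (NER) whose training corpora are built by combining neural and knowledge-based methods. The model described here is multilingual BERT (mBERT) fine-tuned on the WikiNEuRal dataset, and it supports 9 languages.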

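🚀 Quick start

This is the model card for the EMNLP 2021 paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. We fine-tuned a multilingual language model (mBERT) for 3 epochs on the WikiNEuRal dataset for named entity recognition (NER). The resulting multilingual NER model supports the 9 languages covered by WikiNEuRal (German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian) and was trained on all 9 languages jointly.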
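
Fine-tuning a subword model such as mBERT on word-level NER tags usually needs a label-alignment step. The model card does not describe the authors' preprocessing, so the following is only a minimal sketch of one common convention: label the first subword of each word and mask the rest. `align_labels` and the sample ids are hypothetical, chosen for illustration. The `word_ids` list has the shape returned by a fast tokenizer's `word_ids()` method, and `-100` is the default index ignored by PyTorch's cross-entropy loss.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level label ids to subword tokens.

    `word_ids` has one entry per token: None for special tokens,
    otherwise the index of the word the token belongs to.
    Only the first subword of each word keeps its label; the others
    receive `ignore_index` so the loss skips them.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned


# Hypothetical input: 3 words, the first split into two subwords,
# wrapped in [CLS]/[SEP] special tokens. Label ids are illustrative.
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [1, 0, 5]
print(align_labels(word_ids, word_labels))  # [-100, 1, -100, 0, 5, -100]
```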

If you use the model, please cite this work in your paper:

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone  and
      Maiorca, Valentino  and
      Campolungo, Niccol{\`o}  and
      Cecconi, Francesco  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}

The original repository for the paper is available at https://github.com/Babelscape/wikineural.

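✨ Main features

Multilingual support: covers German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, and Russian (9 languages).
Joint training: trained on all 9 languages together, which improves multilingual handling.
Data creation: the training corpora are built by combining neural and knowledge-based methods, which addresses the scarcity of multilingual NER data.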

💻 Usage examples

Basic usage

You can use this model for named entity recognition with the Transformers pipeline.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

# aggregation_strategy="simple" merges subword tokens into whole entity spans
# (it replaces the deprecated grouped_entities=True)
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
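
The aggregated pipeline returns a list of dictionaries with keys such as `entity_group`, `score`, `word`, `start`, and `end`. As a rough sketch of how that output might be summarized, the hypothetical helper `group_entities` below collects entity strings by type and drops low-confidence spans. The `sample` list is hand-written to match the output shape and is not real model output, and the `min_score` threshold is an arbitrary assumption.

```python
from collections import defaultdict


def group_entities(ner_results, min_score=0.5):
    """Group aggregated NER output by entity type, skipping low-score spans."""
    grouped = defaultdict(list)
    for ent in ner_results:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written sample in the shape of aggregated pipeline output (illustrative only).
sample = [
    {"entity_group": "PER", "score": 0.99, "word": "Wolfgang", "start": 11, "end": 19},
    {"entity_group": "LOC", "score": 0.98, "word": "Berlin", "start": 34, "end": 40},
    {"entity_group": "ORG", "score": 0.30, "word": "name", "start": 3, "end": 7},
]
print(group_entities(sample))  # {'PER': ['Wolfgang'], 'LOC': ['Berlin']}
```

With a real pipeline, `group_entities(nlp(example))` would take the place of the sample call.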

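📚 Detailed documentation

Model information

Property	Details
Annotation creators	Machine-generated
Language creators	Machine-generated
Tags	Named entity recognition, sequence tagging model
Dataset	Babelscape/wikineural
Supported languages	German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian, multilingual
License	CC BY-NC-SA 4.0
Task category	Structure prediction
Task ID	Named entity recognition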

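Limitations and bias

This model is trained on WikiNEuRal, a state-of-the-art multilingual NER dataset derived automatically from Wikipedia. As a result, it may not generalize well to all text genres (for example, news). Conversely, models trained only on news articles (for example, only on CoNLL03) score much lower on encyclopedic articles. For a more robust system, we recommend training on WikiNEuRal combined with other datasets (for example, WikiNEuRal + CoNLL).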
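
When combining WikiNEuRal with another corpus, the two label inventories have to agree before the examples are concatenated. The sketch below shows one way to remap label ids by label name. It is a plain-Python illustration under stated assumptions: `remap_labels`, the `TARGET` and `OTHER` label lists, and the toy examples are all hypothetical, and the real label lists should be read from each dataset's features before use.

```python
def remap_labels(examples, source_labels, target_labels):
    """Convert label ids from one label list to another by label name."""
    target_index = {name: i for i, name in enumerate(target_labels)}
    remapped = []
    for ex in examples:
        tags = [target_index[source_labels[t]] for t in ex["ner_tags"]]
        remapped.append({"tokens": ex["tokens"], "ner_tags": tags})
    return remapped


# Illustrative label lists (assumptions, not taken from the model card).
TARGET = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
OTHER = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]  # hypothetical second corpus

wiki_like = [{"tokens": ["Rome", "is", "old"], "ner_tags": [5, 0, 0]}]
news_like = [{"tokens": ["Anna", "left", "Paris"], "ner_tags": [3, 0, 1]}]

# Remap the second corpus into the target label space, then concatenate.
combined = wiki_like + remap_labels(news_like, OTHER, TARGET)
print(combined[1]["ner_tags"])  # [1, 0, 5]
```

Remapping by name rather than by id raises a `KeyError` when the second corpus uses a label the target inventory lacks, which surfaces mismatches early instead of silently training on wrong tags.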