wikineural-multilingual-ner open-source model: free deployment of named entity recognition for 9 languages

Wikineural Multilingual Ner

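Developed by Babelscape

A multilingual named entity recognition model trained on data created by combining neural and knowledge-based methods; supports 9 languages

Sequence labeling

Transformers

Multilingual support #Multilingual NER #Wikipedia adaptation #Knowledge-based enhancement

Downloads: 258.08k

Released: 3/2/2022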

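Model overview

The model is trained on a multilingual NER dataset built automatically from Wikipedia by combining neural and knowledge-based methods, and is designed to recognize named entities in text.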

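Model features

Multilingual support

Named entity recognition in 9 languages, all of them European

Knowledge-based enhancement

The training data draws on Wikipedia-derived knowledge-based information to improve annotation quality, which in turn improves recognition accuracy

Joint training

Trained jointly on all 9 languages to improve generalization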

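Model capabilities

Recognizes person names in text

Recognizes location names in text

Recognizes organization names in text

Processes multilingual text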

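Use cases

Information extraction

Wikipedia text analysis

Extract named entities from Wikipedia articles

Recognizes entities effectively in Wikipedia-style text

Multilingual document processing

Handle named entities in documents that contain several languages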

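🚀 WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER

WikiNEuRal is a multilingual named entity recognition (NER) model built on a high-quality training corpus that was created by combining neural and knowledge-based methods. It is a multilingual language model (mBERT) fine-tuned on the WikiNEuRal dataset, and it supports 9 languages.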

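🚀 Quick start

This is the model card for the EMNLP 2021 paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER. We fine-tuned a multilingual language model (mBERT) for 3 epochs on the WikiNEuRal dataset for named entity recognition (NER). The resulting multilingual NER model supports the 9 languages covered by WikiNEuRal (German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian) and was trained on all 9 languages jointly.

If you use this model, please cite this work in your paper: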

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone  and
      Maiorca, Valentino  and
      Campolungo, Niccol{\`o}  and
      Cecconi, Francesco  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}

The original repository for the paper is available at https://github.com/Babelscape/wikineural.
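
The abstract above reports gains in span-based F1, a metric that counts a predicted entity as correct only when both its boundaries and its type match the gold annotation. The following is a minimal sketch of that idea for a single BIO-tagged sequence. The function names `extract_spans` and `span_f1` and the toy tag sequences are illustrative assumptions, not code from the paper or its repository; a real evaluation would aggregate counts over a whole corpus and may treat malformed tag sequences differently.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    spans = set()
    start, kind = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))  # end index is exclusive
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Exact-match span F1 between a gold and a predicted tag sequence."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the person span matches, the second span has the wrong type.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]
print(span_f1(gold, pred))  # 0.5
```

Here one of two predicted spans is correct and one of two gold spans is found, so precision and recall are both 0.5.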

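✨ Key features

Multilingual support: covers 9 languages (German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian).
Joint training: trained jointly on all 9 languages, which improves multilingual performance.
Data innovation: the training corpus is created by combining neural and knowledge-based methods, which addresses the scarcity of multilingual NER data.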
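
Because a single model is trained jointly on all nine languages, the same callable can in principle be applied to texts in any of them without a separate language-identification step. The sketch below illustrates that pattern. The helper `tag_documents` and the offline stand-in `fake_nlp` are hypothetical names introduced here; in practice `nlp` would be the Transformers pipeline from the usage example on this page, and the stand-in's output is made up so the sketch runs without downloading the model.

```python
def tag_documents(nlp, texts):
    """Run one NER callable over texts in any of the supported languages.

    `nlp` is any callable mapping a string to a list of entity dicts,
    for example a Transformers "ner" pipeline.
    """
    tagged = []
    for text in texts:
        entities = nlp(text)
        tagged.append({
            "text": text,
            "entities": [(e["word"], e["entity_group"]) for e in entities],
        })
    return tagged


# Stand-in for the real pipeline so the sketch runs offline (hypothetical output).
def fake_nlp(text):
    known = {"Berlin": "LOC", "Berlino": "LOC", "Wolfgang": "PER"}
    return [
        {"word": token, "entity_group": known[token]}
        for token in text.split()
        if token in known
    ]


docs = [
    "Wolfgang lives in Berlin",  # English
    "Wolfgang wohnt in Berlin",  # German
    "Wolfgang vive a Berlino",   # Italian
]
for row in tag_documents(fake_nlp, docs):
    print(row)
```

Passing the pipeline in as an argument keeps the helper independent of how the model is loaded.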

💻 Usage example

Basic usage

You can run named entity recognition with this model through the Transformers pipeline.

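from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

# aggregation_strategy="simple" merges word pieces into whole entity spans;
# it replaces the deprecated grouped_entities=True argument.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
example = "My name is Wolfgang and I live in Berlin"

# Each result is a dict with the entity group, score, text, and character offsets.
ner_results = nlp(example)
print(ner_results)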
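
With grouping enabled, the pipeline returns a list of dictionaries, one per entity, with keys such as `entity_group`, `score`, `word`, `start`, and `end`. A minimal sketch of post-processing that output follows. The helper name `group_by_type`, the `min_score` threshold, and the hand-written sample (including its scores) are illustrative assumptions, not actual model output.

```python
from collections import defaultdict


def group_by_type(ner_results, min_score=0.5):
    """Collect entity strings per entity type, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in ner_results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written sample in the shape the pipeline returns (scores are made up).
sample_results = [
    {"entity_group": "PER", "score": 0.99, "word": "Wolfgang", "start": 11, "end": 19},
    {"entity_group": "LOC", "score": 0.98, "word": "Berlin", "start": 34, "end": 40},
    {"entity_group": "ORG", "score": 0.30, "word": "live", "start": 26, "end": 30},
]

print(group_by_type(sample_results))
```

The low-scoring third entry is filtered out by the default threshold; lowering `min_score` keeps it.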

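📚 Detailed documentation

Model information

Attribute	Details
Annotation creators	Machine-generated
Language creators	Machine-generated
Tags	Named entity recognition, sequence tagging model
Dataset	Babelscape/wikineural
Supported languages	German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Russian, multilingual
License	CC BY-NC-SA 4.0
Task category	Structure prediction
Task ID	Named entity recognition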

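Limitations and bias

This model is trained on WikiNEuRal, a state-of-the-art multilingual NER dataset derived automatically from Wikipedia. As a result, it may not generalize well to all text genres (for example, news). Conversely, models trained only on news articles (for example, only on CoNLL03) score much lower on encyclopedic articles. To obtain more robust systems, we recommend training on WikiNEuRal combined with other datasets (for example, WikiNEuRal + CoNLL).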
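
Combining corpora usually requires that their tag ids refer to the same label inventory. The sketch below shows one way to align and merge two token-classification corpora by label name before training. The helpers `remap_tags` and `combine`, the label lists, and the toy sentences are all assumptions made for illustration; check the actual dataset cards for the real label orders before relying on any of them.

```python
def remap_tags(tag_ids, source_labels, target_labels):
    """Convert tag ids from one label inventory to another by label name."""
    target_index = {label: i for i, label in enumerate(target_labels)}
    return [target_index[source_labels[t]] for t in tag_ids]


def combine(corpora, target_labels):
    """Merge several (examples, label_list) corpora into one list using target_labels."""
    merged = []
    for examples, labels in corpora:
        for ex in examples:
            merged.append({
                "tokens": ex["tokens"],
                "ner_tags": remap_tags(ex["ner_tags"], labels, target_labels),
            })
    return merged


# Illustrative label inventories (assumed, not taken from the dataset cards).
LABELS_A = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
LABELS_B = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]  # hypothetical corpus with another id order

wiki_like = [{"tokens": ["Wolfgang", "lives", "in", "Berlin"], "ner_tags": [1, 0, 0, 5]}]
news_like = [{"tokens": ["Berlin", "welcomes", "Wolfgang"], "ner_tags": [1, 0, 3]}]

train = combine([(wiki_like, LABELS_A), (news_like, LABELS_B)], LABELS_A)
print(train)
```

Remapping by name rather than by id raises a `KeyError` if a source label is missing from the target inventory, which surfaces schema mismatches early.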