wikineural-multilingual-ner open-source model: deploy for free and run named entity recognition in 9 languages

Wikineural Multilingual Ner

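Developed by Babelscape

A multilingual named entity recognition model that combines neural networks with knowledge bases and supports 9 languages.

Sequence labeling

Transformers

Multilingual #Multilingual NER #Wikipedia-adapted #Knowledge-base-enhanced

Downloads: 258.08k

Release date: 3/2/2022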

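Model overview

This model is trained on a multilingual NER dataset that was built automatically from Wikipedia by combining neural and knowledge-based methods. It is designed specifically to identify named entities in text.

Model features

Multilingual support

Supports named entity recognition in 9 languages, including the major European languages.

Knowledge-base enhancement

Draws on knowledge-base information from Wikipedia to improve recognition accuracy.

Joint training

Training on all 9 languages jointly improves the model's ability to generalize.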

Model capabilities

Identifies person names in text

Identifies location names in text

Identifies organization names in text

Processes multilingual text
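
As a sequence labeling model, it assigns one tag per token, and entity spans are read off from those tags. The sketch below shows how that decoding could work, assuming a BIO scheme with tags such as `B-PER` and `I-LOC` (an assumption about the label format, not something stated on this page). The helper `bio_to_spans` is hypothetical, and the tags in the example are hand-labeled for illustration rather than produced by the model.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags into (entity_type, start_token, end_token_exclusive) spans."""
    spans = []
    current_type = None
    start = 0
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same type.
        inside = tag.startswith("I-") and current_type == tag[2:]
        if current_type is not None and not inside:
            spans.append((current_type, start, i))
            current_type = None
        if tag.startswith("B-") or (tag.startswith("I-") and current_type is None):
            # Lenient decoding: a stray "I-" tag opens a new span.
            current_type = tag[2:]
            start = i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Hand-labeled example (assumed tag names, not model output).
tokens = ["My", "name", "is", "Wolfgang", "and", "I", "live", "in", "Berlin"]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]

spans = bio_to_spans(tags)
entities = [(etype, " ".join(tokens[s:e])) for etype, s, e in spans]
print(entities)
```

In practice the Transformers pipeline performs this grouping itself, so a helper like this is mainly useful when working with raw per-token predictions.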

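Use cases

Information extraction

Wikipedia text analysis

Extract named entities from Wikipedia articles

Identifies entities effectively in Wikipedia-style text.

Multilingual document processing

Process named entities in documents that contain multiple languages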
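
Long or mixed-language documents usually have to be fed to the model in pieces, because mBERT-based models accept at most 512 tokens per input. The sketch below shows one way to do this: tag a document line by line and shift each entity's character offsets back into document coordinates. `tag_document` and `fake_tagger` are hypothetical names. `fake_tagger` is only a stand-in with hard-coded names so the example runs without downloading the model; in real use a pipeline object would be passed as `tag_fn` instead.

```python
from typing import Callable, Dict, List

Entity = Dict[str, object]


def tag_document(text: str, tag_fn: Callable[[str], List[Entity]]) -> List[Entity]:
    """Tag a document line by line and return entities with document-level offsets."""
    entities: List[Entity] = []
    offset = 0
    # keepends=True keeps line breaks so the running offset stays exact.
    for line in text.splitlines(keepends=True):
        if line.strip():
            for ent in tag_fn(line):
                shifted = dict(ent)
                shifted["start"] = ent["start"] + offset
                shifted["end"] = ent["end"] + offset
                entities.append(shifted)
        offset += len(line)
    return entities


def fake_tagger(chunk: str) -> List[Entity]:
    """Stand-in for the NER pipeline: looks up a few hard-coded names."""
    known = {"Wolfgang": "PER", "Berlin": "LOC", "Maria": "PER", "Madrid": "LOC"}
    found = []
    for name, label in known.items():
        pos = chunk.find(name)
        if pos != -1:
            found.append(
                {"entity_group": label, "word": name, "start": pos, "end": pos + len(name)}
            )
    return found


document = (
    "My name is Wolfgang and I live in Berlin.\n"
    "Me llamo Maria y vivo en Madrid.\n"
)
entities = tag_document(document, fake_tagger)
for ent in entities:
    # Offsets now index into the full document, not the individual line.
    print(ent["entity_group"], document[ent["start"]:ent["end"]])
```

Splitting on line breaks is a simplification. A line that is itself longer than the model's input limit would still need sentence-level or window-based splitting.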

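🚀 WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER

This model accompanies the Findings of EMNLP 2021 paper "WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER". A multilingual language model (mBERT) was fine-tuned for 3 epochs on the WikiNEuRal dataset for multilingual named entity recognition (NER). The resulting multilingual NER model supports the 9 languages covered by WikiNEuRal (de, en, es, fr, it, nl, pl, pt, ru) and was trained on all of them jointly.

🚀 Quick start

You can use this model with the Transformers pipeline.

💻 Usage example

Basic usage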

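from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("Babelscape/wikineural-multilingual-ner")
model = AutoModelForTokenClassification.from_pretrained("Babelscape/wikineural-multilingual-ner")

# aggregation_strategy="simple" merges word pieces into whole entity spans;
# it replaces the deprecated grouped_entities=True argument.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
example = "My name is Wolfgang and I live in Berlin"

# Each result is a dict with entity_group, score, word, start, and end.
ner_results = nlp(example)
print(ner_results)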
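
When entity grouping is enabled, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. The following sketch shows one way to post-process such a list by dropping low-confidence predictions and collecting the remaining mentions by entity type. `group_entities` and the `min_score` threshold are hypothetical, and `sample_results` is hand-written in the pipeline's output shape with placeholder scores, not actual model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect entity mentions by type, keeping only predictions above min_score."""
    grouped = defaultdict(list)
    for ent in results:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written placeholder data in the pipeline's output format (scores are made up).
sample_results = [
    {"entity_group": "PER", "score": 0.9, "word": "Wolfgang", "start": 11, "end": 19},
    {"entity_group": "LOC", "score": 0.9, "word": "Berlin", "start": 34, "end": 40},
    {"entity_group": "ORG", "score": 0.2, "word": "live", "start": 26, "end": 30},
]

by_type = group_entities(sample_results, min_score=0.5)
print(by_type)
```

The threshold is a tuning choice. A higher value trades recall for precision and is worth calibrating on text from the target domain.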

📚 Documentation

Citation

If you use this model, please cite this work in your paper:

@inproceedings{tedeschi-etal-2021-wikineural-combined,
    title = "{W}iki{NE}u{R}al: {C}ombined Neural and Knowledge-based Silver Data Creation for Multilingual {NER}",
    author = "Tedeschi, Simone  and
      Maiorca, Valentino  and
      Campolungo, Niccol{\`o}  and
      Cecconi, Francesco  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.215",
    pages = "2521--2533",
    abstract = "Multilingual Named Entity Recognition (NER) is a key intermediate task which is needed in many areas of NLP. In this paper, we address the well-known issue of data scarcity in NER, especially relevant when moving to a multilingual scenario, and go beyond current approaches to the creation of multilingual silver data for the task. We exploit the texts of Wikipedia and introduce a new methodology based on the effective combination of knowledge-based approaches and neural models, together with a novel domain adaptation technique, to produce high-quality training corpora for NER. We evaluate our datasets extensively on standard benchmarks for NER, yielding substantial improvements up to 6 span-based F1-score points over previous state-of-the-art systems for data creation.",
}

The original repository for the paper is available at https://github.com/Babelscape/wikineural.

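🔧 Technical details

This model is trained on WikiNEuRal, a state-of-the-art multilingual NER dataset derived automatically from Wikipedia. It may therefore not generalize well to every text genre (for example, news). Conversely, models trained only on news articles (for example, only on CoNLL03) have been shown to score much lower on encyclopedic articles. To build a more robust system, we recommend training on a combination of WikiNEuRal and other datasets (for example, WikiNEuRal + CoNLL).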
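
Combining two NER datasets only works if their tag ids mean the same thing, so label ids from the second dataset may need to be remapped by name before merging. The sketch below illustrates that step with plain Python lists. `build_id_remap`, `merge_examples`, both label lists, and the `tokens` / `ner_tags` field names are assumptions for illustration. The real label order and schema of each dataset should be checked on its dataset card.

```python
from typing import Dict, List


def build_id_remap(source_labels: List[str], target_labels: List[str]) -> List[int]:
    """Return a list mapping each source label id to the target id with the same name."""
    target_index = {name: i for i, name in enumerate(target_labels)}
    missing = [name for name in source_labels if name not in target_index]
    if missing:
        raise ValueError(f"Labels missing from target scheme: {missing}")
    return [target_index[name] for name in source_labels]


def merge_examples(primary: List[Dict], secondary: List[Dict], remap: List[int]) -> List[Dict]:
    """Append secondary examples to primary ones after remapping their tag ids."""
    merged = list(primary)
    for example in secondary:
        merged.append(
            {
                "tokens": example["tokens"],
                "ner_tags": [remap[tag] for tag in example["ner_tags"]],
            }
        )
    return merged


# Assumed label orders, for illustration only.
target_labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
source_labels = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]

remap = build_id_remap(source_labels, target_labels)

# Tiny hand-written examples standing in for the two datasets.
primary = [{"tokens": ["Wolfgang", "lives", "in", "Berlin"], "ner_tags": [1, 0, 0, 5]}]
secondary = [{"tokens": ["Berlin", "is", "large"], "ner_tags": [1, 0, 0]}]

merged = merge_examples(primary, secondary, remap)
print(remap)
print(merged[1])
```

If both datasets already share the same label order, the remap reduces to the identity and the merge is a plain concatenation.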

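📄 License

The contents of this repository are restricted to non-commercial research purposes only, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). Copyright of the dataset contents and models belongs to the original copyright holders.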

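Information table

Property	Details
Annotation creators	Machine-generated
Language creators	Machine-generated
Tags	Named entity recognition, sequence tagger model
Datasets	Babelscape/wikineural
Languages	de, en, es, fr, it, nl, pl, pt, ru, multilingual
Task category	Structure prediction
Task ID	Named entity recognition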