keyphrase-extraction-kbir-inspec: an open-source keyphrase extraction model that identifies keyphrases in text with high accuracy

Keyphrase Extraction Kbir Inspec

Developed by ml6team

A keyphrase extraction model obtained by fine-tuning the KBIR pretrained model on the Inspec dataset. It identifies keyphrases in text using sequence labeling.

Sequence labeling

Transformers

English | Open-source license: MIT | #keyphrase extraction from scientific papers #KBIR pretraining architecture #sequence labeling

Downloads: 22.12k

Release date: 3/29/2022

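Model overview

This model uses a Transformer architecture to frame keyphrase extraction as a token classification problem, and extracts keyphrases from the abstracts of English scientific papers.

Model features

Multi-task pretraining framework

Joint optimization of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC)

Sequence labeling approach

Casts keyphrase extraction as a BIO tag sequence prediction problem, which captures the boundaries of each keyphrase

Domain specialization

Fine-tuned on the Inspec dataset of computer science papers, which makes it well suited to academic text analysis

Model capabilities

English keyphrase extraction

Semantic analysis of academic text

Capturing long-range contextual dependencies

Use cases

Academic research

Automatic keyphrase indexing of paper abstracts

Automatically extracts core terms from scientific paper abstracts as a replacement for manual indexing

Reaches an F1@M of 0.564 on the Inspec test set, with much higher efficiency than conventional methods

Information retrieval

Building indexes of academic literature

Generates standardized keyphrase indexes for literature databases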

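🚀 Keyphrase extraction model: KBIR-inspec

Keyphrase extraction is a text analysis technique that extracts the important keyphrases from a document. With these keyphrases, a reader can grasp the content of a text quickly and easily without reading all of it. Keyphrase extraction was originally done mainly by human annotators, who read the text in detail and wrote down the most important keyphrases. With large numbers of documents, this process takes a lot of time. This is where artificial intelligence helps. Classical machine learning methods, which rely on statistical and linguistic features, are currently widely used for extraction, but deep learning makes it possible to capture the meaning of a text better than these classical methods.

🚀 Quick start

This keyphrase extraction model is specialized for abstracts of scientific papers and performs well in that domain. A usage example follows.

Basic usage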

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
model_name = "ml6team/keyphrase-extraction-kbir-inspec"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

# Output
['Artificial Intelligence' 'Keyphrase extraction' 'deep learning'
 'linguistic features' 'machine learning' 'semantic meaning'
 'text analysis']

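✨ Key features

A high-accuracy model specialized for extracting keyphrases from scientific paper abstracts.
Uses deep learning to capture the meaning of text better than classical machine learning methods.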

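📦 Installation

To use this model, install the required libraries with the command below. Loading the model also requires a deep learning backend supported by Transformers, such as PyTorch.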

pip install transformers datasets numpy

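📚 Documentation

Model description

This model uses KBIR as its base model and is fine-tuned on the Inspec dataset. KBIR (Keyphrase Boundary Infilling with Replacement) is a pretrained model trained in a multi-task learning setup that optimizes a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC). For details on the architecture, see the KBIR paper.

The keyphrase extraction model is a Transformer model fine-tuned as a token classification problem, in which each word in the document is classified as being part of a keyphrase or not.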

Label	Description
B-KEY	Beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside a keyphrase
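
As an illustration of how these three labels encode keyphrase boundaries, the following minimal sketch groups word-level tags back into phrases. The function name `decode_bio` and the example sentence tags are hypothetical and are not part of the model card; the model itself predicts tags per subword token, so this only shows the labeling scheme.

```python
def decode_bio(words, tags):
    """Group words into keyphrases from B-KEY / I-KEY / O tags (illustrative helper)."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-KEY":
            # A new keyphrase starts; close the previous one if it is still open.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-KEY" and current:
            # Continue the currently open keyphrase.
            current.append(word)
        else:
            # "O", or a stray I-KEY with no open phrase: close any open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-written tags for demonstration only (not model output).
words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis", "."]
tags = ["B-KEY", "I-KEY", "O", "O", "O", "O", "B-KEY", "I-KEY", "O"]
print(decode_bio(words, tags))
# ['Keyphrase extraction', 'text analysis']
```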

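Training dataset

Inspec is a keyphrase extraction/generation dataset consisting of 2000 English scientific papers from the domains of Computers and Control and Information Technology, published between 1998 and 2002. The keyphrases were annotated by professional indexers or editors. For details, see the paper.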

Training procedure

Training parameters

Parameter	Value
Learning rate	1e-4
Epochs	50
Early stopping patience	3
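
To make the interaction between the epoch limit and the early stopping patience concrete, here is a small sketch of the stopping rule in plain Python. It assumes that validation loss is the monitored quantity, which the model card does not state; the function name and the loss values are hypothetical.

```python
def train_with_early_stopping(val_losses, max_epochs=50, patience=3):
    """Return (epochs_run, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # No improvement for `patience` consecutive epochs: stop.
                break
    return epochs_run, best_epoch


# Hypothetical validation losses, one per epoch.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.50]
print(train_with_early_stopping(losses))
# (6, 3)
```

With a patience of 3, training stops after epoch 6 in this example, so the later improvement at epoch 7 is never reached. The limit of 50 epochs is therefore an upper bound rather than the expected training length.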

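Preprocessing

The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only steps left are tokenization and realigning the labels so that they match the resulting subword tokens.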

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR", add_prefix_space=True)
max_length = 512

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    """Tokenize pre-split documents and realign the word-level BIO labels to subword tokens."""
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled as outside a keyphrase
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword token of a new word: keep the word's label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later subword tokens of the same word: "B" becomes "I"
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
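
The label realignment above can be hard to follow without seeing concrete values. The sketch below applies the same three rules to a hand-written `word_ids` list, so it runs without downloading the tokenizer or the dataset. The subword split shown in the comment is hypothetical, and the helper `align_labels` indexes labels directly by word id rather than with a running counter, which gives the same result when no word is skipped.

```python
lbl2idx = {"B": 0, "I": 1, "O": 2}


def align_labels(word_ids, word_labels):
    """Map word-level B/I/O labels onto subword tokens (illustrative helper)."""
    aligned = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            # Special tokens (and padding) are labeled "O".
            aligned.append(lbl2idx["O"])
        elif wid != prev_wid:
            # First subword of a word keeps the word's own label.
            aligned.append(lbl2idx[word_labels[wid]])
            prev_wid = wid
        else:
            # Later subwords of the same word: "B" becomes "I", others are kept.
            label = word_labels[wid]
            aligned.append(lbl2idx["I" if label == "B" else label])
    return aligned


# Words: ["Keyphrase", "extraction", "works"] with labels B, I, O.
# Hypothetical subword split: <s> Key phrase extraction works </s>
word_ids = [None, 0, 0, 1, 2, None]
word_labels = ["B", "I", "O"]
print(align_labels(word_ids, word_labels))
# [2, 0, 1, 1, 2, 2]
```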

Postprocessing (without the pipeline function)

If you do not use the pipeline function, you must filter the tokens labeled B and I, concatenate them into keyphrases, and strip unnecessary spaces.

import numpy as np

# Define post-processing functions (idx2label is the mapping defined in the preprocessing code)
def concat_tokens_by_tag(keyphrases):
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])

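Evaluation results

The standard evaluation metrics are precision, recall, and F1-score @k,m, where k denotes the first k predicted keyphrases and m denotes the average number of predicted keyphrases. The model achieves the following results on the Inspec test set.

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M
Inspec test set	0.53	0.47	0.46	0.36	0.58	0.41	0.58	0.60	0.56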
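
For readers unfamiliar with these metrics, the following sketch computes precision, recall, and F1 at a cutoff k for a single document. It is not the evaluation script behind the table: it assumes exact matching after lowercasing, treats `k=None` as "use all predictions", and divides precision by the number of predictions actually kept. Real evaluations often add stemming and average over all test documents. The function name and example phrases are hypothetical.

```python
def prf_at_k(predicted, gold, k=None):
    """Precision, recall and F1 over the first k predictions (all predictions if k is None)."""
    top = predicted if k is None else predicted[:k]
    gold_set = {g.lower() for g in gold}
    # Count predictions that exactly match a gold keyphrase (case-insensitive).
    correct = sum(1 for p in top if p.lower() in gold_set)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical ranked predictions and gold keyphrases for one document.
predicted = ["keyphrase extraction", "text analysis", "human annotators", "deep learning"]
gold = ["keyphrase extraction", "deep learning", "machine learning"]

p, r, f1 = prf_at_k(predicted, gold, k=2)
print(round(p, 3), round(r, 3), round(f1, 3))
# 0.5 0.333 0.4
```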

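🔧 Technical details

The KBIR paper on which this model builds introduces a new pretraining objective, Keyphrase Boundary Infilling with Replacement (KBIR). It reports gains over the previous state of the art of up to 9.26 F1 points on keyphrase extraction. For the generation setting, the paper also introduces KeyBART, a new pretraining setup for BART, which improves over the previous state of the art on keyphrase generation by up to 4.33 points in F1@M.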