keyphrase-extraction-distilbert-inspec open-source model - free keyphrase extraction from abstracts of English scientific papers

Keyphrase Extraction Distilbert Inspec

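Developed by ml6team

An English keyphrase extraction model based on DistilBERT that performs well on abstracts of scientific papers.

Sequence labeling

Transformers

English | Open-source license: MIT | #English keyphrase extraction #Scientific paper abstracts #DistilBERT fine-tuning

Downloads: 22.07k

Released: 3/25/2022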

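Model overview

This model fine-tunes DistilBERT for keyphrase sequence labeling. It automatically extracts the important keyphrases from a document, which makes it useful for getting a quick overview of a text.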

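Model features

Domain-specific

Optimized for abstracts of scientific papers; performs best in the computers and control domain

Lightweight architecture

A distilled model based on DistilBERT that lowers compute requirements while largely preserving performance

Sequence labeling approach

Uses the BIO labeling scheme to mark the boundaries of each keyphrase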

Model capabilities

English keyphrase extraction

Scientific literature analysis

Capturing semantic information

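Use cases

Academic research

Abstract analysis

Automatically extracts the core concepts of a research paper as keyphrases

Achieves an F1@M of 0.49

Information retrieval

Document index construction

Automatically generates search keywords for large document collections

Reported 90% efficiency gain compared with manual labeling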

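🚀 Keyphrase extraction model: distilbert-inspec

Keyphrase extraction is a text analysis technique that extracts the important keyphrases from a document. With these keyphrases, a reader can grasp the content of a text quickly without reading it in full. Keyphrase extraction was originally done mainly by human annotators, who read each document in detail and wrote down its most important keyphrases. The drawback is that this process takes a lot of time when there are many documents.

This is where artificial intelligence comes in 🤖. Classical machine learning methods, which rely on statistical and linguistic features, are widely used for extraction. Deep learning can capture the semantic meaning of a text better than these classical methods: classical methods look at the frequency, occurrence, and order of words, whereas neural approaches capture long-range semantic dependencies and the context of words in a text.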

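🚀 Quick start

This keyphrase extraction model is domain-specific and performs very well on abstracts of scientific papers. Usage is shown below.

✨ Key features

Specialized for extracting keyphrases from abstracts of scientific papers.
Uses deep learning to capture the semantic meaning of a document.

📦 Installation

The model requires the transformers library, which can be installed with the command below. The usage example also imports numpy and needs a deep learning backend such as PyTorch.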

pip install transformers

💻 Usage example

Basic usage

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.FIRST,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
model_name = "ml6team/keyphrase-extraction-distilbert-inspec"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

Output

['artificial intelligence' 'classical machine learning' 'deep learning'
 'keyphrase extraction' 'linguistic features' 'statistical'
 'text analysis']
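The output above is in alphabetical order because np.unique returns its result sorted. If the order in which keyphrases first appear in the document matters, a small helper can deduplicate without sorting. This is a minimal sketch, not part of the model card: unique_in_order is a hypothetical name, and it assumes the pipeline returns entities in document order.

```python
def unique_in_order(phrases):
    """Strip whitespace and drop duplicates, keeping first-occurrence order."""
    cleaned = (phrase.strip() for phrase in phrases)
    # dict preserves insertion order, so it works as an ordered set here
    return list(dict.fromkeys(phrase for phrase in cleaned if phrase))


# Hypothetical raw pipeline words, including a duplicate with stray whitespace
raw = ["keyphrase extraction", "text analysis", "keyphrase extraction ", "deep learning"]
ordered = unique_in_order(raw)
print(ordered)
```

To try it, the return statement of postprocess could call unique_in_order on the same list of words instead of np.unique.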

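📚 Documentation

📓 Model description

This model uses DistilBERT as its base model and is fine-tuned on the Inspec dataset.

Keyphrase extraction models are transformer models fine-tuned as a token classification problem, in which each word in the document is classified as being part of a keyphrase or not.

Label	Description
B-KEY	At the beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside a keyphrase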
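To make the labeling scheme concrete, the sketch below assigns these three labels to the words of a sentence, given its gold keyphrases. It is an illustration only: bio_tag is a hypothetical helper, and the case-insensitive exact matching on whitespace-split words is an assumption, not how the Inspec annotations were produced.

```python
def bio_tag(words, keyphrases):
    """Label each word B-KEY, I-KEY, or O by matching keyphrases against the words."""
    lowered = [word.lower() for word in words]
    tags = ["O"] * len(words)
    for phrase in keyphrases:
        target = phrase.lower().split()
        n = len(target)
        if n == 0:
            continue  # ignore empty keyphrases
        for start in range(len(words) - n + 1):
            window_is_free = all(tag == "O" for tag in tags[start:start + n])
            if lowered[start:start + n] == target and window_is_free:
                tags[start] = "B-KEY"              # first word of the keyphrase
                for k in range(start + 1, start + n):
                    tags[k] = "I-KEY"              # remaining words of the keyphrase
    return tags


words = "Keyphrase extraction is a technique in text analysis".split()
tags = bio_tag(words, ["keyphrase extraction", "text analysis"])
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

The model is trained to predict such a label for every token, and the keyphrases are read back from runs of B-KEY followed by I-KEY.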

Kulkarni, Mayank, Debanjan Mahata, Ravneet Arora, and Rajarshi Bhowmik. "Learning Rich Representation of Keyphrases from Text." arXiv preprint arXiv:2112.08547 (2021).

Sahrawat, Dhruva, Debanjan Mahata, Haimin Zhang, Mayank Kulkarni, Agniv Sharma, Rakesh Gosangi, Amanda Stent, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. "Keyphrase extraction as sequence labeling using contextualized embeddings." In European Conference on Information Retrieval, pp. 328-335. Springer, Cham, 2020.

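✋ Intended uses and limitations

🛑 Limitations

This keyphrase extraction model is highly domain-specific and performs very well on abstracts of scientific papers. Using it in other domains is not recommended, but you are free to try.
It only works for documents in English.

📚 Training dataset

Inspec is a keyphrase extraction/generation dataset consisting of 2,000 English scientific papers from the domains of computers and control and information technology, published between 1998 and 2002. The keyphrases were annotated by professional indexers or editors.

For more information, see the paper.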

👷‍♂️ Training procedure

Training parameters

Parameter	Value
Learning rate	1e-4
Epochs	50
Early stopping patience	3
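The interaction between the epoch limit and the early stopping patience can be sketched in plain Python. This is a minimal illustration under stated assumptions: the source does not say which metric was monitored, so validation loss is assumed here, and epochs_until_stop and the loss values are hypothetical.

```python
def epochs_until_stop(val_losses, max_epochs=50, patience=3):
    """Return (epochs actually run, best validation loss) under early stopping."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0   # improvement resets the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # patience exhausted, stop training
    return epochs_run, best_loss


# Hypothetical per-epoch validation losses
losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5]
result = epochs_until_stop(losses)
print(result)
```

With a patience of 3, training stops after three consecutive epochs without improvement, so the final value in this hypothetical list is never reached.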

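Preprocessing

The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only steps left are tokenization and realigning the labels so that they line up with the right subword tokens.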

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    # Tokenize the pre-split words; pad and truncate each document to max_length
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        # word_ids maps each token to the index of its source word (None for special tokens)
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled "O"
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword of a new word keeps the word's label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later subwords of a "B" word become "I"; other labels are copied
                label = existing_label_ids[i]
                adjusted_label_ids.append(lbl2idx["I" if label == "B" else label])

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
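The realignment performed by the preprocessing function above can be traced on a small hand-written example, without downloading a tokenizer. This is a sketch only: align_labels is a hypothetical helper that mirrors the same rules, and the word_ids list is an assumed subword split, not the real output of the DistilBERT tokenizer.

```python
lbl2idx = {"B": 0, "I": 1, "O": 2}


def align_labels(word_ids, word_labels):
    """Expand word-level B/I/O labels to one label id per subword token."""
    aligned = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            aligned.append(lbl2idx["O"])              # special or padding token
        elif wid != prev_wid:
            aligned.append(lbl2idx[word_labels[wid]])  # first subword of a word
            prev_wid = wid
        else:
            label = word_labels[wid]
            # a continuation of a "B" word is inside the keyphrase, so it becomes "I"
            aligned.append(lbl2idx["I" if label == "B" else label])
    return aligned


# Words: "keyphrase", "extraction", "works" with word-level labels B, I, O.
# Assumed tokens: [CLS] key ##ph ##rase extraction works [SEP] [PAD]
word_ids = [None, 0, 0, 0, 1, 2, None, None]
aligned = align_labels(word_ids, ["B", "I", "O"])
print(aligned)
```

Only the first subword of "keyphrase" keeps the B label; its continuation pieces are relabeled I so that each keyphrase still has exactly one B token.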

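Postprocessing (without the pipeline function)

If you do not use the pipeline function, you must filter out the tokens labeled B and I. Each B and the I tokens that follow it are then merged into a keyphrase. Finally, strip each keyphrase to remove unnecessary whitespace.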

import numpy as np

# Define post-processing functions
def concat_tokens_by_tag(keyphrases):
    # Group token ids into keyphrases: "B" opens a new group, "I" extends the latest one
    keyphrase_tokens = []
    for token_id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([token_id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[-1].append(token_id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only tokens predicted as "B" or "I" (idx2label is defined in the preprocessing code)
    keyphrases_list = [
        (token_id, idx2label[label])
        for token_id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    # Decode each group of token ids back into text
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
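One property of the code above is worth knowing: extract_keyphrases discards O tokens before concat_tokens_by_tag runs, so an I token that follows a gap is appended to the previous keyphrase even though the two are not adjacent in the text. If that is not wanted, a position-aware merge is one option. The sketch below is an alternative under that assumption, not the model card's method, and merge_contiguous is a hypothetical name.

```python
def merge_contiguous(labels):
    """Group token positions into keyphrases; an "O" label closes the open keyphrase."""
    spans = []
    current = None
    for position, label in enumerate(labels):
        if label == "B":
            current = [position]       # start a new keyphrase
            spans.append(current)
        elif label == "I" and current is not None:
            current.append(position)   # extend the keyphrase that is still open
        else:
            current = None             # "O" or an orphan "I": nothing stays open
    return spans


# Positions 1-2 form a keyphrase, position 5 is an orphan "I", position 6 starts another
labels = ["O", "B", "I", "O", "O", "I", "B"]
spans = merge_contiguous(labels)
print(spans)
```

The resulting positions can be used to index into input_ids before calling tokenizer.batch_decode. For the same labels, the filter-then-merge approach would attach the token at position 5 to the first keyphrase.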