keyphrase-extraction-distilbert-openkp: Open-Source English Keyphrase Extraction Model

Keyphrase Extraction Distilbert Openkp

Developed by ml6team

An English keyphrase extraction model based on the DistilBERT architecture and fine-tuned on the OpenKP dataset. It automatically identifies keyphrases in text.

Sequence Labeling

Transformers

Language: English
License: MIT (open source)
Tags: #English keyword extraction #Web content analysis #Sequence labeling model

Downloads: 32

Release date: 3/25/2022

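Model Overview

The model analyzes text and automatically extracts its important keyphrases, helping users grasp the core content of a document without reading it in full. It suits scenarios such as document summarization and information retrieval.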

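Model Features

Efficient keyword extraction

Extracts keyphrases from text quickly and accurately, which makes document processing more efficient.

Deep learning support

Uses a neural network architecture, which captures semantic information and contextual relationships in text better than traditional methods.

Lightweight model

Built on the DistilBERT architecture, which lowers compute requirements while maintaining performance.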

Model Capabilities

Automatic keyword extraction

Text semantic analysis

Document content summarization

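Use Cases

Information processing

Document summary generation

Automatically extracts key information from a document to produce a concise summary

Helps users quickly grasp a document's core content

Search engine optimization

Extracts keywords from web content for SEO

Can help improve the relevance ranking of web pages in search results

Content analysis

News hot-topic analysis

Extracts keywords from news articles to identify hot topics

Supports media monitoring and trend analysis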

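🚀 Keyphrase Extraction Model: distilbert-openkp

Keyphrase extraction is a text analysis technique that extracts the important keyphrases from a document, so that readers can understand its content quickly and easily without reading it in full.

🚀 Quick Start

Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text very quickly and easily without reading it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents, this process can take a lot of time ⏳.

Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. Now with deep learning, it is possible to capture the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency, occurrence and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies and context of words in a text.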

✨ Key Features

📓 Model Description

This model uses [DistilBERT](https://huggingface.co/distilbert-base-uncased) as its base model and is fine-tuned on the OpenKP dataset.

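Keyphrase extraction models are transformer models fine-tuned as a token classification problem, where each word in the document is classified as being part of a keyphrase or not.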

| Label | Description |
| --- | --- |
| B-KEY | At the beginning of a keyphrase |
| I-KEY | Inside a keyphrase |
| O | Outside a keyphrase |
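
To make the labeling scheme concrete, here is a minimal sketch of how word-level B-KEY / I-KEY / O tags map back to keyphrases. The sentence and its tags are hand-written for illustration, not model output, and `group_bio_tags` is a hypothetical helper name rather than part of the model's code.

```python
def group_bio_tags(words, tags):
    """Merge consecutive B-KEY/I-KEY words into keyphrase strings."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-KEY":
            # A new keyphrase starts; close any phrase that is still open
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-KEY" and current:
            # Continue the keyphrase that is currently open
            current.append(word)
        else:
            # "O" (or a stray I-KEY with no open phrase) closes the open phrase
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-labeled example (assumed tags, for illustration only)
words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
tags = ["B-KEY", "I-KEY", "O", "O", "O", "O", "B-KEY", "I-KEY"]
print(group_bio_tags(words, tags))  # ['Keyphrase extraction', 'text analysis']
```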

✋ Intended Uses & Limitations

🛑 Limitations

The number of predicted keyphrases is limited.
Works only for English documents.

❓ How to Use

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.FIRST,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
model_name = "ml6team/keyphrase-extraction-distilbert-openkp"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

# Output
['keyphrase extraction' 'text analysis']

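📦 Installation

📚 Training Dataset

OpenKP is a large-scale, open-domain keyphrase extraction dataset with 148,124 real-world web documents, each paired with 1-3 of its most relevant human-annotated keyphrases.

More information can be found in the paper.

👷‍♂️ Training Procedure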

Training Parameters

| Parameter | Value |
| --- | --- |
| Learning rate | 1e-4 |
| Epochs | 50 |
| Early stopping patience | 3 |
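
The interaction between the epoch limit and the early stopping patience can be sketched in plain Python. This is an illustration under assumptions, not the authors' training loop: the source does not say which metric was monitored, so evaluation loss is assumed here, and `train_with_early_stopping` and the simulated losses are hypothetical.

```python
MAX_EPOCHS = 50  # "Epochs" in the parameter table
PATIENCE = 3     # "Early stopping patience" in the parameter table


def train_with_early_stopping(eval_losses, max_epochs=MAX_EPOCHS, patience=PATIENCE):
    """Return (epochs_run, best_loss) for a sequence of per-epoch eval losses.

    `eval_losses` stands in for a real train-then-evaluate step (assumption).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in eval_losses[:max_epochs]:
        epochs_run += 1
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Stop after `patience` consecutive epochs without improvement
                break
    return epochs_run, best_loss


# Simulated losses: the best value is at epoch 3, then three worse epochs follow
simulated = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
print(train_with_early_stopping(simulated))  # (6, 0.65)
```

With a patience of 3, training stops at epoch 6 here and never reaches the later, better loss of 0.60. That is the usual trade-off of a small patience value.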

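Preprocessing

The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only remaining steps are tokenization and realigning the labels so that they correspond to the right subword tokens.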

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/openkp"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    # Tokenize the pre-split words; pad or truncate every document to max_length
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        # word_ids maps each subword token to its source word (None for special tokens)
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1  # index of the current word in existing_label_ids
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled "O"
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword of a new word: keep the word's own label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later subwords of the same word: "B" becomes "I", other labels are kept
                adjusted_label_ids.append(
                    lbl2idx["I" if existing_label_ids[i] == "B" else existing_label_ids[i]]
                )

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
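
The label realignment can be tried on a toy example without downloading a tokenizer or the dataset. The sketch below is a simplified stand-in for the function above: the `word_ids` list is hand-written (the real subword split depends on the tokenizer), `align_labels` is a hypothetical name, and it indexes labels by word id directly instead of keeping a separate counter.

```python
lbl2idx = {"B": 0, "I": 1, "O": 2}


def align_labels(word_ids, word_labels):
    """Expand word-level B/I/O labels to one label id per subword token."""
    aligned = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS] and [SEP] are labeled "O"
            aligned.append(lbl2idx["O"])
        elif wid != prev_wid:
            # First subword of a word keeps the word's own label
            aligned.append(lbl2idx[word_labels[wid]])
        else:
            # Later subwords of a "B" word become "I"; "I" and "O" stay as they are
            label = word_labels[wid]
            aligned.append(lbl2idx["I" if label == "B" else label])
        prev_wid = wid
    return aligned


# Assumed split: [CLS] key ##phrase extraction works [SEP]
word_labels = ["B", "I", "O"]          # labels for: keyphrase, extraction, works
word_ids = [None, 0, 0, 1, 2, None]    # hand-written, not real tokenizer output
print(align_labels(word_ids, word_labels))  # [2, 0, 1, 1, 2, 2]
```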

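Postprocessing (Without the Pipeline Function)

If you do not use the pipeline function, you must filter out the tokens labeled B and I. Each B and the I tokens that follow it are then merged into a keyphrase. Finally, trim each keyphrase so that no unnecessary spaces remain.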

import numpy as np

# Define post-processing functions (uses idx2label from the preprocessing code)
def concat_tokens_by_tag(keyphrases):
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
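
The same merge-and-trim logic can be followed on a toy input with no model or tokenizer. In this sketch the tokens and labels are hand-written, `merge_wordpieces` is a hypothetical stand-in for `tokenizer.batch_decode` that assumes WordPiece-style `##` continuation markers, and `sorted(set(...))` plays the role of `np.unique`.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens: a leading '##' glues a piece to the previous one."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        else:
            text += (" " if text else "") + piece
    return text


def extract_keyphrases_toy(tokens, labels):
    """Group B/I-labeled tokens into keyphrases, then deduplicate and sort."""
    groups = []
    for token, label in zip(tokens, labels):
        if label == "B":
            groups.append([token])      # a new keyphrase starts
        elif label == "I" and groups:
            groups[-1].append(token)    # extend the most recent keyphrase
        # tokens labeled "O" are skipped
    return sorted({merge_wordpieces(group).strip() for group in groups})


# Hand-written tokens and labels (assumed, not real model predictions)
tokens = ["[CLS]", "key", "##phrase", "extraction", "is", "used", "in",
          "text", "analysis", "and", "text", "analysis", "[SEP]"]
labels = ["O", "B", "I", "I", "O", "O", "O", "B", "I", "O", "B", "I", "O"]
print(extract_keyphrases_toy(tokens, labels))
# ['keyphrase extraction', 'text analysis']
```

As in the functions above, O tokens are dropped before grouping, so an I token that follows O tokens is attached to the most recent B group.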