keyphrase-extraction-kbir-openkp open-source model: accurately extract important keyphrases from English text

Keyphrase Extraction Kbir Openkp

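Developed by ml6team

A keyphrase extraction model based on the KBIR architecture and fine-tuned on the OpenKP dataset. It is used to extract important keyphrases from English text.

Sequence labeling

Transformers

English | Open-source license: MIT | #multi-task learning framework #web page keyphrase extraction #semantic boundary modeling

Downloads: 90

Release date: 6/16/2022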

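Model Overview

This model treats keyphrase extraction as a token classification problem: it extracts keyphrases by deciding whether each word is at the beginning of a keyphrase (B-KEY), inside a keyphrase (I-KEY), or outside a keyphrase (O).

Model Features

Multi-task learning framework

The underlying KBIR model jointly optimizes the loss functions for Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).

Semantic understanding

Compared with traditional word-frequency-based methods, it better captures long-term semantic dependencies and context in text.

Efficient annotation

Automatic keyphrase extraction greatly reduces the time cost of manually annotating large numbers of documents.

Model Capabilities

English keyword extraction

Semantic keyphrase identification

Summarizing document content through extracted keyphrases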

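Use Cases

Text analysis

Rapid document understanding

The extracted keyphrases let readers grasp the core content of a document quickly without reading it in full.

Improves the efficiency of information retrieval.

Content index construction

Automatically generates keyword indexes for large document collections.

Improves search engine effectiveness.

Knowledge management

Academic literature analysis

Extracts core concepts and terminology from research papers.

Speeds up the literature review process.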

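🚀 Keyphrase Extraction Model: KBIR-OpenKP

Keyphrase extraction is a text analysis technique that extracts the important keyphrases from a document. With these keyphrases, people can understand the content of a text quickly and easily without reading it in full. Keyphrase extraction was originally done mainly by human annotators, who read the text in detail and wrote down the most important keyphrases. The drawback is that this process takes a lot of time when working with many documents.

This is where artificial intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. With deep learning, it has become possible to capture the semantic meaning of a text better than these classical methods. Classical methods look at the frequency, occurrence, and order of words in the text, whereas neural approaches can capture long-term semantic dependencies and the context of words.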
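For contrast with the neural approach, here is a minimal sketch of the kind of frequency-based baseline the paragraph above describes. It is not part of this model; the function name `frequency_keywords` and the tiny stopword list are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "from", "are", "for"}

def frequency_keywords(text, top_k=3):
    """Return the top_k most frequent non-stopword tokens in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

sample = (
    "Keyphrase extraction finds keyphrases in a document. "
    "Deep learning improves keyphrase extraction."
)
top = frequency_keywords(sample, top_k=2)
print(top)
```

The baseline counts single words only, so it cannot tell that "keyphrase extraction" forms one phrase, and it ignores context entirely. Those are the gaps a token classification model is meant to close.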

🚀 Quick Start

To use the keyphrase extraction model, run the following code.

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
model_name = "ml6team/keyphrase-extraction-kbir-openkp"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

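✨ Key Features

Automatically extracts important keyphrases from documents.
Uses deep learning to capture semantic dependencies and context in documents.

📦 Installation

To use this model, install the required libraries with the following command. Transformers also needs a deep learning backend such as PyTorch to load the model.

pip install transformers datasets numpy

💻 Usage Examples

Basic usage

# See the Quick Start code above.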

Advanced usage

# Example: fine-tuning the model on your own dataset
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer

# Load the dataset (the "raw" subset, as in the preprocessing code later in this document)
dataset = load_dataset("midas/openkp", "raw")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR")

# Load the model
model = AutoModelForTokenClassification.from_pretrained("ml6team/keyphrase-extraction-kbir-openkp")

# Trainer expects tokenized inputs with aligned labels, not raw word lists.
# `tokenized_dataset` is assumed to come from the preprocessing step described
# later in this document (tokenization plus label realignment via dataset.map).

# Set the training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

# Create the trainer
trainer = Trainer(
    model=model,                                  # the instantiated 🤗 Transformers model to be trained
    args=training_args,                           # training arguments, defined above
    train_dataset=tokenized_dataset['train'],     # tokenized training dataset
    eval_dataset=tokenized_dataset['validation']  # tokenized evaluation dataset
)

# Train the model
trainer.train()

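📚 Documentation

Model description

This model uses KBIR as its base model and is fine-tuned on the OpenKP dataset. KBIR, or Keyphrase Boundary Infilling with Replacement, is a pre-trained model that uses a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC). More details about the architecture can be found in the paper.

The keyphrase extraction model is a transformer model fine-tuned as a token classification problem, in which each word in the document is classified as being part of a keyphrase or not.

Label	Description
B-KEY	At the beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside a keyphrase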
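To make the labeling scheme concrete, the sketch below tags a short sentence with the three labels from the table, given the keyphrases it should contain. The helper `bio_tags` is hypothetical and written only for illustration; it is not how the dataset was annotated.

```python
def bio_tags(words, keyphrases):
    """Tag each word as B-KEY, I-KEY, or O, given keyphrases as strings."""
    tags = ["O"] * len(words)
    lowered = [w.lower() for w in words]
    for phrase in keyphrases:
        target = [w.lower() for w in phrase.split()]
        n = len(target)
        if n == 0:
            continue  # skip empty phrases
        for start in range(len(words) - n + 1):
            if lowered[start:start + n] == target:
                tags[start] = "B-KEY"              # first word of the keyphrase
                for k in range(start + 1, start + n):
                    tags[k] = "I-KEY"              # remaining words of the keyphrase
    return tags

words = "Deep learning improves keyphrase extraction".split()
tags = bio_tags(words, ["deep learning", "keyphrase extraction"])
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

The model learns the inverse mapping: it predicts one of these three labels per token, and keyphrases are then read off as each B-KEY followed by its run of I-KEY tokens.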

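Intended uses and limitations

Limitations

The number of predicted keyphrases is limited.
Only works for English documents.

How to use

See the usage examples above.

Training dataset

OpenKP is a large-scale, open-domain keyphrase extraction dataset containing 148,124 real-world web documents, each with the 1-3 most relevant human-annotated keyphrases. More details can be found in the paper.

Training procedure

Training parameters

Parameter	Value
Learning rate	1e-4
Epochs	50
Early stopping patience	3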
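The table sets a maximum of 50 epochs together with an early stopping patience of 3. The sketch below shows how a patience rule typically ends training early. It assumes validation loss is the monitored metric, which the source does not state, and `stopping_epoch` is a hypothetical helper rather than the training code that was actually used.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops, given per-epoch validation losses."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss                        # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch                   # patience exhausted: stop here
    return len(val_losses)                     # ran through every epoch

# Made-up losses for illustration only.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopped_at = stopping_epoch(losses, patience=3)
print(stopped_at)
```

With these made-up losses, training stops after three consecutive epochs without improvement, so the later improvement at epoch 7 is never reached. That is the trade-off a small patience value makes.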

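Preprocessing

The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only remaining steps are tokenization and realigning the labels so that they correspond to the right subword tokens.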

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/openkp"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled as outside a keyphrase
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword of a new word: take the word's own label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later subwords of the same word: a B label continues as I
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
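The realignment rule in the code above can be seen in isolation on a toy input that needs no tokenizer. In this sketch the `word_ids` list is written by hand and the subword split is hypothetical. The helper `align_labels` indexes labels by word id instead of using a running counter, which gives the same result for this input.

```python
LBL2IDX = {"B": 0, "I": 1, "O": 2}

def align_labels(word_ids, word_labels):
    """Expand word-level B/I/O labels to one label id per subword token."""
    aligned = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            aligned.append(LBL2IDX["O"])               # special or padding token
        elif wid != prev_wid:
            aligned.append(LBL2IDX[word_labels[wid]])  # first subword keeps the word label
            prev_wid = wid
        else:
            label = word_labels[wid]
            # later subwords of a B word continue the phrase as I
            aligned.append(LBL2IDX["I" if label == "B" else label])
    return aligned

# Hypothetical tokenization of "keyphrase extraction":
# <start> | key | phrase | extraction | <end>
word_ids = [None, 0, 0, 1, None]
word_labels = ["B", "I"]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Only the first subword of a word can carry B. Without this rule, a word split into several pieces would appear to start several keyphrases.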

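Postprocessing (without the pipeline function)

If you do not use the pipeline function, you must keep only the tokens labeled B or I. Each B token and the I tokens that follow it are then merged into a keyphrase. Finally, strip the keyphrases to make sure all unnecessary spaces have been removed.

import numpy as np

# Define post_process functions
# (idx2label is the mapping defined in the preprocessing code above)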
def concat_tokens_by_tag(keyphrases):
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
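The merging step can be tried without a model or tokenizer. The sketch below mirrors the grouping logic of `concat_tokens_by_tag` on a toy input, with plain strings standing in for token ids and a simple join standing in for `tokenizer.batch_decode`. The name `merge_bi_tokens` is hypothetical.

```python
def merge_bi_tokens(tagged_tokens):
    """Group (token, label) pairs into phrases: B starts a phrase, I extends the latest one."""
    phrases = []
    for token, label in tagged_tokens:
        if label == "B":
            phrases.append([token])
        elif label == "I" and phrases:
            phrases[-1].append(token)
    return phrases

# Toy input: only B/I tokens remain after filtering; strings stand in for token ids.
tagged = [
    ("deep", "B"), ("learning", "I"),
    ("keyphrase", "B"), ("extraction", "I"),
    ("deep", "B"), ("learning", "I"),
]
phrases = [" ".join(p).strip() for p in merge_bi_tokens(tagged)]
unique_phrases = sorted(set(phrases))  # deduplicate, like np.unique in the code above
print(unique_phrases)
```

Because O tokens are dropped before merging, an I token that follows an O token is appended to the most recent phrase even when the two are not adjacent in the text. Keep this in mind when inspecting unexpected merged phrases.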

🔧 Technical Details

This model is built on the pre-trained KBIR model and fine-tuned on the OpenKP dataset. KBIR uses a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).
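As a rough sketch, a combined multi-task loss of this kind is commonly written as a weighted sum of the individual objectives. The weights below are illustrative placeholders, not values taken from the source; the exact formulation and any additional terms are given in the KBIR paper.

```latex
\mathcal{L}_{\text{KBIR}}(\theta)
  = \lambda_{\text{MLM}}\,\mathcal{L}_{\text{MLM}}(\theta)
  + \lambda_{\text{KBI}}\,\mathcal{L}_{\text{KBI}}(\theta)
  + \lambda_{\text{KRC}}\,\mathcal{L}_{\text{KRC}}(\theta)
```

Here $\theta$ denotes the shared model parameters, so every objective updates the same encoder during pre-training.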