keyphrase-extraction-kbir-inspec: an open-source keyphrase extraction model

Keyphrase Extraction Kbir Inspec

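Developed by ml6team

A keyphrase extraction model fine-tuned on the Inspec dataset from the pretrained KBIR model. It uses sequence labeling to identify keyphrases in text.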

Sequence Labeling

Transformers

Language: English
License: MIT
Tags: #scientific-paper keyphrase extraction #KBIR pretraining architecture #sequence labeling

Downloads: 22.12k

Released: 3/29/2022

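Model Overview

The model uses a Transformer architecture and frames keyphrase extraction as a token classification problem. It extracts key terms from the abstracts of English scientific papers.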

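Model Features

Multi-task pretraining framework

Jointly optimizes Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).

Sequence labeling approach

Casts keyphrase extraction as predicting a sequence of BIO tags, which captures the boundaries of each keyphrase.

Domain specialization

Fine-tuned on the Inspec dataset of computer science papers, which makes it well suited to analyzing academic text.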

Model Capabilities

English keyphrase extraction

Semantic analysis of academic text

Capturing long-range contextual dependencies

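Use Cases

Academic research

Automated keyphrase indexing of paper abstracts

Automatically extracts the core terms from scientific paper abstracts in place of manual indexing.

Reaches an F1@M of 0.564, with a marked efficiency gain over traditional methods.

Information retrieval

Building indexes of academic literature

Generates standardized keyphrase indexes for literature databases.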

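🚀 Keyphrase Extraction Model: KBIR-inspec

This project focuses on keyphrase extraction: pulling the important keyphrases out of a text so that readers can quickly grasp its core content. Built on deep learning, the model performs well on keyphrase extraction, and the keyphrase representations it relies on also benefit other fundamental NLP tasks.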

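🚀 Quick Start

Keyphrase extraction is a text analysis technique that extracts the important keyphrases from a document. With these keyphrases, people can understand the content of a text quickly and easily without reading it in full. Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The drawback is that this process takes a lot of time when there are many documents to handle.

This is where artificial intelligence comes in. Classical machine learning methods that use statistical and linguistic features are currently in wide use for extraction. With deep learning, it is now possible to capture the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency, occurrence, and order of words in the text, whereas neural approaches can capture long-term semantic dependencies and the context of words in a text.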

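✨ Key Features

Built on an advanced model: uses KBIR as the base model, fine-tuned on the Inspec dataset.
Multi-task learning: KBIR uses a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).
Performance gains: in the discriminative setting, the new pretraining objective, Keyphrase Boundary Infilling with Replacement (KBIR), gives large gains over the prior state of the art in keyphrase extraction (up to 9.26 points in F1). In the generative setting, a new pretraining setup for BART, KeyBART, gives gains over the prior state of the art in keyphrase generation (up to 4.33 points in F1@M).
Broad applicability: fine-tuning the pretrained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization yields performance comparable to the state of the art, which indicates that learning rich keyphrase representations does benefit many other fundamental NLP tasks.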


💻 Usage Examples

Basic Usage

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

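# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    """Token-classification pipeline that returns a sorted array of unique keyphrases."""

    def __init__(self, model, *args, **kwargs):
        # `model` is a Hub model ID or a local path; load the matching weights and tokenizer.
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        # Let the parent class merge adjacent keyphrase tokens into groups.
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        # Strip whitespace and drop duplicates; np.unique also sorts the result.
        return np.unique([result.get("word").strip() for result in results])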

Advanced Usage

# Load pipeline
model_name = "ml6team/keyphrase-extraction-kbir-inspec"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

# Output
['Artificial Intelligence' 'Keyphrase extraction' 'deep learning'
 'linguistic features' 'machine learning' 'semantic meaning'
 'text analysis']

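📚 Documentation

Model Description

This model uses KBIR as its base model and is fine-tuned on the Inspec dataset. KBIR, short for Keyphrase Boundary Infilling with Replacement, is a pretrained model that uses a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).

More information about the architecture is available in the KBIR paper (Kulkarni et al., 2021, listed in the references below).

Keyphrase extraction models are transformer models fine-tuned as a token classification problem, in which each word in the document is classified as being part of a keyphrase or not.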

Label	Description
B-KEY	Beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside any keyphrase
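
To make the labeling scheme concrete, here is a minimal sketch of how a B-KEY / I-KEY / O tag sequence maps back to keyphrases. The function name `decode_bio` and the hand-written tags are illustrative assumptions, not model output or part of the original code.

```python
def decode_bio(words, tags):
    """Group words into keyphrases from a B-KEY / I-KEY / O tag sequence."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-KEY":
            # A B-KEY tag closes any open phrase and starts a new one.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-KEY" and current:
            # An I-KEY tag extends the phrase that is currently open.
            current.append(word)
        else:
            # "O" (or an I-KEY with no open phrase) closes the current phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-labeled toy example (hypothetical tags, not produced by the model).
words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
tags = ["B-KEY", "I-KEY", "O", "O", "O", "O", "B-KEY", "I-KEY"]
print(decode_bio(words, tags))  # ['Keyphrase extraction', 'text analysis']
```

This sketch works on whole words for readability; the model itself predicts one tag per sub-word token.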

References:

Kulkarni, Mayank, Debanjan Mahata, Ravneet Arora, and Rajarshi Bhowmik. "Learning Rich Representation of Keyphrases from Text." arXiv preprint arXiv:2112.08547 (2021).
Sahrawat, Dhruva, Debanjan Mahata, Haimin Zhang, Mayank Kulkarni, Agniv Sharma, Rakesh Gosangi, Amanda Stent, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. "Keyphrase extraction as sequence labeling using contextualized embeddings." In European Conference on Information Retrieval, pp. 328-335. Springer, Cham, 2020.

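Intended Uses and Limitations

🛑 Limitations

This keyphrase extraction model is highly domain-specific and performs well on abstracts of scientific papers. It is not recommended for other domains, although you are free to test it.
It only works for English documents.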

❓ How to Use

See the usage examples above.

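Training Dataset

Inspec is a keyphrase extraction/generation dataset consisting of 2000 English scientific papers from the domains of Computers and Control and Information Technology, published between 1998 and 2002. The keyphrases were annotated by professional indexers or editors.

More information is available in the dataset paper.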

Training Procedure

Training Parameters

Parameter	Value
Learning rate	1e-4
Epochs	50
Early stopping patience	3
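
As a rough illustration of how the epoch limit and the early-stopping patience interact, the sketch below stops training once the monitored value has failed to improve for three consecutive epochs. The monitored metric (validation loss), the function name, and the loss values are assumptions for illustration; the source does not say which metric was monitored.

```python
MAX_EPOCHS = 50  # upper bound on training epochs (from the table above)
PATIENCE = 3     # early stopping patience (from the table above)


def run_with_early_stopping(val_losses, max_epochs=MAX_EPOCHS, patience=PATIENCE):
    """Return (epochs_run, best_loss) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return epochs_run, best_loss


# Made-up validation losses: the best value is reached at epoch 3.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
print(run_with_early_stopping(losses))  # (6, 0.65)
```

With a patience of 3, training in this toy run halts after epoch 6 and never sees the lower loss at epoch 7, which is the usual trade-off of early stopping.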

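Preprocessing

The documents in the dataset are already preprocessed into lists of words with their corresponding labels. The only remaining steps are tokenization and realigning the labels so that they correspond to the right sub-word tokens.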

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR", add_prefix_space=True)
max_length = 512

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    # Tokenize the pre-split words; pad or truncate every sample to max_length.
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        # word_ids maps each sub-word token to its word index (None for special tokens).
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled "O".
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First sub-word token of a new word: use the word's own label.
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later sub-word tokens of the same word: "B" becomes "I".
                label = existing_label_ids[i]
                adjusted_label_ids.append(lbl2idx["I" if label == "B" else label])

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
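
The label realignment can be hard to follow without a concrete case, so here is a small self-contained sketch of the same idea on a toy input. The helper name `align_labels` and the `word_ids` list are hypothetical: they stand in for what a tokenizer might return and are not real KBIR tokenizer output. Unlike the code above, this sketch indexes labels directly by word ID instead of keeping a running counter.

```python
lbl2idx = {"B": 0, "I": 1, "O": 2}


def align_labels(word_ids, word_labels, lbl2idx):
    """Expand word-level BIO labels to one label ID per sub-word token."""
    adjusted = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            # Special tokens carry no word; label them "O".
            adjusted.append(lbl2idx["O"])
        elif wid != prev_wid:
            # First sub-word token of a word keeps the word's label.
            adjusted.append(lbl2idx[word_labels[wid]])
            prev_wid = wid
        else:
            # Continuation tokens of a "B" word become "I".
            label = word_labels[wid]
            adjusted.append(lbl2idx["I" if label == "B" else label])
    return adjusted


# Two words, ["Keyphrases", "matter"], labeled ["B", "O"].
# Assume the first word is split into three sub-word tokens (hypothetical split).
word_ids = [None, 0, 0, 0, 1, None]
word_labels = ["B", "O"]
print(align_labels(word_ids, word_labels, lbl2idx))  # [2, 0, 1, 1, 2, 2]
```

Only the first sub-word token keeps the B label, so a keyphrase still has exactly one beginning after tokenization.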

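Postprocessing (without the pipeline function)

If you do not use the pipeline function, you must keep only the tokens labeled B or I. Each B token and the I tokens that follow it are then merged into one keyphrase. Finally, strip the surplus whitespace from the keyphrases.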

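import numpy as np

# Define post_process functions
# Note: `idx2label` is the mapping defined in the preprocessing code above.
def concat_tokens_by_tag(keyphrases):
    """Group (token_id, label) pairs into one list of token IDs per keyphrase."""
    keyphrase_tokens = []
    for token_id, label in keyphrases:
        if label == "B":
            # A B tag starts a new keyphrase.
            keyphrase_tokens.append([token_id])
        elif label == "I":
            # An I tag extends the latest keyphrase; a leading I with no B is dropped.
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[-1].append(token_id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only the tokens predicted as B or I, paired with their label names.
    keyphrases_list = [
        (token_id, idx2label[label])
        for token_id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    # Decode each group of token IDs back into text.
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    # Strip whitespace and return the sorted, de-duplicated keyphrases.
    return np.unique([kp.strip() for kp in extracted_kps])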

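Evaluation Results

Keyphrase extraction is traditionally evaluated with precision, recall, and F1-score @k,m, where k means that only the first k predicted keyphrases are scored and m is the average number of predicted keyphrases.

The model achieves the following results on the Inspec test set:

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M
Inspec test set	0.53	0.47	0.46	0.36	0.58	0.41	0.58	0.60	0.56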
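
For readers unfamiliar with the @k notation, the following is a minimal sketch of how precision, recall, and F1 at a cutoff could be computed for one document. It assumes exact string matching with no stemming or lowercasing, and it treats `k=None` (score every prediction) as a simple stand-in for the @M column. The function name and the example keyphrases are hypothetical; this is not the evaluation script behind the table.

```python
def precision_recall_f1_at_k(predicted, gold, k=None):
    """Score the first k predictions against gold keyphrases (k=None scores all)."""
    top = predicted if k is None else predicted[:k]
    gold_set = set(gold)
    correct = sum(1 for kp in top if kp in gold_set)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up predictions (in ranked order) and gold keyphrases for one document.
predicted = ["deep learning", "text analysis", "neural network", "keyphrase extraction"]
gold = ["deep learning", "keyphrase extraction", "machine learning"]

p, r, f1 = precision_recall_f1_at_k(predicted, gold, k=2)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.5 0.33 0.4
```

Corpus-level scores such as those in the table are typically obtained by averaging per-document values, and published evaluations often normalize keyphrases before matching, so numbers from this sketch are not directly comparable.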

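🔧 Technical Details

This model is fine-tuned from KBIR. The KBIR work introduces new pretraining objectives and setups, namely Keyphrase Boundary Infilling with Replacement (KBIR) and KeyBART, to improve keyphrase extraction and generation. For fine-tuning, the dataset is preprocessed so that labels line up with sub-word tokens, and the extracted keyphrases are cleaned up in postprocessing. The pretrained language models were also fine-tuned on several other NLP tasks, which supports the value of learning rich keyphrase representations.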