🚀 🔑 Keyphrase Extraction Model: KBIR - OpenKP
Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text very quickly and easily without reading it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents, this process can take a lot of time ⏳.

Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. Now, with deep learning, it is possible to capture the semantic meaning of a text even better than with these classical methods. Classical methods look at the frequency, occurrence, and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies and the context of words in a text.
🚀 Quick Start
Import the dependencies
```python
from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np
```
Define the keyphrase extraction pipeline
```python
# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return np.unique([result.get("word").strip() for result in results])
```
Load the pipeline
```python
# Load pipeline
model_name = "ml6team/keyphrase-extraction-kbir-openkp"
extractor = KeyphraseExtractionPipeline(model=model_name)
```
Run inference
```python
# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time.
Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)
print(keyphrases)
```
Output
```
# Output
['keyphrase extraction' 'text analysis']
```
✨ Key Features
- Uses KBIR as its base model and fine-tunes it on the OpenKP dataset.
- KBIR utilizes a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).
- Frames keyphrase extraction as a token classification problem: each word in the document is classified as being part of a keyphrase or not.
📦 Installation
This model is built on Python and the transformers library. You can install the required dependencies with the following command:
```
pip install transformers datasets numpy
```
💻 Usage Examples
Basic Usage
The basic usage is identical to the Quick Start example above: define the `KeyphraseExtractionPipeline`, load it with the model name `ml6team/keyphrase-extraction-kbir-openkp`, and call it on your text.
Advanced Usage
If you do not use the pipeline function, you have to filter out the tokens tagged B and I, merge consecutive B and I tokens into keyphrases, and finally strip unnecessary whitespace from the keyphrases.
```python
# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Define post_process functions
def concat_tokens_by_tag(keyphrases):
    # Group token ids: a "B" tag starts a new keyphrase, an "I" tag
    # extends the most recently started one.
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only the tokens predicted as part of a keyphrase (B or I),
    # then decode each group of token ids back into a string.
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]
    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
```
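A minimal end-to-end sketch of how these functions could be used, reusing the `text` and imports from the Quick Start example; the `torch.no_grad` forward pass and `argmax` decoding are our assumptions rather than documented usage:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "ml6team/keyphrase-extraction-kbir-openkp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Tokenize, predict one label id per token, then decode with the helpers above.
encoded = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**encoded).logits          # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1).tolist()  # one label id per token
print(extract_keyphrases(encoded, predictions, tokenizer, index=0))
```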
📚 Documentation
📓 Model Description
This model uses KBIR as its base model and fine-tunes it on the OpenKP dataset. KBIR, short for Keyphrase Boundary Infilling with Replacement, is a pre-trained model that utilizes a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).

You can find more information about the architecture in this paper.

Keyphrase extraction models are transformer models fine-tuned on a token classification problem: each word in a document is classified as being part of a keyphrase or not.
| Label | Description |
| --- | --- |
| B-KEY | At the beginning of a keyphrase |
| I-KEY | Inside a keyphrase |
| O | Outside a keyphrase |
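As a hypothetical illustration (the sentence and its tags are our own, not taken from the training data), a tagged input might look like this:

```python
# Word-level tags for one sentence: "Keyphrase extraction" and
# "text analysis" are keyphrases, every other word is outside (O).
words = ["Keyphrase", "extraction", "is", "a", "text", "analysis", "technique"]
tags  = ["B-KEY",     "I-KEY",      "O",  "O", "B-KEY", "I-KEY",    "O"]
```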
👷‍♂️ Training Procedure
Training Parameters
| Parameter | Value |
| --- | --- |
| Learning rate | 1e-4 |
| Epochs | 50 |
| Early stopping patience | 3 |
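A minimal training sketch consistent with these parameters, using the `tokenized_dataset` built in the preprocessing step below; the output directory, evaluation strategy, split names, and `num_labels=3` head are assumptions, as the exact training script is not part of this card:

```python
from transformers import (
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

# Assumed setup: a fresh token classification head on the KBIR base model.
model = AutoModelForTokenClassification.from_pretrained("bloomberg/KBIR", num_labels=3)

training_args = TrainingArguments(
    output_dir="kbir-openkp",      # assumption
    learning_rate=1e-4,            # from the table above
    num_train_epochs=50,           # from the table above
    evaluation_strategy="epoch",   # assumption; `eval_strategy` on newer transformers
    save_strategy="epoch",
    load_best_model_at_end=True,   # required for early stopping
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # from the table
)
trainer.train()
```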
Preprocessing
The documents in the dataset are already preprocessed into lists of words with their corresponding labels. The only thing that still needs to be done is tokenization and realigning the labels so that they correspond to the right subword tokens.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/openkp"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []
        for wid in word_ids_list:
            if wid is None:
                # Special tokens (e.g. <s>, </s>, padding) get the "O" label.
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword of a new word: take the word-level label.
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Continuation subword: demote "B" to "I", keep "I"/"O" as-is.
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )
        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
```
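A toy illustration of what the realignment does, with a made-up single-document batch (the exact label sequence depends on how the tokenizer splits each word):

```python
# Hypothetical mini-batch: one document of three words with word-level tags.
sample = {
    "document": [["keyphrase", "extraction", "works"]],
    "doc_bio_tags": [["B", "I", "O"]],
}
out = preprocess_function(sample)
# Every subword inherits its word's tag; continuation subwords of a
# "B" word are demoted to "I", and special/padding tokens become "O".
print(out["labels"][0][:8])
```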
Postprocessing (Without the Pipeline Function)
If you do not use the pipeline function, you have to filter out the tokens tagged B and I, merge each consecutive B and I into a keyphrase, and finally strip unnecessary whitespace from the keyphrases. The `concat_tokens_by_tag` and `extract_keyphrases` functions shown under Advanced Usage above implement exactly these steps.
📚 Training Dataset
OpenKP is a large-scale, open-domain keyphrase extraction dataset with 148,124 real-world web documents, each annotated with the 1-3 most relevant human-annotated keyphrases.

You can read more about it in this paper.
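A quick way to inspect the raw data; the field contents shown are inferred from the column names used in the preprocessing code above, so treat this as a sketch:

```python
from datasets import load_dataset

dataset = load_dataset("midas/openkp", "raw")
sample = dataset["train"][0]
print(sample["document"][:10])      # first ten words of the web document
print(sample["doc_bio_tags"][:10])  # the matching word-level B/I/O tags
```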
📝 Evaluation Results
Traditional evaluation methods are precision, recall, and F1-score @k,m, where k denotes the top k predicted keyphrases and m denotes the average number of predicted keyphrases.
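To make the definition concrete, here is a small worked example with made-up predictions and gold keyphrases:

```python
# Precision/recall/F1 at k, following the definition above.
def f1_at_k(predicted, gold, k):
    topk = predicted[:k]
    tp = len(set(topk) & set(gold))      # correctly predicted keyphrases
    precision = tp / len(topk) if topk else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

predicted = ["keyphrase extraction", "text analysis", "deep learning", "nlp", "ai"]
gold = ["keyphrase extraction", "text analysis"]
print(f1_at_k(predicted, gold, 5))  # (0.4, 1.0, 0.571...)
```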
The model achieves the following results on the OpenKP test set:
| Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenKP test set | 0.13 | 0.38 | 0.19 | 0.07 | 0.38 | 0.11 | 0.45 | 0.38 | 0.39 |
🔧 Technical Details
This model is built on the KBIR pre-trained model and optimizes a combined loss through multi-task learning. Specific preprocessing and post-processing steps ensure that the model extracts keyphrases accurately: the preprocessing stage tokenizes the documents and realigns the labels with the subword tokens, while the post-processing stage merges the predicted tokens back into keyphrases.
📄 License
This project is licensed under the MIT license.
🚨 Issues
If you have any questions, feel free to start a discussion in the Community tab.