keyphrase-extraction-distilbert-openkp: an open-source English keyphrase extraction model

Keyphrase Extraction Distilbert Openkp

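Developed by ml6team

An English keyphrase extraction model based on the DistilBERT architecture and fine-tuned on the OpenKP dataset to identify keyphrases in text automatically.

Sequence labeling

Transformers

Language: English | License: MIT | #English keyphrase extraction #Web content analysis #Sequence labeling model

Downloads: 32

Released: 3/25/2022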

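Model overview

The model analyzes a text and automatically extracts its most important keyphrases, so that readers can grasp the core content of a document without reading it in full. It is suited to scenarios such as document summarization and information retrieval.

Model features

Efficient keyphrase extraction

Extracts keyphrases from text quickly and accurately, which speeds up document processing.

Deep learning based

Uses a neural network architecture that captures the semantics and context of a text better than traditional methods.

Lightweight model

Built on the DistilBERT architecture, which lowers compute requirements while largely preserving performance.

Model capabilities

Automatic keyphrase extraction

Semantic text analysis

Document content summarization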

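Use cases

Information processing

Document summary generation

Automatically extracts the key information of a document to produce a concise summary

Helps users quickly grasp the core content of a document

Search engine optimization

Extracts keywords from web page content for SEO

Improves the relevance ranking of web pages in search results

Content analysis

News trend analysis

Extracts keywords from news articles to identify trending topics

Supports media monitoring and trend analysis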

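🚀 Keyphrase extraction model: distilbert-openkp

Keyphrase extraction is a text analysis technique that extracts the important keyphrases from a document. With these keyphrases, readers can understand the content of a text quickly and easily without reading it in full. Keyphrase extraction was originally done mainly by human annotators, who read the text in detail and then wrote down the most important keyphrases. The drawback is that this process takes a lot of time when there are many documents to process ⏳.

This is where artificial intelligence 🤖 comes in. Classical machine learning methods that rely on statistical and linguistic features are currently widely used for extraction. With deep learning, it is now possible to capture the semantics of a text even better than these classical methods. Classical methods look at the frequency, occurrence, and order of words in a text, whereas neural approaches can capture long-term semantic dependencies and the context of words.

🚀 Quick start

Using the model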

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.FIRST,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
model_name = "ml6team/keyphrase-extraction-distilbert-openkp"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

# Output
# ['keyphrase extraction' 'text analysis']
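
The custom `postprocess` above reduces the aggregated pipeline output to unique phrase strings, which discards the confidence scores. The sketch below reproduces that reduction on hand-written records and adds a variant that keeps the best score per phrase. The records mimic the shape of aggregated token-classification results (`entity_group`, `score`, `word`, `start`, `end`), but every value is made up for illustration, and `best_score_per_phrase` is a hypothetical helper, not part of `transformers`.

```python
import numpy as np

# Hypothetical records shaped like aggregated token-classification output.
# The scores and offsets are invented for illustration only.
mock_results = [
    {"entity_group": "KEY", "score": 0.91, "word": " keyphrase extraction", "start": 1, "end": 21},
    {"entity_group": "KEY", "score": 0.84, "word": "text analysis", "start": 40, "end": 53},
    {"entity_group": "KEY", "score": 0.97, "word": "keyphrase extraction ", "start": 300, "end": 320},
]

# What the custom postprocess does: strip whitespace, then de-duplicate.
# np.unique also sorts the phrases alphabetically.
unique_phrases = np.unique([r["word"].strip() for r in mock_results])


def best_score_per_phrase(results):
    """Keep the highest score seen for each stripped phrase (hypothetical helper)."""
    best = {}
    for r in results:
        phrase = r["word"].strip()
        best[phrase] = max(best.get(phrase, 0.0), r["score"])
    # Sort by descending score so the most confident phrase comes first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


ranked = best_score_per_phrase(mock_results)
print(unique_phrases)
print(ranked)
```

Keeping the scores can be useful when only the top few keyphrases are wanted, at the cost of returning tuples instead of a plain array of strings.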

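✨ Main features

Deep learning based: captures the semantic information of a text better than classical machine learning methods.
Transformer architecture: a Transformer model fine-tuned for token classification, where each word in a document is classified as being part of a keyphrase or not.

📦 Installation

The model card does not list specific installation steps; see the official installation guide for the transformers library.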

📚 Documentation

📓 Model description

This model uses DistilBERT as its base model and is fine-tuned on the OpenKP dataset.

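Keyphrase extraction models are Transformer models fine-tuned as a token classification problem: each word in the document is classified as being part of a keyphrase or not.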

Label	Description
B-KEY	Beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside a keyphrase
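
To make the tagging scheme concrete, the sketch below labels a short word list with the tags from the table. `bio_tags` is a hypothetical helper written for this illustration (it is not part of the model or of `transformers`), and it assumes keyphrases are matched as exact, case-insensitive word sequences.

```python
def bio_tags(words, keyphrases):
    """Tag each word as B-KEY, I-KEY, or O given keyphrases as word lists."""
    tags = ["O"] * len(words)
    lowered = [w.lower() for w in words]
    for phrase in keyphrases:
        target = [w.lower() for w in phrase]
        n = len(target)
        # Slide a window over the words and mark every exact match.
        for start in range(len(words) - n + 1):
            if lowered[start:start + n] == target:
                tags[start] = "B-KEY"
                for k in range(start + 1, start + n):
                    tags[k] = "I-KEY"
    return tags


words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
tags = bio_tags(words, [["keyphrase", "extraction"], ["text", "analysis"]])
for word, tag in zip(words, tags):
    print(f"{tag}\t{word}")
```

The model predicts one such tag per token; consecutive B-KEY and I-KEY tokens are later merged back into phrases.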

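✋ Intended uses and limitations

🛑 Limitations

The number of predicted keyphrases is limited.
Only works for English documents.

❓ How to use

See the usage example in the Quick start section above.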

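📚 Training dataset

OpenKP is a large-scale, open-domain keyphrase extraction dataset containing 148,124 real-world web documents, each annotated by humans with its 1 to 3 most relevant keyphrases.

You can find more information in the paper.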

👷‍♂️ Training procedure

Training parameters

Parameter	Value
Learning rate	1e-4
Epochs	50
Early stopping patience	3
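
The interaction between the 50-epoch cap and the patience of 3 can be sketched as follows. This is a plain-Python illustration, not the author's training loop: the source does not say which metric is monitored, so validation loss is assumed here, and `train_with_early_stopping` and the loss values are hypothetical.

```python
def train_with_early_stopping(val_losses, max_epochs=50, patience=3):
    """Return (epochs_run, best_epoch) for a sequence of validation losses.

    val_losses stands in for evaluating the model after each epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return epochs_run, best_epoch


# Invented losses: the best value is at epoch 3, then three worse epochs follow.
losses = [0.52, 0.41, 0.38, 0.39, 0.40, 0.42, 0.37]
print(train_with_early_stopping(losses))  # (6, 3)
```

With these settings, training rarely runs all 50 epochs: it stops as soon as three consecutive evaluations fail to improve on the best one.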

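Preprocessing

The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only remaining steps are tokenization and realigning the labels so that they correspond to the right subword tokens.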

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/openkp"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    # Tokenize pre-split words; pad and truncate every document to max_length
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        # word_ids maps each token to its source word (None for special tokens)
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled O
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # The first subword of a new word keeps the word's label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later subwords of the same word: B becomes I, others unchanged
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
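
The label realignment is easier to follow on a toy input that needs no tokenizer or dataset download. The sketch below applies the same rule (special tokens get O, the first subword keeps the word's label, later subwords turn B into I) to a hand-written `word_ids` list. `align_labels` is a hypothetical helper, and the subword split shown in the comment is assumed for illustration rather than produced by the real tokenizer.

```python
def align_labels(word_ids, word_labels, special_label="O"):
    """Expand word-level B/I/O labels to one label per subword token.

    word_ids has one entry per token: the index of the source word, or None
    for special and padding tokens.
    """
    aligned = []
    prev_wid = -1
    for wid in word_ids:
        if wid is None:
            aligned.append(special_label)
        elif wid != prev_wid:
            # First subword of a new word keeps the word's label.
            aligned.append(word_labels[wid])
            prev_wid = wid
        else:
            # Later subwords of the same word: B becomes I, others unchanged.
            label = word_labels[wid]
            aligned.append("I" if label == "B" else label)
    return aligned


# Words: ["keyphrase", "extraction", "works"] with word-level labels B, I, O.
# Assumed tokens: [CLS] key ##phrase extraction works [SEP] [PAD]
word_ids = [None, 0, 0, 1, 2, None, None]
aligned = align_labels(word_ids, ["B", "I", "O"])
print(aligned)  # ['O', 'B', 'I', 'I', 'O', 'O', 'O']
```

Turning B into I on continuation subwords keeps exactly one B per keyphrase, so merging tokens back into phrases later stays unambiguous.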

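Post-processing (without the pipeline function)

If you do not use the pipeline function, you must first keep only the tokens labeled B and I. Each B token and the I tokens that follow it are then merged into one keyphrase. Finally, strip the keyphrases to remove any surplus whitespace.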

import numpy as np

# Define post-processing functions
# (idx2label is the mapping defined in the preprocessing snippet)
def concat_tokens_by_tag(keyphrases):
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
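
One property of the functions above is worth noting: O tokens are dropped before grouping, so an I token that follows an O token is still appended to the most recent phrase. The sketch below is a possible variation, not the author's method, that treats O as a phrase boundary instead. `group_adjacent` is a hypothetical helper, and it works on token strings rather than token ids to keep the example self-contained.

```python
def group_adjacent(tokens, labels):
    """Group tokens into phrases, using O labels as phrase boundaries."""
    phrases = []
    current = None  # the phrase being built, or None when outside a phrase
    for token, label in zip(tokens, labels):
        if label == "B":
            current = [token]
            phrases.append(current)
        elif label == "I" and current is not None:
            current.append(token)
        else:
            current = None  # an O token, or an I token with no open phrase
    return phrases


tokens = ["text", "analysis", "is", "fun", "keyphrase", "extraction"]
labels = ["B", "I", "O", "I", "B", "I"]
phrases = group_adjacent(tokens, labels)
print(phrases)  # [['text', 'analysis'], ['keyphrase', 'extraction']]
```

Here the stray I on "fun" is discarded, whereas filtering out O tokens first would attach it to "text analysis". Which behavior is preferable depends on how noisy the predictions are.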