keyphrase-extraction-distilbert-inspec open-source model: free keyphrase extraction from English scientific paper abstracts

Keyphrase Extraction Distilbert Inspec

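Developed by ml6team

A DistilBERT-based English keyphrase extraction model that performs well on scientific paper abstracts.

Sequence labeling

Transformers

Language: English. License: MIT. Tags: #EnglishKeyphraseExtraction #ScientificPaperAbstracts #FineTunedDistilBERT

Downloads: 22.07k

Released: 3/25/2022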

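Model Overview

The model fine-tunes DistilBERT for keyphrase sequence labeling. It automatically extracts the important keyphrases from a document, which helps readers grasp the content quickly.

Model Features

Domain specialization

Tuned for scientific paper abstracts; the training data covers computing, control, and information technology

Lightweight architecture

Built on DistilBERT, a compressed model that reduces compute requirements while largely preserving performance

Sequence labeling approach

Uses the BIO tagging scheme to mark keyphrase boundaries

Model Capabilities

English keyphrase extraction

Scientific literature analysis

Capturing semantic information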

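Use Cases

Academic research

Paper abstract analysis

Automatically extracts the core-concept keyphrases of research papers

F1@M of 0.49

Information retrieval

Building document indexes

Automatically generates search keywords for large document collections

Claimed efficiency gain of 90% over manual annotation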

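🚀 Keyphrase Extraction Model: distilbert-inspec

Keyphrase extraction is a text analysis technique that extracts the important keyphrases from a document. With these keyphrases, readers can quickly and easily understand a text without reading it in full. Keyphrase extraction was originally done mainly by human annotators, who read the text in detail and then wrote down the most important keyphrases. The drawback is that this process is very time-consuming when there are many documents ⏳.

This is where artificial intelligence 🤖 comes in. Classical machine learning methods that use statistical and linguistic features are currently widely used for extraction. With deep learning, it is now possible to capture the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency, occurrence, and order of words in a text, whereas neural approaches can capture long-term semantic dependencies and the context of words.

🚀 Quick Start

You can use this keyphrase extraction model as follows: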

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.FIRST,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
model_name = "ml6team/keyphrase-extraction-distilbert-inspec"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

# Output
['artificial intelligence' 'classical machine learning' 'deep learning'
 'keyphrase extraction' 'linguistic features' 'statistical'
 'text analysis']
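
The `postprocess` method above returns `np.unique(...)`, which sorts its result, so the keyphrases in the output appear alphabetically rather than in document order. If the order of first appearance matters, a minimal sketch of an alternative is shown here. The helper name `unique_in_order` and the sample list are hypothetical and not part of the model card.

```python
def unique_in_order(phrases):
    """Drop duplicate phrases (case-insensitive), keeping first-appearance order."""
    seen = set()
    ordered = []
    for phrase in phrases:
        cleaned = phrase.strip()
        key = cleaned.lower()
        # Skip empty strings and phrases that were already emitted
        if cleaned and key not in seen:
            seen.add(key)
            ordered.append(cleaned)
    return ordered


# Hypothetical raw pipeline words, with stray spaces and one duplicate
raw = [" keyphrase extraction", "text analysis", "keyphrase extraction ", "deep learning"]
print(unique_in_order(raw))
# ['keyphrase extraction', 'text analysis', 'deep learning']
```

Swapping this in for `np.unique` inside `postprocess` would return a plain Python list instead of a NumPy array.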

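✨ Key Features

Deep learning based: a fine-tuned Transformer can capture the semantic meaning of a text better than classical machine learning methods that rely on statistical and linguistic features.
Strong in its domain: performs well at keyphrase extraction on scientific paper abstracts.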

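📚 Documentation

📓 Model Description

This model uses distilbert as its base model and fine-tunes it on the Inspec dataset.

Keyphrase extraction models are Transformer models fine-tuned as a token classification problem: each word in the document is classified as being part of a keyphrase or not.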

Label	Description
B-KEY	At the beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside a keyphrase
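
To make the labeling scheme concrete, here is a minimal sketch that tags a short word sequence with the three labels from the table, given a list of gold keyphrases. The function `bio_tags` and the sample sentence are illustrative assumptions; the Inspec dataset already ships with its own tags.

```python
def bio_tags(words, keyphrases):
    """Tag each word as B-KEY, I-KEY, or O, given gold keyphrases as word lists."""
    tags = ["O"] * len(words)
    lowered = [w.lower() for w in words]
    for phrase in keyphrases:
        target = [w.lower() for w in phrase]
        n = len(target)
        if n == 0:
            continue
        for start in range(len(words) - n + 1):
            span_is_free = all(t == "O" for t in tags[start:start + n])
            # Tag only spans that match the phrase and are not tagged yet
            if lowered[start:start + n] == target and span_is_free:
                tags[start] = "B-KEY"
                for j in range(start + 1, start + n):
                    tags[j] = "I-KEY"
    return tags


words = "Keyphrase extraction is a technique in text analysis".split()
tags = bio_tags(words, [["keyphrase", "extraction"], ["text", "analysis"]])
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

The first word of each keyphrase receives B-KEY, the remaining words I-KEY, and everything else O.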

Related papers:

Kulkarni, Mayank, Debanjan Mahata, Ravneet Arora, and Rajarshi Bhowmik. "Learning Rich Representation of Keyphrases from Text." arXiv preprint arXiv:2112.08547 (2021).
Sahrawat, Dhruva, Debanjan Mahata, Haimin Zhang, Mayank Kulkarni, Agniv Sharma, Rakesh Gosangi, Amanda Stent, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. "Keyphrase extraction as sequence labeling using contextualized embeddings." In European Conference on Information Retrieval, pp. 328-335. Springer, Cham, 2020.

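✋ Intended Uses & Limitations

🛑 Limitations

Domain specificity: this keyphrase extraction model is highly domain-specific. It performs well on scientific paper abstracts and is not recommended for other domains, although you are free to test it.
Language: works only for English documents.

📚 Training Dataset

Inspec is a keyphrase extraction/generation dataset consisting of 2000 English scientific papers from the domains of computers, control, and information technology, published between 1998 and 2002. The keyphrases were annotated by professional indexers or editors.

You can find more information in the paper.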

👷‍♂️ Training Procedure

Training parameters

Parameter	Value
Learning rate	1e-4
Epochs	50
Early stopping patience	3
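
The table gives a patience of 3 alongside a budget of 50 epochs, so training may end well before epoch 50. The sketch below illustrates how such a patience rule behaves. It assumes that validation loss is the monitored metric, which the model card does not state, and the loss values are made up.

```python
def epochs_until_stop(val_losses, patience=3, max_epochs=50):
    """Return how many epochs run before early stopping (or the epoch budget)."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            # New best value: reset the patience counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Made-up validation losses: the best value before stopping is at epoch 3
losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.59]
print(epochs_until_stop(losses))  # 6
```

Epochs 4, 5, and 6 fail to beat the loss of epoch 3, so training stops after epoch 6 and never sees the better value at epoch 7.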

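Preprocessing

The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only remaining step is tokenization, followed by realigning the labels so that they match the right subword tokens.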

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    # Tokenize pre-split words; one word may become several subword tokens
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled "O"
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword of a new word keeps the word's label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later subwords of the same word: "B" becomes "I"
                label = existing_label_ids[i]
                adjusted_label_ids.append(lbl2idx["I" if label == "B" else label])

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
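
The label realignment can be hard to picture without a tokenizer at hand. The following sketch applies the same rule to a hand-written `word_ids` list, so it runs without downloading anything. The helper `align_labels` and the subword split are hypothetical, and unlike the function above it indexes labels by word id directly instead of keeping a counter.

```python
lbl2idx = {"B": 0, "I": 1, "O": 2}


def align_labels(word_ids, word_labels):
    """Expand word-level BIO labels to subword tokens via each token's word index."""
    aligned = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            # Special or padding token
            aligned.append(lbl2idx["O"])
            continue
        label = word_labels[wid]
        if wid != prev_wid:
            # First subword keeps the word's label
            aligned.append(lbl2idx[label])
        else:
            # Continuation subword: "B" becomes "I", other labels are repeated
            aligned.append(lbl2idx["I" if label == "B" else label])
        prev_wid = wid
    return aligned


# Assumed split: word 0 -> 3 subwords, word 1 -> 1 subword, word 2 -> 1 subword
word_labels = ["B", "I", "O"]
word_ids = [None, 0, 0, 0, 1, 2, None]  # None marks special tokens such as [CLS] and [SEP]
print(align_labels(word_ids, word_labels))
# [2, 0, 1, 1, 1, 2, 2]
```

Only the first subword of a keyphrase-initial word keeps the B label, so each keyphrase still has exactly one B token after tokenization.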

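Postprocessing (without the pipeline function)

If you do not use the pipeline function, you must keep only the tokens labeled B or I and then merge each B token with the I tokens that follow it into one keyphrase. Finally, strip unnecessary spaces from the keyphrases.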

import numpy as np

# Define post-processing functions
def concat_tokens_by_tag(keyphrases):
    # Group token ids into keyphrases: "B" starts a phrase, "I" extends the last one
    keyphrase_tokens = []
    for token_id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([token_id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[-1].append(token_id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only the (token id, label) pairs predicted as "B" or "I";
    # idx2label is the mapping defined in the preprocessing step
    keyphrases_list = [
        (token_id, idx2label[label])
        for token_id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    # Merge the tagged tokens into phrases and decode them back to text
    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
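
The grouping step can be tried without a model or tokenizer. This sketch applies the same filter-then-merge idea to plain word strings instead of token ids; the helper `group_tagged_tokens` and the sample tags are hypothetical.

```python
def group_tagged_tokens(tagged_tokens):
    """Group (token, tag) pairs into phrases: "B" starts a phrase, "I" extends it."""
    phrases = []
    for token, tag in tagged_tokens:
        if tag == "B":
            phrases.append([token])
        elif tag == "I" and phrases:
            phrases[-1].append(token)
    return [" ".join(phrase) for phrase in phrases]


tokens = ["deep", "learning", "captures", "semantic", "meaning", "of", "text"]
tags = ["B", "I", "O", "B", "I", "O", "B"]

# Drop "O" tokens first, then merge what is left
kept = [(tok, tag) for tok, tag in zip(tokens, tags) if tag in ("B", "I")]
print(sorted(set(group_tagged_tokens(kept))))
# ['deep learning', 'semantic meaning', 'text']
```

Because O tokens are dropped before grouping, a stray I that follows an O is appended to the preceding phrase. The functions above behave the same way.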

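📝 Evaluation Results

The traditional evaluation metrics are precision, recall, and F1-score @k,m, where k stands for the first k predicted keyphrases and m for the average number of predicted keyphrases.

The model achieves the following results on the Inspec test set:

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M
Inspec test set	0.45	0.40	0.39	0.33	0.53	0.38	0.47	0.57	0.49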
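
For readers unfamiliar with these metrics, here is a minimal sketch of precision, recall, and F1 at a cutoff for a single document. It rests on several assumptions: matching is exact after lowercasing (published evaluations often stem words first), precision is divided by the number of predictions actually scored, and `k=None` scores all predictions as a stand-in for @M. The function name and sample lists are hypothetical, and this is not the evaluation script behind the table.

```python
def precision_recall_f1_at_k(predicted, gold, k=None):
    """Score one document's ranked predictions against its gold keyphrases."""
    top = predicted if k is None else predicted[:k]
    gold_set = {g.lower() for g in gold}
    correct = len({p.lower() for p in top} & gold_set)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


predicted = ["keyphrase extraction", "deep learning", "text", "human annotators"]
gold = [
    "keyphrase extraction",
    "deep learning",
    "text analysis",
    "artificial intelligence",
    "linguistic features",
]

p, r, f1 = precision_recall_f1_at_k(predicted, gold, k=2)
print(f"P@2={p:.2f} R@2={r:.2f} F1@2={f1:.2f}")
# P@2=1.00 R@2=0.40 F1@2=0.57
```

A dataset-level score is usually the mean of these per-document values, which is why a reported F1 need not equal the harmonic mean of the reported precision and recall.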

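🔧 Technical Details

The model is based on the Transformer architecture and is obtained by fine-tuning distilbert on the Inspec dataset for keyphrase extraction. Preprocessing tokenizes each document and realigns the word-level labels to subword tokens. Postprocessing filters the tagged tokens and merges them into the final keyphrases. Training uses a learning rate of 1e-4 and 50 epochs with an early-stopping patience of 3.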