keyphrase-generation-t5-small-inspec open-source model: free keyphrase extraction for scientific paper abstracts

Keyphrase Generation T5 Small Inspec

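Developed by ml6team

A keyphrase generation model fine-tuned from T5-small for scientific paper abstracts. It produces both present keyphrases (those that occur in the text) and absent keyphrases (those that do not).

Text generation

Transformers

Language: English. License: MIT. #scientific-paper-keyphrase-generation #fine-tuned-T5 #English-text-analysis

Downloads: 167

Released: 4/27/2022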

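Model overview

The model treats keyphrase prediction as text-to-text generation: given a document, it outputs a single string of keyphrases joined by a separator. This is useful for getting a quick sense of what a document is about.

Model features

Domain specialization

Performs well on abstracts of scientific papers in computing and control

Two kinds of output

Generates both present keyphrases (found in the document) and absent keyphrases (not found in it)

Semantic understanding

Uses the Transformer architecture to capture long-range semantic dependencies, an advantage over traditional statistical methods

Model capabilities

Keyphrase extraction

Keyphrase generation

Semantic text analysis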

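Use cases

Academic research

Paper abstract analysis

Automatically extracts the key concept phrases of a scientific paper

F1@M of 0.317 on present keyphrases

Document management

Literature indexing

Automatically generates index tags for large document collections

Considerably faster than manual annotation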
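
As a rough illustration of the indexing use case, the sketch below builds an inverted index from keyphrases to document IDs. It is not part of the model card: the function name `build_keyphrase_index` and the sample keyphrases are hypothetical, and in practice the keyphrase lists would come from the model's output.

```python
from collections import defaultdict


def build_keyphrase_index(doc_keyphrases):
    """Map each normalized keyphrase to the sorted IDs of documents tagged with it."""
    index = defaultdict(set)
    for doc_id, keyphrases in doc_keyphrases.items():
        for kp in keyphrases:
            normalized = kp.strip().lower()
            if normalized:
                index[normalized].add(doc_id)
    return {kp: sorted(ids) for kp, ids in index.items()}


# Hypothetical keyphrases, standing in for model output on three abstracts.
doc_keyphrases = {
    "doc-1": ["keyphrase extraction", "text analysis"],
    "doc-2": ["Text Analysis", "deep learning"],
    "doc-3": ["deep learning"],
}

index = build_keyphrase_index(doc_keyphrases)
print(index["text analysis"])  # ['doc-1', 'doc-2']
```

Lowercasing before indexing keeps "Text Analysis" and "text analysis" under one tag; whether that is desirable depends on the collection.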

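🚀 🔑 Keyphrase generation model: T5-small-inspec

Keyphrase extraction is a text analysis technique for pulling the important keyphrases out of a document. With these keyphrases, readers can grasp what a text is about quickly and easily without reading it in full. This model addresses the task through keyphrase generation.

🚀 Quick start

Originally, keyphrase extraction was done mainly by human annotators, who read the text in detail and then wrote down the most important keyphrases. The drawback is that this takes a lot of time when there are many documents ⏳.

This is where artificial intelligence 🤖 comes in. Classical machine learning methods, which use statistical and linguistic features, are widely used for extraction today. Deep learning can capture the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency, occurrence, and order of words in the text, whereas neural approaches can capture long-range semantic dependencies and the context of words.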

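✨ Key features

The model is highly domain-specific and performs well on abstracts of scientific papers.
It can generate both present and absent keyphrases.
Fine-tuning the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization achieved performance comparable to the state of the art, which suggests that learning rich representations of keyphrases benefits many other fundamental NLP tasks.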

💻 Usage example

Basic usage

# Imports
from transformers import (
    Text2TextGenerationPipeline,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)


class KeyphraseGenerationPipeline(Text2TextGenerationPipeline):
    def __init__(self, model, keyphrase_sep_token=";", *args, **kwargs):
        super().__init__(
            model=AutoModelForSeq2SeqLM.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )
        self.keyphrase_sep_token = keyphrase_sep_token

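    def postprocess(self, model_outputs, **kwargs):
        # Forward any extra postprocessing arguments to the parent pipeline.
        results = super().postprocess(model_outputs=model_outputs, **kwargs)
        # Split each generated string on the separator and drop empty pieces.
        return [
            [
                keyphrase.strip()
                for keyphrase in result.get("generated_text").split(self.keyphrase_sep_token)
                if keyphrase.strip()
            ]
            for result in results
        ]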

# Load pipeline
model_name = "ml6team/keyphrase-generation-t5-small-inspec"
generator = KeyphraseGenerationPipeline(model=model_name)

text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = generator(text)

print(keyphrases)

# Output
[['keyphrase extraction', 'text analysis', 'artificial intelligence', 'classical machine learning methods']]

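📚 Detailed documentation

Model description

The model uses T5-small as its base model and is fine-tuned on the Inspec dataset. Keyphrase generation is framed as a text-to-text generation problem: the Transformer is fine-tuned to output a single string in which all keyphrases are joined by a given separator (";"). Models of this kind can generate both present and absent keyphrases.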
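
To make the present/absent distinction concrete, here is a minimal sketch that parses a separator-joined output string and classifies each keyphrase by whether it occurs in the document. The function name `split_present_absent` and the sample strings are assumptions for illustration; the case-insensitive substring test is a simplification and is not claimed to be how the model or its evaluation decides presence.

```python
def split_present_absent(generated_text, document, sep=";"):
    """Split a generated string into keyphrases and classify each one.

    A keyphrase counts as "present" if it occurs verbatim (ignoring case)
    in the document, and as "absent" otherwise.
    """
    doc_lower = document.lower()
    present, absent = [], []
    seen = set()
    for raw in generated_text.split(sep):
        kp = raw.strip()
        key = kp.lower()
        if not kp or key in seen:
            continue  # skip empty pieces and repeated keyphrases
        seen.add(key)
        (present if key in doc_lower else absent).append(kp)
    return present, absent


# Hypothetical document and model output.
document = "Keyphrase extraction is a technique in text analysis."
generated = "keyphrase extraction ; text analysis ; natural language processing ;"

present, absent = split_present_absent(generated, document)
print(present)  # ['keyphrase extraction', 'text analysis']
print(absent)   # ['natural language processing']
```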

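Intended uses and limitations

🛑 Limitations

The model is highly domain-specific and performs well on abstracts of scientific papers. Using it in other domains is not recommended, though you are free to test it.
It only works for English documents.
The output sometimes does not make sense.

❓ How to use

The code example above shows how to use the model for keyphrase extraction.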

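Training dataset

Inspec is a keyphrase extraction/generation dataset of 2,000 English scientific papers from the fields of Computers and Control and Information Technology, published between 1998 and 2002. The keyphrases were annotated by professional indexers or editors.

More information is available in the paper.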

Training procedure

Training parameters

Parameter	Value
Learning rate	5e-5
Epochs	50
Early stopping patience	1
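
The sketch below illustrates what an early stopping patience of 1 means in combination with a cap of 50 epochs. It is a plain-Python simulation, not the training code: the function `epochs_run` is hypothetical, and it assumes the monitored quantity is a validation loss, which the parameter table does not state.

```python
def epochs_run(val_losses, max_epochs=50, patience=1):
    """Return how many epochs run before early stopping triggers.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Made-up validation losses, for illustration only.
print(epochs_run([0.90, 0.70, 0.65, 0.66, 0.60]))  # 4
```

With a patience of 1, a single epoch without improvement ends training, so the run above stops at epoch 4 even though epoch 5 would have been better. The 50-epoch setting therefore acts as an upper bound rather than the expected training length.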

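Preprocessing

The documents in the dataset are already preprocessed into lists of words with their corresponding keyphrases. The only remaining steps are tokenization and joining all keyphrases into a single string with a chosen separator (;).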

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small", add_prefix_space=True)

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"

keyphrase_sep_token = ";"

def preprocess_keyphrases(text_ids, kp_list):
    kp_order_list = []
    kp_set = set(kp_list)
    text = tokenizer.decode(
        text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    text = text.lower()
    for kp in kp_set:
        kp = kp.strip()
        kp_index = text.find(kp.lower())
        kp_order_list.append((kp_index, kp))

    kp_order_list.sort()
    present_kp, absent_kp = [], []

    for kp_index, kp in kp_order_list:
        if kp_index < 0:
            absent_kp.append(kp)
        else:
            present_kp.append(kp)
    return present_kp, absent_kp


def preprocess_function(samples):
    processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
    for i, sample in enumerate(samples[dataset_document_column]):
        # Documents are stored as word lists; rebuild the raw text.
        input_text = " ".join(sample)
        inputs = tokenizer(
            input_text,
            padding="max_length",
            truncation=True,
        )
        present_kp, absent_kp = preprocess_keyphrases(
            text_ids=inputs["input_ids"],
            kp_list=samples["extractive_keyphrases"][i]
            + samples["abstractive_keyphrases"][i],
        )
        # Present keyphrases come first (in order of appearance), then absent ones.
        keyphrases = present_kp + absent_kp

        target_text = f" {keyphrase_sep_token} ".join(keyphrases)

        # text_target replaces the deprecated tokenizer.as_target_tokenizer() context manager.
        targets = tokenizer(
            text_target=target_text, max_length=40, padding="max_length", truncation=True
        )
        # Replace padding IDs with -100 so the loss ignores them.
        targets["input_ids"] = [
            (t if t != tokenizer.pad_token_id else -100)
            for t in targets["input_ids"]
        ]
        for key in inputs.keys():
            processed_samples[key].append(inputs[key])
        processed_samples["labels"].append(targets["input_ids"])
    return processed_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)
# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
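
The label-masking step in the preprocessing code can be looked at in isolation. The sketch below pads a target sequence to a fixed length and replaces the padding IDs with -100, the label value that PyTorch's cross-entropy loss ignores by default. The helper `pad_and_mask_labels` and the token IDs are hypothetical; the defaults assume T5's convention of 0 for padding, and plain truncation is used here, whereas a real tokenizer also keeps the end-of-sequence token when truncating.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_mask_labels(token_ids, max_length, pad_token_id=0):
    """Truncate or pad target IDs to max_length, then hide padding from the loss."""
    padding = [pad_token_id] * max(0, max_length - len(token_ids))
    padded = token_ids[:max_length] + padding
    return [t if t != pad_token_id else IGNORE_INDEX for t in padded]


# Hypothetical token IDs; 1 stands in for the end-of-sequence token.
labels = pad_and_mask_labels([8, 15, 23, 1], max_length=6)
print(labels)  # [8, 15, 23, 1, -100, -100]
```

Without this masking, the model would also be trained to predict padding tokens, which would dominate the loss for short keyphrase strings.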

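Postprocessing

For postprocessing, split the generated string on the keyphrase separator and strip the surrounding whitespace.

def extract_keyphrases(examples):
    # Split on the separator, trim whitespace, and drop empty pieces.
    return [
        [kp.strip() for kp in example.split(keyphrase_sep_token) if kp.strip()]
        for example in examples
    ]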

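Evaluation results

The standard evaluation metrics are precision, recall, and F1 score at k and at M (P@k, R@k, F1@k and P@M, R@M, F1@M), where k means that only the first k predicted keyphrases are scored and M refers to the average number of predicted keyphrases. Keyphrase generation is also evaluated with F1@O, where O is the number of ground-truth keyphrases.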
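
As a rough sketch of how these metrics can be computed for a single document, the function below scores the first `cutoff` predictions against the gold keyphrases. Several choices here are assumptions rather than the evaluation protocol behind the reported numbers: matching is exact after lowercasing (real evaluations usually also stem), precision is divided by the number of predictions actually kept rather than by k, and @M is taken to mean "all predictions for the document". Results from this sketch are therefore not expected to reproduce the reported tables.

```python
def _normalize(kp):
    return kp.strip().lower()


def precision_recall_f1(predicted, gold, cutoff=None):
    """Score the first `cutoff` unique predictions against the gold keyphrases."""
    gold_set = {_normalize(kp) for kp in gold if kp.strip()}
    ranked = []
    for kp in predicted:
        kp = _normalize(kp)
        if kp and kp not in ranked:
            ranked.append(kp)  # keep prediction order, drop duplicates
    if cutoff is not None:
        ranked = ranked[:cutoff]
    matches = sum(kp in gold_set for kp in ranked)
    precision = matches / len(ranked) if ranked else 0.0
    recall = matches / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1


# Hypothetical predictions and gold keyphrases for one document.
predicted = ["keyphrase extraction", "text analysis", "deep learning", "statistics"]
gold = ["keyphrase extraction", "deep learning", "neural networks"]

print(precision_recall_f1(predicted, gold, cutoff=2))          # @k with k = 2
print(precision_recall_f1(predicted, gold))                    # @M: all predictions
print(precision_recall_f1(predicted, gold, cutoff=len(gold)))  # @O: as many as gold
```

Corpus-level scores are usually obtained by averaging these per-document values over the test set.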

The model achieves the following results on the Inspec test set:

Extractive keyphrases

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M	P@O	R@O	F1@O
Inspec test set	0.33	0.31	0.29	0.17	0.31	0.20	0.41	0.31	0.32	0.28	0.28	0.28

Abstractive keyphrases

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M	P@O	R@O	F1@O
Inspec test set	0.05	0.09	0.06	0.03	0.09	0.04	0.08	0.09	0.07	0.06	0.06	0.06

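🔧 Technical details

In this work, we explore how to learn task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training Transformer language models (LMs) in discriminative as well as generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), which shows large gains in performance (up to 9.26 points in F1) over the state of the art when an LM pre-trained with KBIR is fine-tuned for keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, which reproduces the keyphrases related to the input text in the CatSeq format instead of the denoised original input. This also leads to gains in performance (up to 4.33 points in F1@M) over the state of the art for keyphrase generation. Additionally, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization, and achieve performance comparable to the state of the art, showing that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.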