keyphrase-generation-t5-small-inspec: an open-source model for extracting keyphrases from scientific paper abstracts

Keyphrase Generation T5 Small Inspec

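Developed by ml6team

A keyphrase generation model fine-tuned from T5-small and designed for scientific paper abstracts. It produces both present and absent keyphrases.

Text Generation

Transformers

English | License: MIT | #ScientificPaperKeyphraseGeneration #FineTunedT5 #EnglishTextAnalysis

Downloads: 167

Released: 4/27/2022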

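Model Overview

The model extracts keyphrases directly from a document through text-to-text generation. Its output is a single string of keyphrases joined by a separator, which makes it useful for getting a quick grasp of a document's content.

Model Features

Domain specialization

Performs well on scientific paper abstracts from the Computers and Control domain

Two kinds of output

Generates both present keyphrases (found in the document) and absent keyphrases (not found in it)

Semantic understanding

The Transformer architecture captures long-range semantic dependencies, which traditional statistical methods do not

Model Capabilities

Keyphrase extraction

Keyphrase generation

Text semantic analysis

Use Cases

Academic research

Paper abstract analysis

Automatically extracts the core concept phrases of a scientific paper

F1@M reaches 0.317 for present keyphrases

Document management

Literature indexing

Automatically generates index tags for large document collections

Considerably more efficient than manual annotation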

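🔑 Keyphrase Generation Model: T5-small-inspec

Keyphrase extraction is a text analysis technique that pulls the important keyphrases out of a document, so that readers can understand its content quickly without reading it in full. This model treats the task as keyphrase generation: a fine-tuned T5-small produces the keyphrases as text.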

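🚀 Quick Start

Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that this process takes a lot of time when there are many documents to work through ⏳.

This is where Artificial Intelligence 🤖 comes in. Classical machine learning methods that use statistical and linguistic features are currently widely used for extraction. Deep learning can capture the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency, occurrence, and order of words in the text, whereas neural approaches can capture long-term semantic dependencies and the context of words in a text.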

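✨ Key Features

This keyphrase generation model is highly domain-specific and performs well on scientific paper abstracts.
It can generate both present and absent keyphrases.
Fine-tuning the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization achieved performance comparable to the state of the art, which suggests that learning rich representations of keyphrases also benefits many other fundamental NLP tasks.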

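📦 Installation

The source documentation lists no installation commands. The examples below import the transformers library, and the preprocessing code also imports datasets.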

💻 Usage Examples

Basic usage

# Imports
from transformers import (
    Text2TextGenerationPipeline,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)


class KeyphraseGenerationPipeline(Text2TextGenerationPipeline):
    def __init__(self, model, keyphrase_sep_token=";", *args, **kwargs):
        super().__init__(
            model=AutoModelForSeq2SeqLM.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )
        self.keyphrase_sep_token = keyphrase_sep_token

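    def postprocess(self, model_outputs):
        results = super().postprocess(
            model_outputs=model_outputs
        )
        # Split each generated string on the separator, trim whitespace,
        # and drop empty entries left by stray or trailing separators.
        return [
            [
                keyphrase.strip()
                for keyphrase in result.get("generated_text").split(self.keyphrase_sep_token)
                if keyphrase.strip()
            ]
            for result in results
        ]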

# Load pipeline
model_name = "ml6team/keyphrase-generation-t5-small-inspec"
generator = KeyphraseGenerationPipeline(model=model_name)

text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = generator(text)

print(keyphrases)

# Output
[['keyphrase extraction', 'text analysis', 'artificial intelligence', 'classical machine learning methods']]

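📚 Documentation

Model Description

This model uses T5-small as its base model and is fine-tuned on the Inspec dataset. Keyphrase generation is framed as a text-to-text generation problem: the Transformer is fine-tuned to generate the keyphrases, and the result is a single string in which all keyphrases are joined by a given separator (here ";"). The model can generate both present and absent keyphrases.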
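To make the distinction between present and absent keyphrases concrete, here is a minimal sketch that takes a separator-joined output string and sorts its keyphrases into those found in the source text and those that are not. The helper name `split_present_absent` and the case-insensitive substring test are assumptions made for illustration, not necessarily how the model authors classify keyphrases.

```python
def split_present_absent(text, generated, sep=";"):
    """Sort generated keyphrases into present/absent relative to `text`.

    Hypothetical helper: "present" here means the lowercased keyphrase is a
    substring of the lowercased text, which is an assumption of this sketch.
    """
    haystack = text.lower()
    present, absent = [], []
    for raw in generated.split(sep):
        keyphrase = raw.strip()
        if not keyphrase:
            continue  # skip empty pieces left by stray separators
        if keyphrase.lower() in haystack:
            present.append(keyphrase)
        else:
            absent.append(keyphrase)
    return present, absent


text = "Keyphrase extraction is a technique in text analysis."
generated = "keyphrase extraction ; text analysis ; natural language processing ;"
present, absent = split_present_absent(text, generated)
print(present)  # ['keyphrase extraction', 'text analysis']
print(absent)   # ['natural language processing']
```

The example strings are made up; with real model output, `generated` would be the raw string the pipeline produces before it is split.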

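Intended Uses & Limitations

🛑 Limitations

This keyphrase generation model is very domain-specific and performs well on scientific paper abstracts. Using it in other domains is not recommended, but you are free to test it.
It only works for English documents.
The output sometimes makes no sense.

❓ How to Use

The code example under Basic usage above shows how to run the model on a text.

Training Dataset

Inspec is a keyphrase extraction/generation dataset consisting of 2,000 English scientific papers from the Computers and Control and Information Technology domains, published between 1998 and 2002. The keyphrases were annotated by professional indexers or editors.

More information is available in the paper.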

Training Procedure

Training Parameters

Parameter	Value
Learning Rate	5e-5
Epochs	50
Early Stopping Patience	1
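An early-stopping patience of 1 means training ends the first time the monitored validation metric fails to improve, so the 50 epochs act as an upper bound. The toy sketch below illustrates that rule on made-up validation losses. `epochs_run` is a hypothetical helper, and using validation loss as the monitored metric is an assumption, since the parameters above do not say which metric was monitored.

```python
def epochs_run(val_losses, max_epochs=50, patience=1):
    """Return how many epochs run before early stopping triggers.

    Illustrative only: a lower loss counts as an improvement, and training
    stops once `patience` consecutive epochs bring no improvement.
    """
    best = float("inf")
    epochs_without_improvement = 0
    losses = val_losses[:max_epochs]
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop right after this epoch
    return len(losses)


# Made-up losses: the fourth epoch is worse than the third, so training stops.
print(epochs_run([1.0, 0.8, 0.7, 0.75, 0.6]))  # 4
```

With a patience of 1, a single noisy epoch ends training, which keeps runs short but can stop before a later improvement such as the 0.6 in the example.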

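Preprocessing

The documents in the dataset are already preprocessed into lists of words with their corresponding keyphrases. The only remaining steps are tokenization and joining all keyphrases into one string with a chosen separator (;).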

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small", add_prefix_space=True)

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"

keyphrase_sep_token = ";"

def preprocess_keyphrases(text_ids, kp_list):
    kp_order_list = []
    kp_set = set(kp_list)
    text = tokenizer.decode(
        text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    text = text.lower()
    for kp in kp_set:
        kp = kp.strip()
        kp_index = text.find(kp.lower())
        kp_order_list.append((kp_index, kp))

    kp_order_list.sort()
    present_kp, absent_kp = [], []

    for kp_index, kp in kp_order_list:
        if kp_index < 0:
            absent_kp.append(kp)
        else:
            present_kp.append(kp)
    return present_kp, absent_kp


def preprocess_function(samples):
    processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
    for i, sample in enumerate(samples[dataset_document_column]):
        # Each document is stored as a list of words
        input_text = " ".join(sample)
        inputs = tokenizer(
            input_text,
            padding="max_length",
            truncation=True,
        )
        present_kp, absent_kp = preprocess_keyphrases(
            text_ids=inputs["input_ids"],
            kp_list=samples["extractive_keyphrases"][i]
            + samples["abstractive_keyphrases"][i],
        )
        # Present keyphrases (in order of appearance) come first, then absent ones
        keyphrases = present_kp + absent_kp

        target_text = f" {keyphrase_sep_token} ".join(keyphrases)

        # as_target_tokenizer() is deprecated in newer transformers releases,
        # where passing text_target= to the tokenizer replaces it
        with tokenizer.as_target_tokenizer():
            targets = tokenizer(
                target_text, max_length=40, padding="max_length", truncation=True
            )
            # Replace padding ids with -100 so the loss ignores them
            targets["input_ids"] = [
                (t if t != tokenizer.pad_token_id else -100)
                for t in targets["input_ids"]
            ]
        for key in inputs.keys():
            processed_samples[key].append(inputs[key])
        processed_samples["labels"].append(targets["input_ids"])
    return processed_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)
# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
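The preprocessing code replaces padding token ids in the labels with -100. A small standalone sketch shows the effect, assuming T5's pad token id of 0 and relying on -100 being the default index that PyTorch's cross-entropy loss ignores. `mask_padding` is a hypothetical helper and the token ids are made up.

```python
PAD_TOKEN_ID = 0     # pad token id of the T5 tokenizer
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids so padded positions do not contribute to the loss."""
    return [t if t != pad_token_id else IGNORE_INDEX for t in label_ids]


# Made-up label ids: three content tokens, the end-of-sequence id 1, then padding.
labels = [3, 843, 27111, 1, 0, 0, 0]
print(mask_padding(labels))  # [3, 843, 27111, 1, -100, -100, -100]
```

Without this masking, the model would also be trained to predict the padding that fills each target out to its fixed length of 40 tokens.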

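Postprocessing

For postprocessing, split the generated string on the keyphrase separator.

def extract_keyphrases(examples):
    # Split on the separator, trim whitespace, and drop empty entries
    return [
        [kp.strip() for kp in example.split(keyphrase_sep_token) if kp.strip()]
        for example in examples
    ]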

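Evaluation Results

Keyphrase models are traditionally evaluated with precision, recall, and F1 score at k and at M (P@k, R@k, F1@k, and likewise @M), where k stands for the top k predicted keyphrases and M for the average number of predicted keyphrases. Keyphrase generation also reports F1@O, where O is the number of ground-truth keyphrases.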
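The sketch below shows how these scores can be computed for a single document. It rests on several assumptions that are not taken from the model card: matching is exact after lowercasing (published evaluations often also stem words), duplicate predictions are dropped, precision is divided by the number of predictions actually scored, and @M is read as scoring every prediction the model made for the document. `prf_at` is a hypothetical helper and the keyphrases are made up.

```python
def prf_at(predicted, gold, k=None):
    """Precision, recall, and F1 over the top-k predictions (all if k is None)."""
    cleaned = (p.strip().lower() for p in predicted)
    # Drop empties and duplicates while keeping the model's ranking order
    preds = list(dict.fromkeys(p for p in cleaned if p))
    gold_set = {g.strip().lower() for g in gold}

    top = preds if k is None else preds[:k]
    correct = sum(1 for p in top if p in gold_set)

    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = ["keyphrase extraction", "text analysis", "deep learning", "neural networks"]
gold = ["keyphrase extraction", "text analysis", "artificial intelligence"]

at_m = prf_at(predicted, gold)               # @M: all predictions
at_o = prf_at(predicted, gold, k=len(gold))  # @O: as many as there are gold keyphrases
print([round(x, 3) for x in at_m])  # [0.5, 0.667, 0.571]
print([round(x, 3) for x in at_o])  # [0.667, 0.667, 0.667]
```

Reported numbers such as those in the tables below are averages of per-document scores over the test set, so a single document's values can differ widely from them.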

The model achieves the following results on the Inspec test set:

Extractive Keyphrases

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M	P@O	R@O	F1@O
Inspec Test Set	0.33	0.31	0.29	0.17	0.31	0.20	0.41	0.31	0.32	0.28	0.28	0.28

Abstractive Keyphrases

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M	P@O	R@O	F1@O
Inspec Test Set	0.05	0.09	0.06	0.03	0.09	0.04	0.08	0.09	0.07	0.06	0.06	0.06

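🔧 Technical Details

In this work, we explore how to learn task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training Transformer language models (LMs) in both discriminative and generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), which yields large gains over the state of the art (up to 9.26 points in F1) when an LM pre-trained with KBIR is fine-tuned for keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, which reproduces the keyphrases related to the input text in the CatSeq format instead of the denoised original input. This also improves on the state of the art for keyphrase generation (up to 4.33 points in F1@M). In addition, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization, and achieve performance comparable to the state of the art, which shows that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.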