keyphrase-extraction-distilbert-inspec open-source model: free keyphrase extraction for English scientific paper abstracts

Keyphrase Extraction Distilbert Inspec

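Developed by ml6team

An English keyphrase extraction model based on DistilBERT that performs well on scientific paper abstracts.

Sequence labeling

Transformers

English | License: MIT | #English-keyphrase-extraction #scientific-paper-abstracts #DistilBERT-fine-tuning

Downloads: 22.07k

Released: 3/25/2022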

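Model overview

This model fine-tunes DistilBERT for keyphrase sequence labeling. It automatically extracts the important keyphrases from a document, which helps readers grasp the content of a text quickly.

Model features

Domain specialization

Optimized for scientific paper abstracts; performs best in the computer and control domains

Lightweight architecture

Built on DistilBERT, a distilled model that reduces compute requirements while largely preserving performance

Sequence labeling approach

Uses the BIO tagging scheme to capture keyphrase boundaries

Model capabilities

English keyphrase extraction

Scientific literature analysis

Semantic information capture

Use cases

Academic research

Paper abstract analysis

Automatically extracts the keyphrases that describe a research paper's core concepts

F1@M of 0.49

Information retrieval

Document index construction

Automatically generates search keyphrases for large document collections

Claimed efficiency gain over manual annotation: 90%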

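🚀 Keyphrase extraction model: distilbert-inspec

Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text quickly and easily without reading it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents, this process can take a lot of time ⏳.

Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. Now, with deep learning, it is possible to capture the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency, occurrence, and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies and the context of words in a text.

🚀 Quick start

You can use this keyphrase extraction model as follows: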

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.FIRST,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
model_name = "ml6team/keyphrase-extraction-distilbert-inspec"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

# Output
['artificial intelligence' 'classical machine learning' 'deep learning'
 'keyphrase extraction' 'linguistic features' 'statistical'
 'text analysis']

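✨ Key features

Deep learning based: uses deep learning to capture the semantic meaning of text better than classical machine learning methods.
Strong in its target domain: performs well at keyphrase extraction from scientific paper abstracts.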

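📚 Detailed documentation

📓 Model description

This model uses distilbert as its base model and is fine-tuned on the Inspec dataset.

Keyphrase extraction models are transformer models fine-tuned as a token classification problem: each word in the document is classified as being part of a keyphrase or not.

Label	Description
B-KEY	At the beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside a keyphrase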
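
To make the labels concrete, here is a minimal sketch that tags a short sentence given a list of known keyphrases. The helper `bio_tags` and the example sentence are illustrative assumptions, not part of the model's code, and the sketch uses the short label names B, I, and O that appear in the preprocessing code.

```python
def bio_tags(words, keyphrases):
    """Tag each word as B (begins a keyphrase), I (inside one), or O (outside)."""
    tags = ["O"] * len(words)
    lowered = [w.lower() for w in words]
    for phrase in keyphrases:
        target = phrase.lower().split()
        if not target:
            continue  # skip empty phrases
        n = len(target)
        # Slide a window over the words and mark every exact match.
        for start in range(len(words) - n + 1):
            if lowered[start:start + n] == target:
                tags[start] = "B"
                for j in range(start + 1, start + n):
                    tags[j] = "I"
    return tags


words = "Keyphrase extraction is a technique in text analysis".split()
tags = bio_tags(words, ["keyphrase extraction", "text analysis"])
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

The model predicts one such tag per token; consecutive B and I tags are then joined back into keyphrases.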

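✋ Intended uses and limitations

🛑 Limitations

Domain specificity: this keyphrase extraction model is very domain-specific. It performs well on abstracts of scientific papers and is not recommended for documents from other domains, although you are free to test it.
Language: only works for English documents.

📚 Training dataset

Inspec is a keyphrase extraction/generation dataset consisting of 2000 English scientific papers from the scientific domains of Computers and Control and Information Technology, published between 1998 and 2002. The keyphrases were annotated by professional indexers or editors.

You can find more information in the paper.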

👷‍♂️ Training procedure

Training parameters

Parameter	Value
Learning rate	1e-4
Epochs	50
Early stopping patience	3
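
One way to read the patience value: training stops once the monitored validation metric has failed to improve for 3 consecutive evaluations, even if the 50-epoch budget is not used up. The sketch below assumes that validation loss is the monitored metric and that evaluation happens once per epoch; the source states neither, and the loss values are made up for illustration.

```python
def stop_epoch(eval_losses, max_epochs=50, patience=3):
    """Return (epoch at which training stops, best validation loss seen)."""
    best = float("inf")
    bad_epochs = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(eval_losses[:max_epochs], start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best
    return min(len(eval_losses), max_epochs), best


# Hypothetical validation losses, one per epoch.
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
stopped_at, best_loss = stop_epoch(losses)
print(stopped_at, best_loss)
```

In this made-up run the loss stops improving after epoch 3, so training ends at epoch 6 and never reaches the lower loss at epoch 7.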

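Preprocessing

The documents in the dataset are already preprocessed into lists of words with the corresponding labels. The only steps left are tokenization and realigning the labels so that they correspond to the right subword tokens.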

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/inspec"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_fuction(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_fuction, batched=True)
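
The label realignment can be tried on a toy input without downloading the tokenizer or the dataset. In this sketch the `word_ids` list is written by hand, and the split of "keyphrase" into two subwords is hypothetical. `align_labels` is an illustrative helper rather than the model's own code: it indexes labels by word id instead of using a running counter, but it applies the same rules (special tokens get O, the first subword keeps the word's label, and a continuation subword of a B word becomes I).

```python
def align_labels(word_ids, word_labels):
    """Expand word-level BIO labels to one label per subword token."""
    aligned = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS], [SEP], or padding.
            aligned.append("O")
        elif wid != prev_wid:
            # First subword of a new word keeps the word's label.
            aligned.append(word_labels[wid])
            prev_wid = wid
        else:
            # Continuation subword: B becomes I, other labels are copied.
            label = word_labels[wid]
            aligned.append("I" if label == "B" else label)
    return aligned


word_labels = ["B", "I", "O"]  # labels for: keyphrase, extraction, works
# Hypothetical tokenization: [CLS] key ##phrase extraction works [SEP]
word_ids = [None, 0, 0, 1, 2, None]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```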

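Postprocessing (without the pipeline function)

If you do not use the pipeline function, you must keep only the tokens labeled B or I and discard the rest. Each B token, together with the I tokens that follow it, is then merged into one keyphrase. Finally, strip unnecessary whitespace from the keyphrases.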

import numpy as np  # idx2label comes from the preprocessing snippet

# Define post_process functions
def concat_tokens_by_tag(keyphrases):
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
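
The merging step can be illustrated on plain strings instead of token ids. This is a minimal sketch with made-up tokens and labels; `merge_bio` is a hypothetical helper that mirrors the B/I grouping above and returns the unique keyphrases in sorted order.

```python
def merge_bio(tokens, labels):
    """Group each B token with the I tokens that follow it into one phrase."""
    phrases = []
    for token, label in zip(tokens, labels):
        if label == "B":
            phrases.append([token])      # start a new keyphrase
        elif label == "I" and phrases:
            phrases[-1].append(token)    # extend the most recent keyphrase
        # Tokens labeled O are skipped.
    return sorted({" ".join(p).strip() for p in phrases})


tokens = ["deep", "learning", "captures", "text", "semantics", "deep", "learning"]
labels = ["B", "I", "O", "B", "O", "B", "I"]
keyphrases = merge_bio(tokens, labels)
print(keyphrases)
```

As in the functions above, an I token that appears after an O token is attached to the most recent keyphrase, because O tokens are dropped before grouping.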

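📝 Evaluation results

The traditional evaluation metrics are precision, recall, and F1-score @k,m, where k is the number of top predicted keyphrases considered and m is the average number of predicted keyphrases.

The model achieves the following results on the Inspec test set:

Dataset	P@5	R@5	F1@5	P@10	R@10	F1@10	P@M	R@M	F1@M
Inspec test set	0.45	0.40	0.39	0.33	0.53	0.38	0.47	0.57	0.49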
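
As a rough illustration of these metrics, the sketch below scores one document's predictions against its gold keyphrases. It assumes exact matching after lowercasing and treats `k=None` as "score every prediction", which is one common reading of @M; the source does not spell out its matching or averaging procedure, so the reported numbers may be computed differently. The keyphrase lists are made up.

```python
def precision_recall_f1_at_k(predicted, gold, k=None):
    """Score the first k predictions (all of them if k is None) for one document."""
    top = predicted if k is None else predicted[:k]
    top_set = {p.lower() for p in top}
    gold_set = {g.lower() for g in gold}
    hits = len(top_set & gold_set)
    precision = hits / len(top_set) if top_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


predicted = ["deep learning", "text analysis", "neural network", "keyphrase extraction"]
gold = ["deep learning", "keyphrase extraction", "semantics"]

p2, r2, f2 = precision_recall_f1_at_k(predicted, gold, k=2)
pm, rm, fm = precision_recall_f1_at_k(predicted, gold)
print(f"P@2={p2:.2f} R@2={r2:.2f} F1@2={f2:.2f}")
print(f"P@M={pm:.2f} R@M={rm:.2f} F1@M={fm:.2f}")
```

Corpus-level scores would then be obtained by averaging these per-document values over the test set.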

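🔧 Technical details

The model is based on the Transformer architecture and is obtained by fine-tuning distilbert on the Inspec dataset for keyphrase extraction. Training relies on dedicated preprocessing and postprocessing steps: preprocessing tokenizes the documents and realigns the labels to subword tokens, and postprocessing filters and merges the labeled tokens into the final keyphrases. Training uses a learning rate of 1e-4 and up to 50 epochs, with an early stopping patience of 3.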