keyphrase-extraction-distilbert-openkp: Open-Source English Keyphrase Extraction Model

Keyphrase Extraction Distilbert Openkp

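Developed by ml6team

An English keyphrase extraction model based on the DistilBERT architecture and fine-tuned on the OpenKP dataset. It automatically identifies keyphrases in text.

Sequence labeling

Transformers

Language: English  License: MIT  #English keyphrase extraction  #Web content analysis  #Sequence labeling model

Downloads: 32

Released: 3/25/2022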

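Model Overview

The model automatically extracts important keyphrases from a text, so readers can grasp the core content of a document without reading it in full. It suits scenarios such as document summarization and information retrieval.

Model Features

Efficient keyphrase extraction

Extracts keyphrases from text quickly and accurately, which speeds up document processing.

Deep learning support

Uses a neural network architecture that captures semantic information and contextual relationships in text better than traditional methods.

Lightweight model

Based on the DistilBERT architecture, which reduces compute requirements while maintaining performance.

Model Capabilities

Automatic keyphrase extraction

Text semantic analysis

Document content summarization

Use Cases

Information processing

Document summary generation

Automatically extracts key information from documents to produce concise summaries

Helps users quickly grasp the core content of a document

Search engine optimization

Extracts keywords from web page content for SEO

Improves the relevance ranking of web pages in search results

Content analysis

News trend analysis

Extracts keywords from news articles to identify trending topics

Supports media monitoring and trend analysis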

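🚀 Keyphrase Extraction Model: distilbert-openkp

Keyphrase extraction is a text analysis technique that extracts the important keyphrases from a document. With these keyphrases, readers can understand the content of a text quickly and easily without reading it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that this process takes a lot of time when there are many documents to handle ⏳.

This is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. With deep learning, it is now possible to capture the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency, occurrence, and order of words in the text, whereas neural approaches can capture long-term semantic dependencies and the context of words in a text.

🚀 Quick Start

Model Usage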

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.FIRST,
        )
        return np.unique([result.get("word").strip() for result in results])

# Load pipeline
model_name = "ml6team/keyphrase-extraction-distilbert-openkp"
extractor = KeyphraseExtractionPipeline(model=model_name)

# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time. 

Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)

print(keyphrases)

# Output
['keyphrase extraction' 'text analysis']

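✨ Key Features

Deep learning based: uses deep learning to capture the semantics of a text better than classical machine learning methods.
Transformer architecture: a transformer model fine-tuned for token classification, where each word in the document is classified as being part of a keyphrase or not.

📦 Installation

The model card gives no specific installation steps. Refer to the official installation guide of the transformers library.

📚 Documentation

📓 Model Description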

This model uses DistilBERT as its base model and is fine-tuned on the OpenKP dataset.

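Keyphrase extraction models are transformer models fine-tuned as a token classification problem: each word in the document is classified as being part of a keyphrase or not.

Label	Description
B-KEY	Beginning of a keyphrase
I-KEY	Inside a keyphrase
O	Outside any keyphrase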
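
To make the tagging scheme concrete, here is a minimal sketch (not taken from the model card) that groups word-level tags into keyphrases. The function name `spans_from_tags` and the sample sentence with its tags are made up for illustration; the model itself predicts labels per subword token, which the pipeline then aggregates.

```python
def spans_from_tags(words, tags):
    """Group words into keyphrases using B-KEY / I-KEY / O tags."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B-KEY":
            # A new keyphrase starts; close any phrase still in progress.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I-KEY" and current:
            # Continue the keyphrase that is currently open.
            current.append(word)
        else:
            # "O" (or a stray I-KEY with no open phrase) ends the phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-labeled toy example (hypothetical tags, not model output)
words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
tags = ["B-KEY", "I-KEY", "O", "O", "O", "O", "B-KEY", "I-KEY"]
print(spans_from_tags(words, tags))  # ['Keyphrase extraction', 'text analysis']
```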

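✋ Intended Uses & Limitations

🛑 Limitations

The number of predicted keyphrases is limited.
Works only for English documents.

❓ How to Use

A usage example is given in the Quick Start section above.

📚 Training Dataset

OpenKP is a large-scale, open-domain keyphrase extraction dataset with 148,124 real-world web documents, each annotated by humans with its 1-3 most relevant keyphrases.

More information can be found in the paper.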

👷‍♂️ Training Procedure

Training Parameters

Parameter	Value
Learning rate	1e-4
Epochs	50
Early stopping patience	3
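
The card does not say which metric early stopping monitors or how the training loop is implemented. As a rough sketch, assuming validation loss is monitored and lower is better, the 50-epoch budget and a patience of 3 interact as follows. The helper `epochs_run` and the loss values are hypothetical.

```python
def epochs_run(val_losses, max_epochs=50, patience=3):
    """Return how many epochs run before training stops.

    Training stops once the monitored loss has failed to improve for
    `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Made-up losses: best value at epoch 3, then three epochs without improvement.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
print(epochs_run(losses))  # 6
```

With these settings, 50 is an upper bound: training ends earlier whenever three consecutive epochs bring no improvement.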

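Preprocessing

The documents in the dataset are already preprocessed into lists of words with corresponding labels. The only remaining step is tokenization, followed by realigning the labels so that they match the right subword tokens.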

from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/openkp"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    # Tokenize the pre-split words; pad or truncate every document to max_length
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        # word_ids maps each subword token to its source word (None for special/padding tokens)
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []

        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled "O"
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword of a new word: keep the word's label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Later subwords of the same word: "B" becomes "I", other labels are unchanged
                label = existing_label_ids[i]
                adjusted_label_ids.append(lbl2idx["I" if label == "B" else label])

        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
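
To see what the realignment loop above does, here is a simplified, tokenizer-free sketch of the same idea. The `word_ids` list is written by hand to mimic what a fast tokenizer's `word_ids()` returns (None for special and padding tokens, otherwise the index of the source word), and `align_labels` is an illustrative name, not part of the original code. It indexes labels by word id directly instead of using a running counter, which gives the same result when word ids are consecutive.

```python
LBL2IDX = {"B": 0, "I": 1, "O": 2}


def align_labels(word_ids, word_labels):
    """Expand word-level B/I/O labels to subword-token level."""
    aligned = []
    prev_wid = None
    for wid in word_ids:
        if wid is None:
            # Special or padding token: labeled "O", as in the code above.
            aligned.append(LBL2IDX["O"])
            continue
        label = word_labels[wid]
        if wid != prev_wid:
            # First subword of a word keeps the word's label.
            aligned.append(LBL2IDX[label])
        else:
            # Later subwords of a "B" word become "I"; others keep their label.
            aligned.append(LBL2IDX["I" if label == "B" else label])
        prev_wid = wid
    return aligned


# Hypothetical split: word 0 -> 2 subwords, words 1 and 2 -> 1 subword each,
# wrapped in [CLS] ... [SEP] plus one padding token.
word_ids = [None, 0, 0, 1, 2, None, None]
word_labels = ["B", "I", "O"]
print(align_labels(word_ids, word_labels))  # [2, 0, 1, 1, 2, 2, 2]
```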

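Postprocessing (Without the Pipeline Function)

If you do not use the pipeline function, first keep only the tokens labeled B or I. Then merge each B token with the I tokens that follow it into one keyphrase. Finally, strip each keyphrase to remove any surplus whitespace.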

import numpy as np

# Define post_process functions (idx2label is the mapping defined in the preprocessing code)
def concat_tokens_by_tag(keyphrases):
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
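
A toy run of the merging step, with a hand-made vocabulary standing in for the tokenizer. `TOY_VOCAB`, `merge_tagged_tokens`, `toy_decode`, and the token ids are hypothetical; a real run would decode with `tokenizer.batch_decode` as in the code above.

```python
TOY_VOCAB = {101: "key", 102: "##phrase", 103: "extraction", 104: "text", 105: "analysis"}


def merge_tagged_tokens(tagged_tokens):
    """Group (token_id, label) pairs into one id list per keyphrase."""
    groups = []
    for token_id, label in tagged_tokens:
        if label == "B":
            # "B" opens a new keyphrase.
            groups.append([token_id])
        elif label == "I" and groups:
            # "I" extends the most recently opened keyphrase.
            groups[-1].append(token_id)
    return groups


def toy_decode(token_ids):
    """Join WordPiece-style pieces; '##' marks a continuation of the previous piece."""
    text = ""
    for token_id in token_ids:
        piece = TOY_VOCAB[token_id]
        if piece.startswith("##"):
            text += piece[2:]
        else:
            text += (" " if text else "") + piece
    return text


# Tokens already filtered down to B/I labels ("O" tokens removed beforehand).
tagged = [(101, "B"), (102, "I"), (103, "I"), (104, "B"), (105, "I"), (104, "B"), (105, "I")]
groups = merge_tagged_tokens(tagged)
keyphrases = sorted({toy_decode(g).strip() for g in groups})
print(keyphrases)  # ['keyphrase extraction', 'text analysis']
```

One consequence of filtering before merging: because O tokens are dropped first, an I token that follows an O is appended to the most recent keyphrase rather than being discarded.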