🚀 🔑 Keyphrase Extraction Model: KBIR-OpenKP
Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text very quickly and easily without reading it completely. Originally, keyphrase extraction was done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents, this process can take a lot of time ⏳.
Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. Now, with deep learning, it is possible to capture the semantic meaning of a text even better than with these classical methods. Classical methods look at the frequency, occurrence and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies and the context of words in a text.
🚀 Quick Start
Import the dependencies
from transformers import (
TokenClassificationPipeline,
AutoModelForTokenClassification,
AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np
Define the keyphrase extraction pipeline
# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        # Merge sub-tokens into whole keyphrases and return the unique ones
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return np.unique([result.get("word").strip() for result in results])
Load the pipeline
# Load pipeline
model_name = "ml6team/keyphrase-extraction-kbir-openkp"
extractor = KeyphraseExtractionPipeline(model=model_name)
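If a GPU is available, extra keyword arguments such as device are passed straight through the wrapper's **kwargs to the underlying TokenClassificationPipeline. The device index below is only an example:
# Optional: place the pipeline on GPU 0; **kwargs are forwarded to the base pipeline
extractor = KeyphraseExtractionPipeline(model=model_name, device=0)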
Run inference
# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time.
Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")
keyphrases = extractor(text)
print(keyphrases)
Output
# Output
['keyphrase extraction' 'text analysis']
✨ Key Features
- Uses KBIR as the base model and fine-tunes it on the OpenKP dataset.
- KBIR is pre-trained with a multi-task learning setup that optimizes a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI) and Keyphrase Replacement Classification (KRC).
- Frames keyphrase extraction as a token classification problem: every word in the document is classified as being part of a keyphrase or not.
📦 Installation
This model is based on Python and the transformers library. You can install the required dependencies with the following command:
pip install transformers datasets numpy
💻 Usage Examples
Basic Usage
Basic usage is identical to the Quick Start example shown above.
Advanced Usage
If you do not use the pipeline function, you have to manually filter out the tokens tagged as B and I, merge them into keyphrases, and finally strip the unnecessary whitespace.
# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Define post_process functions
def concat_tokens_by_tag(keyphrases):
    # Group token ids into keyphrases: a "B" starts a new phrase, an "I" extends the last one
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens

def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only the tokens predicted as "B" or "I", then decode them back to text
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
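The card itself does not show how to obtain the predictions passed to extract_keyphrases. The following is only an illustrative sketch of how the helpers above could be wired to a manually loaded model; it reuses the text variable from the Quick Start example, and variable names such as encoded and predictions are placeholders introduced here, not part of the original card.
# Illustrative only: run the model manually and feed its predictions to extract_keyphrases
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "ml6team/keyphrase-extraction-kbir-openkp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

encoded = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**encoded).logits          # shape: (1, sequence_length, num_labels)
predictions = logits.argmax(dim=-1).tolist()  # one list of label ids per sequence

print(extract_keyphrases(encoded, predictions, tokenizer))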
📚 Detailed Documentation
📓 Model Description
This model uses KBIR as its base model and fine-tunes it on the OpenKP dataset. KBIR, which stands for Keyphrase Boundary Infilling with Replacement, is a pre-trained model that utilizes a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI) and Keyphrase Replacement Classification (KRC).
You can find more information about the architecture in this paper.
Keyphrase extraction models are fine-tuned transformer models that treat keyphrase extraction as a token classification problem, where each word in the document is classified as being part of a keyphrase or not.
Label | Description |
---|---|
B-KEY | At the beginning of a keyphrase |
I-KEY | Inside a keyphrase |
O | Outside a keyphrase |
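As a concrete illustration of the tagging scheme (this example is constructed for this card, not taken from the dataset), a tagged sentence looks like this:
# Word-level BIO tags for an example sentence
words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
tags  = ["B",          "I",          "O",  "O", "O",         "O",  "B",    "I"]
# -> extracted keyphrases: "Keyphrase extraction", "text analysis"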
👷‍♂️ Training Procedure
Training Parameters
Parameter | Value |
---|---|
Learning Rate | 1e-4 |
Epochs | 50 |
Early Stopping Patience | 3 |
Preprocessing
The documents in the dataset are already preprocessed into lists of words with their corresponding labels. The only thing left to do is tokenize them and realign the labels so that they correspond to the correct subword tokens.
from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/openkp"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []
        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled "O"
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # The first subword of a new word keeps the original word label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Remaining subwords of a "B" word become "I"; "I"/"O" stay unchanged
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )
        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
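The actual training script is not part of this model card. The sketch below only illustrates how the hyperparameters from the table above could be plugged into the Hugging Face Trainer; the output directory, batch size, and validation split name are assumptions made for this sketch.
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForTokenClassification.from_pretrained("bloomberg/KBIR", num_labels=3)

training_args = TrainingArguments(
    output_dir="kbir-openkp",          # assumption: any local path works
    learning_rate=1e-4,                # from the table above
    num_train_epochs=50,               # from the table above
    per_device_train_batch_size=8,     # assumption, not stated in the card
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],  # assumption: split name
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # from the table above
)
trainer.train()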
Postprocessing (without the pipeline function)
If you do not use the pipeline function, you must filter out the tokens tagged as B and I, merge consecutive B and I tokens into keyphrases, and strip the unnecessary whitespace. The code is identical to the Advanced Usage example above.
📚 Training Dataset
OpenKP is a large-scale, open-domain keyphrase extraction dataset with 148,124 real-world web documents, each annotated with the 1-3 most relevant human-annotated keyphrases.
You can find more information in this paper.
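To get a feel for the data, a sample can be inspected directly. The column names match those used in the preprocessing code above; the split name is an assumption.
from datasets import load_dataset

# Load the raw OpenKP subset and look at one annotated document
dataset = load_dataset("midas/openkp", "raw")
sample = dataset["train"][0]          # "train" split name is an assumption
print(sample["document"][:15])        # first 15 words of the document
print(sample["doc_bio_tags"][:15])    # the corresponding B/I/O tags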
📝 Evaluation Results
Traditional evaluation approaches are precision, recall and F1-score @k,m, where k denotes the first k predicted keyphrases and m denotes the average number of predicted keyphrases.
The model achieves the following results on the OpenKP test set:
Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M |
---|---|---|---|---|---|---|---|---|---|
OpenKP Test Set | 0.13 | 0.38 | 0.19 | 0.07 | 0.38 | 0.11 | 0.45 | 0.38 | 0.39 |
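As a simplified illustration of how the @k metrics above are computed for a single document (the official OpenKP evaluation script may normalize phrases differently):
# Illustrative computation of precision, recall and F1 @k for one document
def precision_recall_f1_at_k(predicted, gold, k):
    top_k = predicted[:k]
    correct = len(set(top_k) & set(gold))
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1_at_k(
    ["keyphrase extraction", "text analysis", "deep learning"],  # predicted
    ["keyphrase extraction", "text analysis"],                   # gold
    k=5,
))  # -> (0.666..., 1.0, 0.8)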
🔧 Technical Details
This model is based on the KBIR pre-trained model, which optimizes a combined loss through multi-task learning. During fine-tuning, specific preprocessing and postprocessing steps ensure that the model can extract keyphrases accurately: the preprocessing stage tokenizes the documents and realigns the labels to subword tokens, while the postprocessing stage merges the tagged tokens back into keyphrases.
📄 License
This project is licensed under the MIT license.
🚨 Issues
If you have any questions, feel free to start a discussion in the Community tab.








