🚀 🔑 Keyphrase Extraction Model: KBIR-OpenKP
Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text very quickly and easily without reading it completely. Originally, keyphrase extraction was done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents, this process can take a lot of time ⏳.
Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. Now, with deep learning, it is possible to capture the semantic meaning of a text even better than with these classical methods. Classical methods look at the frequency, occurrence and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies and the context of words in a text.
🚀 Quick Start
Import the dependencies
from transformers import (
TokenClassificationPipeline,
AutoModelForTokenClassification,
AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np
Define the keyphrase extraction pipeline
# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        # Merge sub-tokens into whole keyphrases and return the unique ones
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return np.unique([result.get("word").strip() for result in results])
Load the pipeline
# Load pipeline
model_name = "ml6team/keyphrase-extraction-kbir-openkp"
extractor = KeyphraseExtractionPipeline(model=model_name)
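If a GPU is available, extra keyword arguments such as device are passed straight through the wrapper's **kwargs to the underlying TokenClassificationPipeline. The device index below is only an example:
# Optional: place the pipeline on GPU 0; **kwargs are forwarded to the base pipeline
extractor = KeyphraseExtractionPipeline(model=model_name, device=0)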
Run inference
# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time.
Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")
keyphrases = extractor(text)
print(keyphrases)
Output
# Output
['keyphrase extraction' 'text analysis']
✨ Key Features
- Uses KBIR as the base model and fine-tunes it on the OpenKP dataset.
- KBIR is pre-trained with a multi-task learning setup that optimizes a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI) and Keyphrase Replacement Classification (KRC).
- Frames keyphrase extraction as a token classification problem: every word in the document is classified as being part of a keyphrase or not.
📦 Installation
This model is based on Python and the transformers library. You can install the required dependencies with the following command:
pip install transformers datasets numpy
💻 Usage Examples
Basic Usage
Basic usage is identical to the Quick Start example shown above.
Advanced Usage
If you do not use the pipeline function, you have to manually filter out the tokens tagged as B and I, merge them into keyphrases, and finally strip the unnecessary whitespace.
# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Define post_process functions
def concat_tokens_by_tag(keyphrases):
    # Group token ids into keyphrases: a "B" starts a new phrase, an "I" extends the last one
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens

def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only the tokens predicted as "B" or "I", then decode them back to text
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]

    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
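The card itself does not show how to obtain the predictions passed to extract_keyphrases. The following is only an illustrative sketch of how the helpers above could be wired to a manually loaded model; it reuses the text variable from the Quick Start example, and variable names such as encoded and predictions are placeholders introduced here, not part of the original card.
# Illustrative only: run the model manually and feed its predictions to extract_keyphrases
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "ml6team/keyphrase-extraction-kbir-openkp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

encoded = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**encoded).logits          # shape: (1, sequence_length, num_labels)
predictions = logits.argmax(dim=-1).tolist()  # one list of label ids per sequence

print(extract_keyphrases(encoded, predictions, tokenizer))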
📚 Detailed Documentation
📓 Model Description
This model uses KBIR as its base model and fine-tunes it on the OpenKP dataset. KBIR, which stands for Keyphrase Boundary Infilling with Replacement, is a pre-trained model that utilizes a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI) and Keyphrase Replacement Classification (KRC).
You can find more information about the architecture in this paper.
Keyphrase extraction models are fine-tuned transformer models that treat keyphrase extraction as a token classification problem, where each word in the document is classified as being part of a keyphrase or not.
Label | Description |
---|---|
B-KEY | At the beginning of a keyphrase |
I-KEY | Inside a keyphrase |
O | Outside a keyphrase |
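As a concrete illustration of the tagging scheme (this example is constructed for this card, not taken from the dataset), a tagged sentence looks like this:
# Word-level BIO tags for an example sentence
words = ["Keyphrase", "extraction", "is", "a", "technique", "in", "text", "analysis"]
tags  = ["B",          "I",          "O",  "O", "O",         "O",  "B",    "I"]
# -> extracted keyphrases: "Keyphrase extraction", "text analysis"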
👷‍♂️ Training Procedure
Training Parameters
Parameter | Value |
---|---|
Learning Rate | 1e-4 |
Epochs | 50 |
Early Stopping Patience | 3 |
Preprocessing
The documents in the dataset are already preprocessed into lists of words with their corresponding labels. The only thing left to do is tokenize them and realign the labels so that they correspond to the correct subword tokens.
from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/openkp"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []
        for wid in word_ids_list:
            if wid is None:
                # Special and padding tokens are labeled "O"
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # The first subword of a new word keeps the original word label
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Remaining subwords of a "B" word become "I"; "I"/"O" stay unchanged
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )
        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
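The actual training script is not part of this model card. The sketch below only illustrates how the hyperparameters from the table above could be plugged into the Hugging Face Trainer; the output directory, batch size, and validation split name are assumptions made for this sketch.
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForTokenClassification.from_pretrained("bloomberg/KBIR", num_labels=3)

training_args = TrainingArguments(
    output_dir="kbir-openkp",          # assumption: any local path works
    learning_rate=1e-4,                # from the table above
    num_train_epochs=50,               # from the table above
    per_device_train_batch_size=8,     # assumption, not stated in the card
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],  # assumption: split name
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # from the table above
)
trainer.train()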
Postprocessing (without the pipeline function)
If you do not use the pipeline function, you must filter out the tokens tagged as B and I, merge consecutive B and I tokens into keyphrases, and strip the unnecessary whitespace. The code is identical to the Advanced Usage example above.
📚 Training Dataset
OpenKP is a large-scale, open-domain keyphrase extraction dataset with 148,124 real-world web documents, each annotated with the 1-3 most relevant human-annotated keyphrases.
You can find more information in this paper.
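To get a feel for the data, a sample can be inspected directly. The column names match those used in the preprocessing code above; the split name is an assumption.
from datasets import load_dataset

# Load the raw OpenKP subset and look at one annotated document
dataset = load_dataset("midas/openkp", "raw")
sample = dataset["train"][0]          # "train" split name is an assumption
print(sample["document"][:15])        # first 15 words of the document
print(sample["doc_bio_tags"][:15])    # the corresponding B/I/O tags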
📝 Evaluation Results
Traditional evaluation approaches are precision, recall and F1-score @k,m, where k denotes the first k predicted keyphrases and m denotes the average number of predicted keyphrases.
The model achieves the following results on the OpenKP test set:
Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M |
---|---|---|---|---|---|---|---|---|---|
OpenKP Test Set | 0.13 | 0.38 | 0.19 | 0.07 | 0.38 | 0.11 | 0.45 | 0.38 | 0.39 |
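As a simplified illustration of how the @k metrics above are computed for a single document (the official OpenKP evaluation script may normalize phrases differently):
# Illustrative computation of precision, recall and F1 @k for one document
def precision_recall_f1_at_k(predicted, gold, k):
    top_k = predicted[:k]
    correct = len(set(top_k) & set(gold))
    precision = correct / len(top_k) if top_k else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1_at_k(
    ["keyphrase extraction", "text analysis", "deep learning"],  # predicted
    ["keyphrase extraction", "text analysis"],                   # gold
    k=5,
))  # -> (0.666..., 1.0, 0.8)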
🔧 Technical Details
This model is based on the KBIR pre-trained model, which optimizes a combined loss through multi-task learning. During fine-tuning, specific preprocessing and postprocessing steps ensure that the model can extract keyphrases accurately: the preprocessing stage tokenizes the documents and realigns the labels to subword tokens, while the postprocessing stage merges the tagged tokens back into keyphrases.
📄 License
This project is licensed under the MIT license.
🚨 Issues
If you have any questions, feel free to start a discussion in the Community tab.








