🚀 🔑 Keyphrase Extraction Model: KBIR - OpenKP
Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text very quickly and easily without reading it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents, this process can take a lot of time ⏳.

Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods that use statistical and linguistic features are widely used for the extraction process. Now, with deep learning, it is possible to capture the semantic meaning of a text even better than with these classical methods. Classical methods look at the frequency, occurrence, and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies and the context of words in a text.
🚀 Quick Start
Import the dependencies
```python
from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np
```
Define the keyphrase extraction pipeline
```python
# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, all_outputs):
        results = super().postprocess(
            all_outputs=all_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return np.unique([result.get("word").strip() for result in results])
```
Load the pipeline
```python
# Load pipeline
model_name = "ml6team/keyphrase-extraction-kbir-openkp"
extractor = KeyphraseExtractionPipeline(model=model_name)
```
Run inference
```python
# Inference
text = """
Keyphrase extraction is a technique in text analysis where you extract the
important keyphrases from a document. Thanks to these keyphrases humans can
understand the content of a text very quickly and easily without reading it
completely. Keyphrase extraction was first done primarily by human annotators,
who read the text in detail and then wrote down the most important keyphrases.
The disadvantage is that if you work with a lot of documents, this process
can take a lot of time.
Here is where Artificial Intelligence comes in. Currently, classical machine
learning methods, that use statistical and linguistic features, are widely used
for the extraction process. Now with deep learning, it is possible to capture
the semantic meaning of a text even better than these classical methods.
Classical methods look at the frequency, occurrence and order of words
in the text, whereas these neural approaches can capture long-term
semantic dependencies and context of words in a text.
""".replace("\n", " ")

keyphrases = extractor(text)
print(keyphrases)
```
Output
```
# Output
['keyphrase extraction' 'text analysis']
```
✨ Key Features
- Uses KBIR as its base model and fine-tunes it on the OpenKP dataset.
- KBIR utilizes a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).
- Frames keyphrase extraction as a token classification problem: each word in the document is classified as being part of a keyphrase or not.
📦 Installation
This model is built on Python and the transformers library. You can install the required dependencies with the following command:
```
pip install transformers datasets numpy
```
💻 Usage Examples
Basic Usage
The basic usage is identical to the Quick Start example above: define the `KeyphraseExtractionPipeline`, load it with the model name `ml6team/keyphrase-extraction-kbir-openkp`, and call it on your text.
Advanced Usage
If you do not use the pipeline function, you have to filter out the tokens tagged B and I, merge consecutive B and I tokens into keyphrases, and finally strip unnecessary whitespace from the keyphrases.
```python
# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Define post_process functions
def concat_tokens_by_tag(keyphrases):
    # Group token ids: a "B" tag starts a new keyphrase, an "I" tag
    # extends the most recently started one.
    keyphrase_tokens = []
    for id, label in keyphrases:
        if label == "B":
            keyphrase_tokens.append([id])
        elif label == "I":
            if len(keyphrase_tokens) > 0:
                keyphrase_tokens[len(keyphrase_tokens) - 1].append(id)
    return keyphrase_tokens


def extract_keyphrases(example, predictions, tokenizer, index=0):
    # Keep only the tokens predicted as part of a keyphrase (B or I),
    # then decode each group of token ids back into a string.
    keyphrases_list = [
        (id, idx2label[label])
        for id, label in zip(
            np.array(example["input_ids"]).squeeze().tolist(), predictions[index]
        )
        if idx2label[label] in ["B", "I"]
    ]
    processed_keyphrases = concat_tokens_by_tag(keyphrases_list)
    extracted_kps = tokenizer.batch_decode(
        processed_keyphrases,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True,
    )
    return np.unique([kp.strip() for kp in extracted_kps])
```
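A minimal end-to-end sketch of how these functions could be used, reusing the `text` and imports from the Quick Start example; the `torch.no_grad` forward pass and `argmax` decoding are our assumptions rather than documented usage:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "ml6team/keyphrase-extraction-kbir-openkp"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Tokenize, predict one label id per token, then decode with the helpers above.
encoded = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**encoded).logits          # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1).tolist()  # one label id per token
print(extract_keyphrases(encoded, predictions, tokenizer, index=0))
```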
📚 Documentation
📓 Model Description
This model uses KBIR as its base model and fine-tunes it on the OpenKP dataset. KBIR, short for Keyphrase Boundary Infilling with Replacement, is a pre-trained model that utilizes a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI), and Keyphrase Replacement Classification (KRC).

You can find more information about the architecture in this paper.

Keyphrase extraction models are transformer models fine-tuned on a token classification problem: each word in a document is classified as being part of a keyphrase or not.
| Label | Description |
| --- | --- |
| B-KEY | At the beginning of a keyphrase |
| I-KEY | Inside a keyphrase |
| O | Outside a keyphrase |
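As a hypothetical illustration (the sentence and its tags are our own, not taken from the training data), a tagged input might look like this:

```python
# Word-level tags for one sentence: "Keyphrase extraction" and
# "text analysis" are keyphrases, every other word is outside (O).
words = ["Keyphrase", "extraction", "is", "a", "text", "analysis", "technique"]
tags  = ["B-KEY",     "I-KEY",      "O",  "O", "B-KEY", "I-KEY",    "O"]
```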
👷‍♂️ Training Procedure
Training Parameters
| Parameter | Value |
| --- | --- |
| Learning rate | 1e-4 |
| Epochs | 50 |
| Early stopping patience | 3 |
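A minimal training sketch consistent with these parameters, using the `tokenized_dataset` built in the preprocessing step below; the output directory, evaluation strategy, split names, and `num_labels=3` head are assumptions, as the exact training script is not part of this card:

```python
from transformers import (
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback,
)

# Assumed setup: a fresh token classification head on the KBIR base model.
model = AutoModelForTokenClassification.from_pretrained("bloomberg/KBIR", num_labels=3)

training_args = TrainingArguments(
    output_dir="kbir-openkp",      # assumption
    learning_rate=1e-4,            # from the table above
    num_train_epochs=50,           # from the table above
    evaluation_strategy="epoch",   # assumption; `eval_strategy` on newer transformers
    save_strategy="epoch",
    load_best_model_at_end=True,   # required for early stopping
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # from the table
)
trainer.train()
```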
Preprocessing
The documents in the dataset are already preprocessed into lists of words with their corresponding labels. The only thing that still needs to be done is tokenization and realigning the labels so that they correspond to the right subword tokens.
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Labels
label_list = ["B", "I", "O"]
lbl2idx = {"B": 0, "I": 1, "O": 2}
idx2label = {0: "B", 1: "I", 2: "O"}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR")
max_length = 512

# Dataset parameters
dataset_full_name = "midas/openkp"
dataset_subset = "raw"
dataset_document_column = "document"
dataset_biotags_column = "doc_bio_tags"

def preprocess_function(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(
        all_samples_per_split[dataset_document_column],
        padding="max_length",
        truncation=True,
        is_split_into_words=True,
        max_length=max_length,
    )
    total_adjusted_labels = []
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)
        existing_label_ids = all_samples_per_split[dataset_biotags_column][k]
        i = -1
        adjusted_label_ids = []
        for wid in word_ids_list:
            if wid is None:
                # Special tokens (e.g. <s>, </s>, padding) get the "O" label.
                adjusted_label_ids.append(lbl2idx["O"])
            elif wid != prev_wid:
                # First subword of a new word: take the word-level label.
                i = i + 1
                adjusted_label_ids.append(lbl2idx[existing_label_ids[i]])
                prev_wid = wid
            else:
                # Continuation subword: demote "B" to "I", keep "I"/"O" as-is.
                adjusted_label_ids.append(
                    lbl2idx[
                        f"{'I' if existing_label_ids[i] == 'B' else existing_label_ids[i]}"
                    ]
                )
        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

# Load dataset
dataset = load_dataset(dataset_full_name, dataset_subset)

# Preprocess dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)
```
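A toy illustration of what the realignment does, with a made-up single-document batch (the exact label sequence depends on how the tokenizer splits each word):

```python
# Hypothetical mini-batch: one document of three words with word-level tags.
sample = {
    "document": [["keyphrase", "extraction", "works"]],
    "doc_bio_tags": [["B", "I", "O"]],
}
out = preprocess_function(sample)
# Every subword inherits its word's tag; continuation subwords of a
# "B" word are demoted to "I", and special/padding tokens become "O".
print(out["labels"][0][:8])
```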
Postprocessing (Without the Pipeline Function)
If you do not use the pipeline function, you have to filter out the tokens tagged B and I, merge each consecutive B and I into a keyphrase, and finally strip unnecessary whitespace from the keyphrases. The `concat_tokens_by_tag` and `extract_keyphrases` functions shown under Advanced Usage above implement exactly these steps.
📚 Training Dataset
OpenKP is a large-scale, open-domain keyphrase extraction dataset with 148,124 real-world web documents, each annotated with the 1-3 most relevant human-annotated keyphrases.

You can read more about it in this paper.
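A quick way to inspect the raw data; the field contents shown are inferred from the column names used in the preprocessing code above, so treat this as a sketch:

```python
from datasets import load_dataset

dataset = load_dataset("midas/openkp", "raw")
sample = dataset["train"][0]
print(sample["document"][:10])      # first ten words of the web document
print(sample["doc_bio_tags"][:10])  # the matching word-level B/I/O tags
```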
📝 Evaluation Results
Traditional evaluation methods are precision, recall, and F1-score @k,m, where k denotes the top k predicted keyphrases and m denotes the average number of predicted keyphrases.
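To make the definition concrete, here is a small worked example with made-up predictions and gold keyphrases:

```python
# Precision/recall/F1 at k, following the definition above.
def f1_at_k(predicted, gold, k):
    topk = predicted[:k]
    tp = len(set(topk) & set(gold))      # correctly predicted keyphrases
    precision = tp / len(topk) if topk else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

predicted = ["keyphrase extraction", "text analysis", "deep learning", "nlp", "ai"]
gold = ["keyphrase extraction", "text analysis"]
print(f1_at_k(predicted, gold, 5))  # (0.4, 1.0, 0.571...)
```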
The model achieves the following results on the OpenKP test set:
| Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenKP test set | 0.13 | 0.38 | 0.19 | 0.07 | 0.38 | 0.11 | 0.45 | 0.38 | 0.39 |
🔧 Technical Details
This model is built on the KBIR pre-trained model and optimizes a combined loss through multi-task learning. Specific preprocessing and post-processing steps ensure that the model extracts keyphrases accurately: the preprocessing stage tokenizes the documents and realigns the labels with the subword tokens, while the post-processing stage merges the predicted tokens back into keyphrases.
📄 License
This project is licensed under the MIT license.
🚨 Issues
If you have any questions, feel free to start a discussion in the Community tab.