🚀 Cuckoo
Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models. Instead of retrieving from the vocabulary, it predicts the next tokens by tagging them in the given input context. This lets Cuckoo leverage any text resource to enhance itself, in particular by taking a free ride on data curated for LLMs.
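To make the tagging paradigm concrete, here is a minimal illustration in plain Python (no model involved; the helper `bio_labels` is hypothetical and for exposition only): the "next tokens" an LLM would generate are not drawn from the vocabulary but marked inside the given context with B/I/O labels.

```python
# Hypothetical illustration of next tokens extraction (NTE) as tagging.
# The "next tokens" after a prompt are not generated from a vocabulary;
# they are marked inside the given context with B/I/O labels.
context = ["Tom", "and", "Jack", "went", "to", "their", "trip", "in", "Paris", "."]
answer = ["Paris"]  # what an LLM would generate for "What is the city mentioned here?"

def bio_labels(tokens, span):
    """Label tokens: B = span start, I = span continuation, O = outside."""
    labels = ["O"] * len(tokens)
    for start in range(len(tokens) - len(span) + 1):
        if tokens[start:start + len(span)] == span:
            labels[start] = "B"
            for k in range(start + 1, start + len(span)):
                labels[k] = "I"
    return labels

print(bio_labels(context, answer))
# ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'O']
```

A token classifier trained on such labels can therefore "predict the next tokens" for any prompt whose answer appears in the context.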
🚀 Quick Start
Experience Next Tokens Extraction
We recommend the strongest Super Rainbow Cuckoo (Cuckoo-C4-Super-Rainbow) for zero-shot extraction.
1️⃣ First, load the model and the tokenizer

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

nlp = spacy.load("en_core_web_sm")
device = torch.device("cuda:0")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)
```
2️⃣ Define the next tokens extraction function

```python
def next_tokens_extraction(text):
    def find_sequences(lst):
        # Collect (start, end) spans: label 0 starts a span, label 1 continues it.
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end + 1))
            else:
                i += 1
        return sequences

    text = " ".join([token.text for token in nlp(text)])
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tag_predictions = tagger(**inputs).logits[0].argmax(-1)
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip()
                   for seq in find_sequences(tag_predictions)]
    return predictions
```
3️⃣ Call the function to extract!
Basic Usage

```python
text = "Tom and Jack went to their trip in Paris."

for question in [
    "What is the person mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Where does George live in?",
]:
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)
```
You will get results like:

```
What is the person mentioned here? ['Tom', 'Jack']
What is the city mentioned here? ['Paris']
Who goes with Tom together? ['Jack']
What do Tom and Jack go to Paris for? ['trip']
Where does George live in? []
```

where `[]` indicates that Cuckoo judges there are no next tokens to extract.
Advanced Usage

```python
passage = '''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)
```
You will get results like:

```
What are the people mentioned here? ['Ludwig van Beethoven', 'Joseph Haydn', 'Wolfgang Amadeus Mozart']
What is the job of Beethoven? ['composer and pianist']
How famous is Beethoven? ['one of the most revered figures in the history of Western music']
When did Beethoven's middle period showed an individual development? ['1802']
```
Few-shot Adaptation
Cuckoo excels at few-shot adaptation to its own tasks. Some examples:

- Named entity recognition (CoNLL2003): run `bash run_downstream.sh conll2003.5shot KomeijiForce/Cuckoo-C4-Rainbow`, and you will get a fine-tuned model in `models/cuckoo-conll2003.5shot`. Benchmark it with `python eval_conll2003.py`, which reaches an F1 of roughly 80.
- Machine reading comprehension (SQuAD): run `bash run_downstream.sh squad.32shot KomeijiForce/Cuckoo-C4-Rainbow`, and you will get a fine-tuned model in `models/cuckoo-squad.32shot`. Benchmark it with `python eval_squad.py`, which reaches an F1 of roughly 88.

To fine-tune on your own task, create a JSON Lines file in which each line contains `{"words": [...], "ner": [...]}`, for example:

```json
{"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}
```

which indicates "John Smith" being predicted as the next tokens.
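One way to build such a line from a tokenized context, a prompt suffix, and an answer span is sketched below (the helper name `to_nte_instance` is my own, not part of the repo, and the real data may tag answer spans differently, e.g. every occurrence rather than only the first):

```python
import json

def to_nte_instance(context_words, prompt_words, answer_words):
    """Build one NTE training line: tag the answer span inside the context
    with B/I and leave everything else (including the prompt suffix) as O."""
    words = context_words + prompt_words
    ner = ["O"] * len(words)
    n = len(answer_words)
    for start in range(len(context_words) - n + 1):
        if context_words[start:start + n] == answer_words:
            ner[start] = "B"
            for k in range(start + 1, start + n):
                ner[k] = "I"
            break  # tag the first occurrence only
    return json.dumps({"words": words, "ner": ner})

line = to_nte_instance(["I", "am", "John", "Smith", "."],
                       ["Person", ":"],
                       ["John", "Smith"])
print(line)
# {"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}
```

Writing one such line per training example yields a JSON Lines file ready for `run_downstream.sh`.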
You can refer to the following prompts to start building your own downstream datasets:
| Type | User Input | Assistant Response |
|---|---|---|
| Entity | User: [Context] Question: What is the [Label] mentioned here? | Assistant: Answer: The [Label] is |
| Relation (Kill) | User: [Context] Question: Who does [Entity] kill? | Assistant: Answer: [Entity] kills |
| Relation (Live) | User: [Context] Question: Where does [Entity] live in? | Assistant: Answer: [Entity] lives in |
| Relation (Work) | User: [Context] Question: Who does [Entity] work for? | Assistant: Answer: [Entity] works for |
| Relation (Located) | User: [Context] Question: Where is [Entity] located in? | Assistant: Answer: [Entity] is located in |
| Relation (Based) | User: [Context] Question: Where is [Entity] based in? | Assistant: Answer: [Entity] is based in |
| Relation (Adverse) | User: [Context] Question: What is the adverse effect of [Entity]? | Assistant: Answer: The adverse effect of [Entity] is |
| Query | User: [Context] Question: [Question] | Assistant: Answer: |
| Instruction (Entity) | User: [Context] Question: What is the [Label] mentioned here? ([Instruction]) | Assistant: Answer: The [Label] is |
| Instruction (Query) | User: [Context] Question: [Question] ([Instruction]) | Assistant: Answer: |
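The templates above can be assembled programmatically. A minimal sketch, assuming the English phrasings shown in the table (the function names are illustrative, not part of the repo):

```python
def entity_prompt(context, label):
    """'Entity' row of the template table (label is e.g. 'person' or 'city')."""
    return (f"User:\n\n{context}\n\nQuestion: What is the {label} mentioned here?"
            f"\n\nAssistant: Answer: The {label} is")

def query_prompt(context, question):
    """'Query' row of the template table."""
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant: Answer:"

print(entity_prompt("Tom and Jack went to their trip in Paris.", "city"))
```

Note that the assistant prefix ("Answer: The [Label] is") is part of the prompt: it steers the tagger toward the span that would follow it.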
After building your own downstream dataset, save it to `my_downstream.json`, then run `bash run_downstream.sh my_downstream KomeijiForce/Cuckoo-C4-Rainbow`. You will find the adapted Cuckoo in `models/cuckoo-my_downstream`.
Train Your Own Cuckoo
We provide the script for converting texts into NTE instances in `nte_data_collection.py`, taking C4 as an example; the converted results can be checked in `cuckoo.c4.example.json`. The script is easy to adapt to other resources such as entities, queries, and questions, so you can modify your own data into the NTE format to train your own Cuckoo. Run the `run_cuckoo.sh` script to try an example pre-training:
```shell
python run_ner.py \
  --model_name_or_path roberta-large \
  --train_file cuckoo.c4.example.json \
  --output_dir models/cuckoo-c4-example \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --save_steps 1000 \
  --learning_rate 0.00001 \
  --do_train \
  --overwrite_output_dir
```
You will get an example Cuckoo model in `models/cuckoo-c4-example`. Note that the model may not perform well when pre-trained on too little data; you can adjust the hyperparameters in `nte_data_collection.py` or modify the conversion logic for better pre-training performance.
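The conversion idea behind `nte_data_collection.py` can be sketched as follows (a simplified, hypothetical version; the real script applies additional filtering and scales to C4): slide over a sentence, treat the upcoming span as the next tokens, and emit an instance whenever that span already occurs in the preceding context:

```python
def nte_from_text(words, span_len=1):
    """Generate NTE instances from raw text: for each position, if the
    upcoming span of span_len tokens also occurs earlier in the context,
    emit an instance whose ner labels mark that earlier occurrence (B/I)."""
    instances = []
    for i in range(span_len, len(words) - span_len + 1):
        context, span = words[:i], words[i:i + span_len]
        for s in range(len(context) - span_len + 1):
            if context[s:s + span_len] == span:
                ner = ["O"] * len(context)
                ner[s] = "B"
                for k in range(s + 1, s + span_len):
                    ner[k] = "I"
                instances.append({"words": context, "ner": ner})
                break
    return instances

for inst in nte_from_text("the cat sat on the mat".split()):
    print(inst)
# {'words': ['the', 'cat', 'sat', 'on'], 'ner': ['B', 'O', 'O', 'O']}
```

Here "the" (the token after "on") already appears at position 0, so that occurrence is tagged as the extractable next token.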
✨ Key Features
- A novel IE paradigm: Cuckoo imitates the next-token prediction paradigm of large language models, predicting the next tokens by tagging them in the given input context, which differs substantially from traditional IE pre-training.
- Efficient data utilization: it can leverage any text resource to enhance itself, in particular by learning efficiently from data curated for LLMs.
- Versatile adaptation: it supports both zero-shot extraction and few-shot adaptation, adapting quickly to different IE tasks.
📦 Installation
The documentation does not provide specific installation steps.
💻 Usage Examples
Detailed usage examples are covered in the Quick Start section above.
📚 Documentation
Performance
Cuckoo performs well on a variety of IE tasks. Below is a comparison with other models:
| Model | CoNLL2003 | BioNLP2004 | MIT-Restaurant | MIT-Movie | Avg. | CoNLL2004 | ADE | Avg. | SQuAD | SQuAD-V2 | DROP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OPT-C4-TuluV3 | 50.24 | 39.76 | 58.91 | 56.33 | 50.56 | 47.14 | 45.66 | 46.40 | 39.80 | 53.81 | 31.00 | 41.54 |
| RoBERTa | 33.75 | 32.91 | 62.15 | 58.32 | 46.80 | 34.16 | 2.15 | 18.15 | 31.86 | 48.55 | 9.16 | 29.86 |
| MRQA | 72.45 | 55.93 | 68.68 | 66.26 | 65.83 | 66.23 | 67.44 | 66.84 | 80.07 | 66.22 | 54.46 | 66.92 |
| MultiNERD | 66.78 | 54.62 | 64.16 | 66.30 | 60.59 | 57.52 | 45.10 | 51.31 | 42.85 | 50.99 | 30.12 | 41.32 |
| NuNER | 74.15 | 56.36 | 68.57 | 64.88 | 65.99 | 65.12 | 63.71 | 64.42 | 61.60 | 52.67 | 37.37 | 50.55 |
| MetaIE | 71.33 | 55.63 | 70.08 | 65.23 | 65.57 | 64.81 | 64.40 | 64.61 | 74.59 | 62.54 | 30.73 | 55.95 |
| Cuckoo 🐦🛠️ | 73.60 | 57.00 | 67.63 | 67.12 | 66.34 | 69.57 | 71.70 | 70.63 | 77.47 | 64.06 | 54.25 | 65.26 |
| └─ Only Pre-train 🐦 | 72.46 | 55.87 | 66.87 | 67.23 | 65.61 | 68.14 | 69.39 | 68.77 | 75.64 | 63.36 | 52.81 | 63.94 |
| └─ Only Post-train | 72.80 | 56.10 | 66.02 | 67.10 | 65.51 | 68.66 | 69.75 | 69.21 | 77.05 | 62.39 | 54.80 | 64.75 |
| Rainbow Cuckoo 🌈🐦🛠️ | 79.94 | 58.39 | 70.30 | 67.00 | 68.91 | 70.47 | 76.05 | 73.26 | 86.57 | 69.41 | 64.64 | 73.54 |
Available Pre-trained Checkpoints
The currently open-sourced Cuckoo checkpoints are pre-trained on:
- 100M next tokens extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
- Cuckoo-C4 + 2.6M NTE instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
- Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD, DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
- Cuckoo-C4-Rainbow + multiple NER datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)
🔧 Technical Details
The documentation does not provide further technical details.
📄 License
This project is licensed under the Apache-2.0 license.
🐾 引用
@article{DBLP:journals/corr/abs-2502-11275,
author = {Letian Peng and
Zilong Wang and
Feng Yao and
Jingbo Shang},
title = {Cuckoo: An {IE} Free Rider Hatched by Massive Nutrition in {LLM}'s Nest},
journal = {CoRR},
volume = {abs/2502.11275},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2502.11275},
doi = {10.48550/arXiv.2502.11275},
eprinttype = {arXiv},
eprint = {2502.11275},
timestamp = {Mon, 17 Feb 2025 19:32:20 +0000},
biburl = {https://dblp.org/rec/journals/corr/abs-2502-11275.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}



