# 🚀 Cuckoo 🐦

Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models (LLMs). Instead of retrieving from a vocabulary, Cuckoo predicts the next tokens by tagging them in the given input context. The model can leverage any text resource to improve itself, and in particular can ride on data curated for LLMs.
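As a rough illustration (ours, not from the original card; it uses the B/I/O tagging scheme that appears in the fine-tuning section below), next-token extraction turns generation into in-context tagging:

```python
# Illustration only: an LLM would generate the answer "Tom" token by token from its
# vocabulary; Cuckoo instead tags the span "Tom" inside the context it is given
# (B = span start, I = inside the span, O = outside).
context = ["Tom", "and", "Jack", "went", "to", "Paris", ".", "Person", ":"]
tags    = ["B",   "O",   "O",    "O",    "O",  "O",     "O", "O",      "O"]
```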
## 🚀 Quick Start

We currently open-source Cuckoo checkpoints pre-trained on different datasets:

- 100M next-token extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
- Cuckoo-C4 + 2.6M NTE instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
- Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, and MRQA (excluding SQuAD and DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
- Cuckoo-C4-Rainbow + multiple named entity recognition (NER) datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, and SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)
## ✨ Key Features

- Imitative prediction paradigm: mimics the next-token prediction paradigm of LLMs, predicting by tagging the next tokens within the input context.
- Efficient data utilization: can enhance itself with any text resource, especially data curated for LLMs.
- Multi-scenario adaptability: performs well on a wide range of IE tasks and supports few-shot adaptation.
## 📊 Performance

Cuckoo's performance on different benchmarks, compared with other models (the columns group into NER, relation extraction, and machine reading comprehension, each with its own average):
| | CoNLL2003 | BioNLP2004 | MIT-Restaurant | MIT-Movie | Avg. | CoNLL2004 | ADE | Avg. | SQuAD | SQuAD-V2 | DROP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OPT-C4-TuluV3 | 50.24 | 39.76 | 58.91 | 56.33 | 50.56 | 47.14 | 45.66 | 46.40 | 39.80 | 53.81 | 31.00 | 41.54 |
| RoBERTa | 33.75 | 32.91 | 62.15 | 58.32 | 46.80 | 34.16 | 2.15 | 18.15 | 31.86 | 48.55 | 9.16 | 29.86 |
| MRQA | 72.45 | 55.93 | 68.68 | 66.26 | 65.83 | 66.23 | 67.44 | 66.84 | 80.07 | 66.22 | 54.46 | 66.92 |
| MultiNERD | 66.78 | 54.62 | 64.16 | 66.30 | 60.59 | 57.52 | 45.10 | 51.31 | 42.85 | 50.99 | 30.12 | 41.32 |
| NuNER | 74.15 | 56.36 | 68.57 | 64.88 | 65.99 | 65.12 | 63.71 | 64.42 | 61.60 | 52.67 | 37.37 | 50.55 |
| MetaIE | 71.33 | 55.63 | 70.08 | 65.23 | 65.57 | 64.81 | 64.40 | 64.61 | 74.59 | 62.54 | 30.73 | 55.95 |
| Cuckoo 🐦🛠️ | 73.60 | 57.00 | 67.63 | 67.12 | 66.34 | 69.57 | 71.70 | 70.63 | 77.47 | 64.06 | 54.25 | 65.26 |
| └─ Only Pre-train 🐦 | 72.46 | 55.87 | 66.87 | 67.23 | 65.61 | 68.14 | 69.39 | 68.77 | 75.64 | 63.36 | 52.81 | 63.94 |
| └─ Only Post-train | 72.80 | 56.10 | 66.02 | 67.10 | 65.51 | 68.66 | 69.75 | 69.21 | 77.05 | 62.39 | 54.80 | 64.75 |
| Rainbow Cuckoo 🌈🐦🛠️ | 79.94 | 58.39 | 70.30 | 67.00 | 68.91 | 70.47 | 76.05 | 73.26 | 86.57 | 69.41 | 64.64 | 73.54 |
## 💻 Usage Examples

### Basic Usage

We recommend the strongest Cuckoo-C4-Super-Rainbow model for zero-shot extraction. The steps are as follows:
- Load the model and tokenizer:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

# spaCy is used only to tokenize the input text before tagging
nlp = spacy.load("en_core_web_sm")
device = torch.device("cuda:0")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)
```
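If you are running without a GPU, a minimal fallback (our assumption, not part of the original snippet) is to pick the device conditionally:

```python
# Hedged sketch: fall back to CPU when CUDA is unavailable.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```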
- Define the next tokens extraction function:

```python
def next_tokens_extraction(text):
    # Decode label ids back into (start, end) spans: a B tag (id 0) opens a span
    # and consecutive I tags (id 1) extend it.
    def find_sequences(lst):
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end + 1))
            else:
                i += 1
        return sequences

    # Normalize the text with spaCy tokenization, tag every token, then decode the predicted spans
    text = " ".join([token.text for token in nlp(text)])
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tag_predictions = tagger(**inputs).logits[0].argmax(-1)
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]
    return predictions
```
- Call the function for extraction. Below are examples for different scenarios:

- Basic entity and relation understanding:
```python
text = "Tom and Jack went to their trip in Paris."
for question in [
    "What is the person mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Where does George live in?",
]:
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)
```

For questions whose answer is absent from the context (such as the last one about George), no B tag is predicted and the function returns an empty list.
- Longer context processing:
```python
passage = '''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''
for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)
```
- Knowledge quiz:
```python
for obj in ["grass", "sea", "fire", "night"]:
    text = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(text)
    print(obj, predictions)
```
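The entity and long-context examples above wrap their inputs in the same chat-style template (the quiz example appends an extra `Answer:` line). A tiny helper (hypothetical, not part of the original card) makes that pattern explicit:

```python
# Hypothetical helper: the examples above share one "User ... Question ... Assistant" template.
def build_prompt(context, question):
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"

print(next_tokens_extraction(build_prompt("Tom and Jack went to their trip in Paris.", "What is the city mentioned here?")))
```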
## 🎯 Few-shot Adaptation

Cuckoo adapts well to downstream tasks with only a few shots. Some examples:
- Named entity recognition (CoNLL2003, 5-shot): run `bash run_downstream.sh conll2003.5shot KomeijiForce/Cuckoo-C4-Rainbow`, which produces a fine-tuned model in `models/cuckoo-conll2003.5shot`. You can then benchmark it with `python eval_conll2003.py`; the F1 score should be around 80.
- Machine reading comprehension (SQuAD, 32-shot): run `bash run_downstream.sh squad.32shot KomeijiForce/Cuckoo-C4-Rainbow`, which produces a fine-tuned model in `models/cuckoo-squad.32shot`. You can then benchmark it with `python eval_squad.py`; the F1 score should be around 88.
To fine-tune on your own task, create a JSON Lines file where each line contains `{"words": [...], "ner": [...]}`, for example:

```json
{"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}
```
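As a minimal sketch (the helper below is ours, not part of the repository), such a line can be built programmatically once you know where the answer span starts and how many tokens it covers:

```python
# Hedged sketch: build one NTE training line from pre-tokenized words plus an answer span.
import json

def make_nte_line(words, answer_start, answer_len):
    ner = ["O"] * len(words)
    ner[answer_start] = "B"                      # first token of the target span
    for k in range(answer_start + 1, answer_start + answer_len):
        ner[k] = "I"                             # continuation tokens
    return json.dumps({"words": words, "ner": ner})

# Reproduces the example line above: "John Smith" answers the "Person:" prompt.
print(make_nte_line(["I", "am", "John", "Smith", ".", "Person", ":"], 2, 2))
```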
After creating your own downstream dataset, save it as `my_downstream.json` and run `bash run_downstream.sh my_downstream KomeijiForce/Cuckoo-C4-Rainbow`. You will find the adapted Cuckoo model in `models/cuckoo-my_downstream`.
## 🪽 Train Your Own Cuckoo

The script `nte_data_collection.py` converts raw text into NTE instances, using C4 as an example; the converted result can be inspected in `cuckoo.c4.example.json`. The script is easy to adapt to other resources, such as entities, queries, and questions, so you can transform your own data into the NTE format to train your own Cuckoo. Run the `run_cuckoo.sh` script to try an example pre-training:
```bash
python run_ner.py \
  --model_name_or_path roberta-large \
  --train_file cuckoo.c4.example.json \
  --output_dir models/cuckoo-c4-example \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --save_steps 1000 \
  --learning_rate 0.00001 \
  --do_train \
  --overwrite_output_dir
```
You will get an example Cuckoo model in `models/cuckoo-c4-example` (with the settings above, the effective batch size is 4 × 16 = 64 per device). If the pre-training data is too small, the model may not perform well. You can adjust the hyperparameters in `nte_data_collection.py` or modify the conversion logic to fit your own resources for better pre-training performance.
## 🐾 Citation

```bibtex
@article{DBLP:journals/corr/abs-2502-11275,
  author       = {Letian Peng and
                  Zilong Wang and
                  Feng Yao and
                  Jingbo Shang},
  title        = {Cuckoo: An {IE} Free Rider Hatched by Massive Nutrition in {LLM}'s Nest},
  journal      = {CoRR},
  volume       = {abs/2502.11275},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2502.11275},
  doi          = {10.48550/arXiv.2502.11275},
  eprinttype   = {arXiv},
  eprint       = {2502.11275},
  timestamp    = {Mon, 17 Feb 2025 19:32:20 +0000},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2502-11275.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
```
## 📄 License

This project is licensed under the Apache 2.0 License.