Open-source Cuckoo-C4-Instruct model - extract information as efficiently as a large language model!


Cuckoo C4 Instruct

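Developed by KomeijiForce

Super Rainbow Cuckoo is a small information extraction model built on the next tokens extraction (NTE) paradigm. It extracts information efficiently by imitating the way large language models make predictions.

Question Answering

Transformers

License: MIT #Information Extraction #Few-shot Adaptation #Question Answering

Downloads: 17

Release date: 2/16/2025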

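Model Overview

Cuckoo is a small (300M-parameter) information extraction (IE) model that performs extraction through the next-token prediction paradigm. Instead of retrieving the next tokens from a vocabulary, Cuckoo predicts them by tagging them in the given context, which lets it use a wide range of text resources to improve itself.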

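Model Features

Next tokens extraction paradigm

Imitates the next-token prediction of large language models and extracts information by tagging the next tokens inside the context

Self-improvement

Can use any text resource to improve itself, in particular data prepared for large language models

Efficient adaptation

Adapts quickly to specific tasks and performs well in few-shot settings

Multi-task integration

Integrates datasets from several information extraction tasks, including NER and QA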

Model Capabilities

Named entity recognition

Relation extraction

Question answering

Information extraction

Few-shot learning

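Use Cases

Knowledge extraction

Entity recognition

Identify named entities in text, such as person names and locations

Reaches an F1 of 88.38 on CoNLL2003

Relation extraction

Identify relations between entities, such as place of residence and employer

Question answering

Reading comprehension

Extract the answer to a question from a given text

Reaches an F1 of 89.54 on SQuAD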

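🚀 Cuckoo 🐦

The Cuckoo family consists of extractive question answering models that address information extraction tasks and support research and applications in related fields.

🚀 Quick Start

The Cuckoo models are the extractive question answering models described in the paper Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest.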

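Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models. Instead of retrieving the next tokens from a vocabulary, Cuckoo predicts them by tagging them in the given input context, as illustrated below:

[Figure: Cuckoo]

Cuckoo differs substantially from earlier IE pre-training because it can use any text resource to improve itself, especially the data carefully curated for large language models!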

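We have currently open-sourced Cuckoo checkpoints pre-trained on the following data:

100M next tokens extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
Cuckoo-C4 + 2.6M next tokens extraction (NTE) instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD and DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
Cuckoo-C4-Rainbow + multiple named entity recognition (NER) datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)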

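✨ Main Features

Performance Showcase 🚀

Begin your journey with Cuckoo and experience its remarkable adaptation efficiency across all kinds of information extraction tasks!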

Model	CoNLL2003	BioNLP2004	MIT-Restaurant	MIT-Movie	Avg.	CoNLL2004	ADE	Avg.	SQuAD	SQuAD-V2	DROP	Avg.
OPT - C4 - TuluV3	50.24	39.76	58.91	56.33	50.56	47.14	45.66	46.40	39.80	53.81	31.00	41.54
RoBERTa	33.75	32.91	62.15	58.32	46.80	34.16	2.15	18.15	31.86	48.55	9.16	29.86
MRQA	72.45	55.93	68.68	66.26	65.83	66.23	67.44	66.84	80.07	66.22	54.46	66.92
MultiNERD	66.78	54.62	64.16	66.30	60.59	57.52	45.10	51.31	42.85	50.99	30.12	41.32
NuNER	74.15	56.36	68.57	64.88	65.99	65.12	63.71	64.42	61.60	52.67	37.37	50.55
MetaIE	71.33	55.63	70.08	65.23	65.57	64.81	64.40	64.61	74.59	62.54	30.73	55.95
Cuckoo 🐦🛠️	73.60	57.00	67.63	67.12	66.34	69.57	71.70	70.63	77.47	64.06	54.25	65.26
└─ Pre-training only 🐦	72.46	55.87	66.87	67.23	65.61	68.14	69.39	68.77	75.64	63.36	52.81	63.94
└─ Post-training only	72.80	56.10	66.02	67.10	65.51	68.66	69.75	69.21	77.05	62.39	54.80	64.75
Rainbow Cuckoo 🌈🐦🛠️	79.94	58.39	70.30	67.00	68.91	70.47	76.05	73.26	86.57	69.41	64.64	73.54

(Super Rainbow Cuckoo 🦸🌈🐦🛠️ uses the training sets, except those of CoNLL2004 and ADE, to boost its performance)

Model	CoNLL2003	BioNLP2004	MIT-Restaurant	MIT-Movie	Avg.	CoNLL2004	ADE	Avg.	SQuAD	SQuAD-V2	DROP	Avg.
Super Rainbow Cuckoo 🦸🌈🐦🛠️	88.38	68.33	76.79	69.39	75.22	72.96	80.06	76.51	89.54	74.52	74.89	79.65

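💻 Usage Examples

Quick experience with Cuckoo in next tokens extraction ⚡

We recommend the strongest model, Super Rainbow Cuckoo 🦸🌈🐦🛠️, for zero-shot extraction. You can run the cases below directly in case_next_tokens_extraction.py.

Basic Usage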

# Load the model and tokenizer first
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

nlp = spacy.load("en_core_web_sm")

# Fall back to CPU when no GPU is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)

# Define the next tokens extraction function
def next_tokens_extraction(text):

    def find_sequences(lst):
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end+1))
            else:
                i += 1
        return sequences

    text = " ".join([token.text for token in nlp(text)])

    inputs = tokenizer(text, return_tensors="pt").to(device)
    tag_predictions = tagger(**inputs).logits[0].argmax(-1)

    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]
    
    return predictions
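
To make the span decoding step concrete, here is a minimal, model-free sketch of the same grouping logic applied to hand-written label ids. It assumes, as the function above implies, that label 0 opens an extracted span and label 1 continues it; treating the id 2 as "outside" and the example tags themselves are assumptions made purely for illustration.

```python
def find_sequences(lst):
    """Return (start, end) pairs, end exclusive, for each 0 followed by a run of 1s."""
    sequences = []
    i = 0
    while i < len(lst):
        if lst[i] == 0:
            start = i
            i += 1
            # Consume the continuation labels that belong to this span
            while i < len(lst) and lst[i] == 1:
                i += 1
            sequences.append((start, i))
        else:
            i += 1
    return sequences


# Hypothetical word-level example: 0 = begin, 1 = inside, 2 = outside (assumed)
tokens = ["Tom", "and", "Jack", "went", "to", "New", "York", "."]
tags = [0, 2, 0, 2, 2, 0, 1, 2]

spans = find_sequences(tags)
extracted = [" ".join(tokens[start:end]) for start, end in spans]
print(spans)
print(extracted)
```

In the real function the indices refer to subword positions in inputs.input_ids rather than to words, which is why the spans are decoded with the tokenizer instead of being joined with spaces.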

Advanced Usage

# Call the function to extract!
# Case 1: Basic entity and relation understanding
text = "Tom and Jack went to their trip in Paris."

for question in [
    "What is the person mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Where does George live in?",
]:
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)

# Case 2: Longer context
passage = f'''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)

# Case 3: Knowledge quiz
for obj in ["grass", "sea", "fire", "night"]:
    text = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(text)
    print(obj, predictions)

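Few-shot Adaptation 🎯

Cuckoo 🐦 is an expert at few-shot adaptation to your own tasks. Taking CoNLL2003 as an example, run bash run_downstream.sh conll2003.5shot KomeijiForce/Cuckoo-C4-Rainbow and you will get a fine-tuned model in models/cuckoo-conll2003.5shot. You can then benchmark the model with python eval_conll2003.py, which should report an F1 of around 80.

You can also train an adaptation to machine reading comprehension (SQuAD). Run bash run_downstream.sh squad.32shot KomeijiForce/Cuckoo-C4-Rainbow and you will get a fine-tuned model in models/cuckoo-squad.32shot. You can then benchmark the model with python eval_squad.py, which should report an F1 of around 88.

To fine-tune on your own task, create a JSON Lines file in which every line has the form {"words": [...], "ner": [...]}, for example: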

{"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}

This indicates that "John Smith" is to be predicted as the next tokens.
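
As a rough illustration of how such a line can be assembled, the sketch below tags one answer span in the context and appends the prompt words with "O" tags. The helper name make_nte_instance and its arguments are hypothetical and not part of the Cuckoo repository; it also handles only a single answer span.

```python
import json


def make_nte_instance(context_words, prompt_words, answer_span):
    """Tag context_words[start:end] as B/I; every other word, prompt included, is O."""
    start, end = answer_span  # end is exclusive
    if not (0 <= start < end <= len(context_words)):
        raise ValueError("answer_span must lie inside the context")
    ner = ["O"] * len(context_words)
    ner[start] = "B"
    for i in range(start + 1, end):
        ner[i] = "I"
    return {
        "words": list(context_words) + list(prompt_words),
        "ner": ner + ["O"] * len(prompt_words),
    }


instance = make_nte_instance(
    ["I", "am", "John", "Smith", "."],  # context
    ["Person", ":"],                    # prompt that the answer should follow
    (2, 4),                             # "John Smith"
)
print(json.dumps(instance))
```

Run on the inputs shown, this reproduces the example line above.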

You can refer to the following prompt templates to get started:

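Type	User Input	Assistant Response
Entity	User: [Context] Question: What is the [Label] mentioned?	Assistant: Answer: The [Label] is
Relation (Kill)	User: [Context] Question: Who does [Entity] kill?	Assistant: Answer: [Entity] kills
Relation (Live)	User: [Context] Question: Where does [Entity] live?	Assistant: Answer: [Entity] lives in
Relation (Work)	User: [Context] Question: Who does [Entity] work for?	Assistant: Answer: [Entity] works for
Relation (Located)	User: [Context] Question: Where is [Entity] located?	Assistant: Answer: [Entity] is located in
Relation (Based)	User: [Context] Question: Where is [Entity] based?	Assistant: Answer: [Entity] is based in
Relation (Adverse)	User: [Context] Question: What is the adverse effect of [Entity]?	Assistant: Answer: The adverse effect of [Entity] is
Query	User: [Context] Question: [Question]	Assistant: Answer:
Instruction (Entity)	User: [Context] Question: What is the [Label] mentioned? ([Instruction])	Assistant: Answer: The [Label] is
Instruction (Query)	User: [Context] Question: [Question] ([Instruction])	Assistant: Answer: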
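
The templates in the table can be turned into prompt strings with a few small helpers. This is only a sketch: the function names are hypothetical, the blank-line layout is copied from the usage examples earlier on this page, and the English wording of the templates is an assumption that you should check against the repository before training.

```python
def build_prompt(context, question, answer_prefix=None):
    """Assemble a User/Question/Assistant prompt, optionally ending in 'Answer: ...'."""
    prompt = f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"
    if answer_prefix is not None:
        # An empty prefix yields a bare "Answer:" as in the Query row
        prompt += f"\n\nAnswer: {answer_prefix}".rstrip()
    return prompt


def entity_prompt(context, label):
    """Entity row: ask for a label and start the answer with 'The [Label] is'."""
    return build_prompt(context, f"What is the {label} mentioned?", f"The {label} is")


# Relation rows as (question template, answer prefix template); wording is assumed
RELATION_TEMPLATES = {
    "live": ("Where does {entity} live?", "{entity} lives in"),
    "work": ("Who does {entity} work for?", "{entity} works for"),
}


def relation_prompt(context, relation, entity):
    question, prefix = RELATION_TEMPLATES[relation]
    return build_prompt(context, question.format(entity=entity), prefix.format(entity=entity))


context = "Tom lives in Paris and works for Acme."
print(entity_prompt(context, "city"))
print(relation_prompt(context, "live", "Tom"))
print(build_prompt(context, "Where does Tom work?", ""))
```

Keeping the templates in one dictionary makes it easy to add the remaining relation rows or to append an instruction in parentheses to the question.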

After building your own downstream dataset, save it to my_downstream.json and run bash run_downstream.sh my_downstream KomeijiForce/Cuckoo-C4-Rainbow. You will find the adapted Cuckoo model in models/cuckoo-my_downstream.
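
Before launching the run it can help to sanity-check the file. The sketch below validates each instance and writes one JSON object per line; the helper names are hypothetical, the tag set {"B", "I", "O"} is inferred from the single example line above, and the second instance is made up for illustration.

```python
import json
import os
import tempfile

VALID_TAGS = {"B", "I", "O"}  # assumption: inferred from the example line


def validate_instance(instance):
    """Raise ValueError if words and tags are misaligned or the tags are malformed."""
    words, ner = instance["words"], instance["ner"]
    if len(words) != len(ner):
        raise ValueError(f"{len(words)} words but {len(ner)} tags")
    unknown = set(ner) - VALID_TAGS
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    previous = "O"
    for tag in ner:
        if tag == "I" and previous == "O":
            raise ValueError("an I tag must follow a B or I tag")
        previous = tag


def write_jsonl(instances, path):
    """Validate everything first, then write one JSON object per line."""
    instances = list(instances)
    for instance in instances:
        validate_instance(instance)
    with open(path, "w", encoding="utf-8") as f:
        for instance in instances:
            f.write(json.dumps(instance, ensure_ascii=False) + "\n")
    return len(instances)


examples = [
    {"words": ["I", "am", "John", "Smith", ".", "Person", ":"],
     "ner": ["O", "O", "B", "I", "O", "O", "O"]},
    {"words": ["Tom", "lives", "in", "Paris", ".", "Location", ":"],
     "ner": ["O", "O", "O", "B", "O", "O", "O"]},
]

# Written to a temporary directory here so the demo does not overwrite real data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_downstream.json")
    written = write_jsonl(examples, path)
    with open(path, encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f]

print(written, reloaded == examples)
```

Validating all instances before opening the file avoids leaving a half-written dataset behind when one line is malformed.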

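Fly Your Own Cuckoo 🪽

The file nte_data_collection.py contains the script that converts text into NTE instances, using C4 as the example; the converted results can be inspected in cuckoo.c4.example.json. The script is designed to be easy to adapt to other resources such as entities, queries, and questions, so you can convert your own data into NTE instances and fly your own Cuckoo! Run the run_cuckoo.sh script to try an example pre-training run.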

python run_ner.py \
  --model_name_or_path roberta-large \
  --train_file cuckoo.c4.example.json \
  --output_dir models/cuckoo-c4-example \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --save_steps 1000 \
  --learning_rate 0.00001 \
  --do_train \
  --overwrite_output_dir

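You will get an example Cuckoo model in models/cuckoo-c4-example. It may not perform well if you pre-train it on too little data. You can adjust the hyperparameters in nte_data_collection.py, or modify the conversion to fit your own resources, to obtain better pre-training performance.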
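
To give a feel for what a text-to-NTE conversion might look like, here is a deliberately simplified sketch. It is not the logic of nte_data_collection.py: it assumes only that an NTE instance can be formed whenever the words that come next already appear earlier in the context, and it ignores filtering (for example of punctuation or stop words) that a real pipeline would likely need. All function names and parameters are hypothetical.

```python
def find_occurrences(words, span):
    """Return every start index at which span occurs in words."""
    n = len(span)
    return [j for j in range(len(words) - n + 1) if words[j:j + n] == span]


def text_to_nte(words, split, max_span=4):
    """Cut the text at `split` and tag earlier copies of the words that follow.

    Returns {"words": prefix, "ner": tags} or None when the upcoming words
    do not appear anywhere in the prefix.
    """
    prefix, future = words[:split], words[split:]
    # Prefer the longest upcoming span that already exists in the prefix
    for n in range(min(max_span, len(future)), 0, -1):
        starts = find_occurrences(prefix, future[:n])
        if not starts:
            continue
        ner = ["O"] * len(prefix)
        last_end = 0
        for j in starts:
            if j < last_end:  # skip overlapping matches
                continue
            ner[j] = "B"
            for k in range(j + 1, j + n):
                ner[k] = "I"
            last_end = j + n
        return {"words": prefix, "ner": ner}
    return None


words = ["Tom", "met", "Jack", "in", "Paris", ".", "Then", "Jack", "left", "."]
print(text_to_nte(words, split=7))
```

With the split placed just before the second "Jack", the earlier "Jack" is tagged as the span to extract, which mirrors how a language model would have to produce "Jack" as the next token.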

📚 Detailed Documentation

🐾 Citation

@article{DBLP:journals/corr/abs-2502-11275,
  author       = {Letian Peng and
                  Zilong Wang and
                  Feng Yao and
                  Jingbo Shang},
  title        = {Cuckoo: An {IE} Free Rider Hatched by Massive Nutrition in {LLM}'s Nest},
  journal      = {CoRR},
  volume       = {abs/2502.11275},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2502.11275},
  doi          = {10.48550/arXiv.2502.11275},
  eprinttype   = {arXiv},
  eprint       = {2502.11275},
  timestamp    = {Mon, 17 Feb 2025 19:32:20 +0000},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2502-11275.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}