🚀 Cuckoo
Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models. Instead of retrieving from the vocabulary, it predicts the next tokens by tagging them in the given input context. This lets Cuckoo leverage any text resource to enhance itself, in particular by taking a free ride on data curated for LLMs.
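To make the tagging paradigm concrete, here is a minimal illustration in plain Python (no model involved; the helper `bio_labels` is hypothetical and for exposition only): the "next tokens" an LLM would generate are not drawn from the vocabulary but marked inside the given context with B/I/O labels.

```python
# Hypothetical illustration of next tokens extraction (NTE) as tagging.
# The "next tokens" after a prompt are not generated from a vocabulary;
# they are marked inside the given context with B/I/O labels.
context = ["Tom", "and", "Jack", "went", "to", "their", "trip", "in", "Paris", "."]
answer = ["Paris"]  # what an LLM would generate for "What is the city mentioned here?"

def bio_labels(tokens, span):
    """Label tokens: B = span start, I = span continuation, O = outside."""
    labels = ["O"] * len(tokens)
    for start in range(len(tokens) - len(span) + 1):
        if tokens[start:start + len(span)] == span:
            labels[start] = "B"
            for k in range(start + 1, start + len(span)):
                labels[k] = "I"
    return labels

print(bio_labels(context, answer))
# ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'O']
```

A token classifier trained on such labels can therefore "predict the next tokens" for any prompt whose answer appears in the context.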
🚀 Quick Start
Experience Next Tokens Extraction
We recommend the strongest Super Rainbow Cuckoo (Cuckoo-C4-Super-Rainbow) for zero-shot extraction.
1️⃣ First, load the model and the tokenizer

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

nlp = spacy.load("en_core_web_sm")
device = torch.device("cuda:0")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)
```
2️⃣ Define the next tokens extraction function

```python
def next_tokens_extraction(text):
    def find_sequences(lst):
        # Collect (start, end) spans: label 0 starts a span, label 1 continues it.
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end + 1))
            else:
                i += 1
        return sequences

    text = " ".join([token.text for token in nlp(text)])
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tag_predictions = tagger(**inputs).logits[0].argmax(-1)
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip()
                   for seq in find_sequences(tag_predictions)]
    return predictions
```
3️⃣ Call the function to extract!
Basic Usage

```python
text = "Tom and Jack went to their trip in Paris."

for question in [
    "What is the person mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Where does George live in?",
]:
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)
```
You will get results like:

```
What is the person mentioned here? ['Tom', 'Jack']
What is the city mentioned here? ['Paris']
Who goes with Tom together? ['Jack']
What do Tom and Jack go to Paris for? ['trip']
Where does George live in? []
```

where `[]` indicates that Cuckoo judges there are no next tokens to extract.
Advanced Usage

```python
passage = '''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)
```
You will get results like:

```
What are the people mentioned here? ['Ludwig van Beethoven', 'Joseph Haydn', 'Wolfgang Amadeus Mozart']
What is the job of Beethoven? ['composer and pianist']
How famous is Beethoven? ['one of the most revered figures in the history of Western music']
When did Beethoven's middle period showed an individual development? ['1802']
```
Few-shot Adaptation
Cuckoo excels at few-shot adaptation to its own tasks. Some examples:

- Named entity recognition (CoNLL2003): run `bash run_downstream.sh conll2003.5shot KomeijiForce/Cuckoo-C4-Rainbow`, and you will get a fine-tuned model in `models/cuckoo-conll2003.5shot`. Benchmark it with `python eval_conll2003.py`, which reaches an F1 of roughly 80.
- Machine reading comprehension (SQuAD): run `bash run_downstream.sh squad.32shot KomeijiForce/Cuckoo-C4-Rainbow`, and you will get a fine-tuned model in `models/cuckoo-squad.32shot`. Benchmark it with `python eval_squad.py`, which reaches an F1 of roughly 88.

To fine-tune on your own task, create a JSON Lines file in which each line contains `{"words": [...], "ner": [...]}`, for example:

```json
{"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}
```

which indicates "John Smith" being predicted as the next tokens.
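One way to build such a line from a tokenized context, a prompt suffix, and an answer span is sketched below (the helper name `to_nte_instance` is my own, not part of the repo, and the real data may tag answer spans differently, e.g. every occurrence rather than only the first):

```python
import json

def to_nte_instance(context_words, prompt_words, answer_words):
    """Build one NTE training line: tag the answer span inside the context
    with B/I and leave everything else (including the prompt suffix) as O."""
    words = context_words + prompt_words
    ner = ["O"] * len(words)
    n = len(answer_words)
    for start in range(len(context_words) - n + 1):
        if context_words[start:start + n] == answer_words:
            ner[start] = "B"
            for k in range(start + 1, start + n):
                ner[k] = "I"
            break  # tag the first occurrence only
    return json.dumps({"words": words, "ner": ner})

line = to_nte_instance(["I", "am", "John", "Smith", "."],
                       ["Person", ":"],
                       ["John", "Smith"])
print(line)
# {"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}
```

Writing one such line per training example yields a JSON Lines file ready for `run_downstream.sh`.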
You can refer to the following prompts to start building your own downstream datasets:
| Type | User Input | Assistant Response |
|---|---|---|
| Entity | User: [Context] Question: What is the [Label] mentioned here? | Assistant: Answer: The [Label] is |
| Relation (Kill) | User: [Context] Question: Who does [Entity] kill? | Assistant: Answer: [Entity] kills |
| Relation (Live) | User: [Context] Question: Where does [Entity] live in? | Assistant: Answer: [Entity] lives in |
| Relation (Work) | User: [Context] Question: Who does [Entity] work for? | Assistant: Answer: [Entity] works for |
| Relation (Located) | User: [Context] Question: Where is [Entity] located in? | Assistant: Answer: [Entity] is located in |
| Relation (Based) | User: [Context] Question: Where is [Entity] based in? | Assistant: Answer: [Entity] is based in |
| Relation (Adverse) | User: [Context] Question: What is the adverse effect of [Entity]? | Assistant: Answer: The adverse effect of [Entity] is |
| Query | User: [Context] Question: [Question] | Assistant: Answer: |
| Instruction (Entity) | User: [Context] Question: What is the [Label] mentioned here? ([Instruction]) | Assistant: Answer: The [Label] is |
| Instruction (Query) | User: [Context] Question: [Question] ([Instruction]) | Assistant: Answer: |
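The templates above can be assembled programmatically. A minimal sketch, assuming the English phrasings shown in the table (the function names are illustrative, not part of the repo):

```python
def entity_prompt(context, label):
    """'Entity' row of the template table (label is e.g. 'person' or 'city')."""
    return (f"User:\n\n{context}\n\nQuestion: What is the {label} mentioned here?"
            f"\n\nAssistant: Answer: The {label} is")

def query_prompt(context, question):
    """'Query' row of the template table."""
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant: Answer:"

print(entity_prompt("Tom and Jack went to their trip in Paris.", "city"))
```

Note that the assistant prefix ("Answer: The [Label] is") is part of the prompt: it steers the tagger toward the span that would follow it.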
After building your own downstream dataset, save it to `my_downstream.json`, then run `bash run_downstream.sh my_downstream KomeijiForce/Cuckoo-C4-Rainbow`. You will find the adapted Cuckoo in `models/cuckoo-my_downstream`.
Train Your Own Cuckoo
We provide the script for converting texts into NTE instances in `nte_data_collection.py`, taking C4 as an example; the converted results can be checked in `cuckoo.c4.example.json`. The script is easy to adapt to other resources such as entities, queries, and questions, so you can modify your own data into the NTE format to train your own Cuckoo. Run the `run_cuckoo.sh` script to try an example pre-training:
```shell
python run_ner.py \
  --model_name_or_path roberta-large \
  --train_file cuckoo.c4.example.json \
  --output_dir models/cuckoo-c4-example \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --save_steps 1000 \
  --learning_rate 0.00001 \
  --do_train \
  --overwrite_output_dir
```
You will get an example Cuckoo model in `models/cuckoo-c4-example`. Note that the model may not perform well when pre-trained on too little data; you can adjust the hyperparameters in `nte_data_collection.py` or modify the conversion logic for better pre-training performance.
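The conversion idea behind `nte_data_collection.py` can be sketched as follows (a simplified, hypothetical version; the real script applies additional filtering and scales to C4): slide over a sentence, treat the upcoming span as the next tokens, and emit an instance whenever that span already occurs in the preceding context:

```python
def nte_from_text(words, span_len=1):
    """Generate NTE instances from raw text: for each position, if the
    upcoming span of span_len tokens also occurs earlier in the context,
    emit an instance whose ner labels mark that earlier occurrence (B/I)."""
    instances = []
    for i in range(span_len, len(words) - span_len + 1):
        context, span = words[:i], words[i:i + span_len]
        for s in range(len(context) - span_len + 1):
            if context[s:s + span_len] == span:
                ner = ["O"] * len(context)
                ner[s] = "B"
                for k in range(s + 1, s + span_len):
                    ner[k] = "I"
                instances.append({"words": context, "ner": ner})
                break
    return instances

for inst in nte_from_text("the cat sat on the mat".split()):
    print(inst)
# {'words': ['the', 'cat', 'sat', 'on'], 'ner': ['B', 'O', 'O', 'O']}
```

Here "the" (the token after "on") already appears at position 0, so that occurrence is tagged as the extractable next token.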
✨ Key Features
- A novel IE paradigm: Cuckoo imitates the next-token prediction paradigm of large language models, predicting the next tokens by tagging them in the given input context, which differs substantially from traditional IE pre-training.
- Efficient data utilization: it can leverage any text resource to enhance itself, in particular by learning efficiently from data curated for LLMs.
- Versatile adaptation: it supports both zero-shot extraction and few-shot adaptation, adapting quickly to different IE tasks.
📦 Installation
The documentation does not provide specific installation steps.
💻 Usage Examples
Detailed usage examples are covered in the Quick Start section above.
📚 Documentation
Performance
Cuckoo performs well on a variety of IE tasks. Below is a comparison with other models:
| Model | CoNLL2003 | BioNLP2004 | MIT-Restaurant | MIT-Movie | Avg. | CoNLL2004 | ADE | Avg. | SQuAD | SQuAD-V2 | DROP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OPT-C4-TuluV3 | 50.24 | 39.76 | 58.91 | 56.33 | 50.56 | 47.14 | 45.66 | 46.40 | 39.80 | 53.81 | 31.00 | 41.54 |
| RoBERTa | 33.75 | 32.91 | 62.15 | 58.32 | 46.80 | 34.16 | 2.15 | 18.15 | 31.86 | 48.55 | 9.16 | 29.86 |
| MRQA | 72.45 | 55.93 | 68.68 | 66.26 | 65.83 | 66.23 | 67.44 | 66.84 | 80.07 | 66.22 | 54.46 | 66.92 |
| MultiNERD | 66.78 | 54.62 | 64.16 | 66.30 | 60.59 | 57.52 | 45.10 | 51.31 | 42.85 | 50.99 | 30.12 | 41.32 |
| NuNER | 74.15 | 56.36 | 68.57 | 64.88 | 65.99 | 65.12 | 63.71 | 64.42 | 61.60 | 52.67 | 37.37 | 50.55 |
| MetaIE | 71.33 | 55.63 | 70.08 | 65.23 | 65.57 | 64.81 | 64.40 | 64.61 | 74.59 | 62.54 | 30.73 | 55.95 |
| Cuckoo 🐦🛠️ | 73.60 | 57.00 | 67.63 | 67.12 | 66.34 | 69.57 | 71.70 | 70.63 | 77.47 | 64.06 | 54.25 | 65.26 |
| └─ Only Pre-train 🐦 | 72.46 | 55.87 | 66.87 | 67.23 | 65.61 | 68.14 | 69.39 | 68.77 | 75.64 | 63.36 | 52.81 | 63.94 |
| └─ Only Post-train | 72.80 | 56.10 | 66.02 | 67.10 | 65.51 | 68.66 | 69.75 | 69.21 | 77.05 | 62.39 | 54.80 | 64.75 |
| Rainbow Cuckoo 🌈🐦🛠️ | 79.94 | 58.39 | 70.30 | 67.00 | 68.91 | 70.47 | 76.05 | 73.26 | 86.57 | 69.41 | 64.64 | 73.54 |
Available Pre-trained Checkpoints
The currently open-sourced Cuckoo checkpoints are pre-trained on:
- 100M next tokens extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
- Cuckoo-C4 + 2.6M NTE instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
- Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD, DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
- Cuckoo-C4-Rainbow + multiple NER datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)
🔧 Technical Details
The documentation does not provide further technical details.
📄 License
This project is licensed under the Apache-2.0 license.
🐾 引用
@article{DBLP:journals/corr/abs-2502-11275,
author = {Letian Peng and
Zilong Wang and
Feng Yao and
Jingbo Shang},
title = {Cuckoo: An {IE} Free Rider Hatched by Massive Nutrition in {LLM}'s Nest},
journal = {CoRR},
volume = {abs/2502.11275},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2502.11275},
doi = {10.48550/arXiv.2502.11275},
eprinttype = {arXiv},
eprint = {2502.11275},
timestamp = {Mon, 17 Feb 2025 19:32:20 +0000},
biburl = {https://dblp.org/rec/journals/corr/abs-2502-11275.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}



