# 🚀 Cuckoo 🐦

Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models (LLMs). Instead of retrieving from a vocabulary, Cuckoo predicts the next tokens by tagging them in the given input context. The model can leverage any text resource to improve itself, and in particular can ride on data curated for LLMs.
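As a rough illustration (ours, not from the original card; it uses the B/I/O tagging scheme that appears in the fine-tuning section below), next-token extraction turns generation into in-context tagging:

```python
# Illustration only: an LLM would generate the answer "Tom" token by token from its
# vocabulary; Cuckoo instead tags the span "Tom" inside the context it is given
# (B = span start, I = inside the span, O = outside).
context = ["Tom", "and", "Jack", "went", "to", "Paris", ".", "Person", ":"]
tags    = ["B",   "O",   "O",    "O",    "O",  "O",     "O", "O",      "O"]
```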
## 🚀 Quick Start

We currently open-source Cuckoo checkpoints pre-trained on different datasets:

- 100M next-token extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
- Cuckoo-C4 + 2.6M NTE instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
- Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, and MRQA (excluding SQuAD and DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
- Cuckoo-C4-Rainbow + multiple named entity recognition (NER) datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, and SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)
## ✨ Key Features

- Imitative prediction paradigm: mimics the next-token prediction paradigm of LLMs, predicting by tagging the next tokens within the input context.
- Efficient data utilization: can enhance itself with any text resource, especially data curated for LLMs.
- Multi-scenario adaptability: performs well on a wide range of IE tasks and supports few-shot adaptation.
## 📊 Performance

Cuckoo's performance on different benchmarks, compared with other models (the columns group into NER, relation extraction, and machine reading comprehension, each with its own average):
| | CoNLL2003 | BioNLP2004 | MIT-Restaurant | MIT-Movie | Avg. | CoNLL2004 | ADE | Avg. | SQuAD | SQuAD-V2 | DROP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OPT-C4-TuluV3 | 50.24 | 39.76 | 58.91 | 56.33 | 50.56 | 47.14 | 45.66 | 46.40 | 39.80 | 53.81 | 31.00 | 41.54 |
| RoBERTa | 33.75 | 32.91 | 62.15 | 58.32 | 46.80 | 34.16 | 2.15 | 18.15 | 31.86 | 48.55 | 9.16 | 29.86 |
| MRQA | 72.45 | 55.93 | 68.68 | 66.26 | 65.83 | 66.23 | 67.44 | 66.84 | 80.07 | 66.22 | 54.46 | 66.92 |
| MultiNERD | 66.78 | 54.62 | 64.16 | 66.30 | 60.59 | 57.52 | 45.10 | 51.31 | 42.85 | 50.99 | 30.12 | 41.32 |
| NuNER | 74.15 | 56.36 | 68.57 | 64.88 | 65.99 | 65.12 | 63.71 | 64.42 | 61.60 | 52.67 | 37.37 | 50.55 |
| MetaIE | 71.33 | 55.63 | 70.08 | 65.23 | 65.57 | 64.81 | 64.40 | 64.61 | 74.59 | 62.54 | 30.73 | 55.95 |
| Cuckoo 🐦🛠️ | 73.60 | 57.00 | 67.63 | 67.12 | 66.34 | 69.57 | 71.70 | 70.63 | 77.47 | 64.06 | 54.25 | 65.26 |
| └─ Only Pre-train 🐦 | 72.46 | 55.87 | 66.87 | 67.23 | 65.61 | 68.14 | 69.39 | 68.77 | 75.64 | 63.36 | 52.81 | 63.94 |
| └─ Only Post-train | 72.80 | 56.10 | 66.02 | 67.10 | 65.51 | 68.66 | 69.75 | 69.21 | 77.05 | 62.39 | 54.80 | 64.75 |
| Rainbow Cuckoo 🌈🐦🛠️ | 79.94 | 58.39 | 70.30 | 67.00 | 68.91 | 70.47 | 76.05 | 73.26 | 86.57 | 69.41 | 64.64 | 73.54 |
## 💻 Usage Examples

### Basic Usage

We recommend the strongest Cuckoo-C4-Super-Rainbow model for zero-shot extraction. The steps are as follows:
- Load the model and tokenizer:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

# spaCy is used only to tokenize the input text before tagging
nlp = spacy.load("en_core_web_sm")
device = torch.device("cuda:0")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)
```
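If you are running without a GPU, a minimal fallback (our assumption, not part of the original snippet) is to pick the device conditionally:

```python
# Hedged sketch: fall back to CPU when CUDA is unavailable.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```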
- Define the next tokens extraction function:

```python
def next_tokens_extraction(text):
    # Decode label ids back into (start, end) spans: a B tag (id 0) opens a span
    # and consecutive I tags (id 1) extend it.
    def find_sequences(lst):
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end + 1))
            else:
                i += 1
        return sequences

    # Normalize the text with spaCy tokenization, tag every token, then decode the predicted spans
    text = " ".join([token.text for token in nlp(text)])
    inputs = tokenizer(text, return_tensors="pt").to(device)
    tag_predictions = tagger(**inputs).logits[0].argmax(-1)
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]
    return predictions
```
- Call the function for extraction. Below are examples for different scenarios:

- Basic entity and relation understanding:
```python
text = "Tom and Jack went to their trip in Paris."
for question in [
    "What is the person mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Where does George live in?",
]:
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)
```

For questions whose answer is absent from the context (such as the last one about George), no B tag is predicted and the function returns an empty list.
- Longer context processing:
```python
passage = '''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''
for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)
```
- Knowledge quiz:
```python
for obj in ["grass", "sea", "fire", "night"]:
    text = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(text)
    print(obj, predictions)
```
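The entity and long-context examples above wrap their inputs in the same chat-style template (the quiz example appends an extra `Answer:` line). A tiny helper (hypothetical, not part of the original card) makes that pattern explicit:

```python
# Hypothetical helper: the examples above share one "User ... Question ... Assistant" template.
def build_prompt(context, question):
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"

print(next_tokens_extraction(build_prompt("Tom and Jack went to their trip in Paris.", "What is the city mentioned here?")))
```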
## 🎯 Few-shot Adaptation

Cuckoo adapts well to downstream tasks with only a few shots. Some examples:
- Named entity recognition (CoNLL2003, 5-shot): run `bash run_downstream.sh conll2003.5shot KomeijiForce/Cuckoo-C4-Rainbow`, which produces a fine-tuned model in `models/cuckoo-conll2003.5shot`. You can then benchmark it with `python eval_conll2003.py`; the F1 score should be around 80.
- Machine reading comprehension (SQuAD, 32-shot): run `bash run_downstream.sh squad.32shot KomeijiForce/Cuckoo-C4-Rainbow`, which produces a fine-tuned model in `models/cuckoo-squad.32shot`. You can then benchmark it with `python eval_squad.py`; the F1 score should be around 88.
To fine-tune on your own task, create a JSON Lines file where each line contains `{"words": [...], "ner": [...]}`, for example:

```json
{"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}
```
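As a minimal sketch (the helper below is ours, not part of the repository), such a line can be built programmatically once you know where the answer span starts and how many tokens it covers:

```python
# Hedged sketch: build one NTE training line from pre-tokenized words plus an answer span.
import json

def make_nte_line(words, answer_start, answer_len):
    ner = ["O"] * len(words)
    ner[answer_start] = "B"                      # first token of the target span
    for k in range(answer_start + 1, answer_start + answer_len):
        ner[k] = "I"                             # continuation tokens
    return json.dumps({"words": words, "ner": ner})

# Reproduces the example line above: "John Smith" answers the "Person:" prompt.
print(make_nte_line(["I", "am", "John", "Smith", ".", "Person", ":"], 2, 2))
```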
After creating your own downstream dataset, save it as `my_downstream.json` and run `bash run_downstream.sh my_downstream KomeijiForce/Cuckoo-C4-Rainbow`. You will find the adapted Cuckoo model in `models/cuckoo-my_downstream`.
## 🪽 Train Your Own Cuckoo

The script `nte_data_collection.py` converts raw text into NTE instances, using C4 as an example; the converted result can be inspected in `cuckoo.c4.example.json`. The script is easy to adapt to other resources, such as entities, queries, and questions, so you can transform your own data into the NTE format to train your own Cuckoo. Run the `run_cuckoo.sh` script to try an example pre-training:
```bash
python run_ner.py \
  --model_name_or_path roberta-large \
  --train_file cuckoo.c4.example.json \
  --output_dir models/cuckoo-c4-example \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --save_steps 1000 \
  --learning_rate 0.00001 \
  --do_train \
  --overwrite_output_dir
```
You will get an example Cuckoo model in `models/cuckoo-c4-example` (with the settings above, the effective batch size is 4 × 16 = 64 per device). If the pre-training data is too small, the model may not perform well. You can adjust the hyperparameters in `nte_data_collection.py` or modify the conversion logic to fit your own resources for better pre-training performance.
## 🐾 Citation

```bibtex
@article{DBLP:journals/corr/abs-2502-11275,
  author       = {Letian Peng and
                  Zilong Wang and
                  Feng Yao and
                  Jingbo Shang},
  title        = {Cuckoo: An {IE} Free Rider Hatched by Massive Nutrition in {LLM}'s Nest},
  journal      = {CoRR},
  volume       = {abs/2502.11275},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2502.11275},
  doi          = {10.48550/arXiv.2502.11275},
  eprinttype   = {arXiv},
  eprint       = {2502.11275},
  timestamp    = {Mon, 17 Feb 2025 19:32:20 +0000},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2502-11275.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
```
## 📄 License

This project is licensed under the Apache 2.0 License.