Cuckoo-C4-Rainbow: an open-source information extraction model, small in size and precise in extraction

Cuckoo C4 Rainbow

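Developed by KomeijiForce

Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models: it predicts by tagging the next tokens in the given context.

Knowledge Graph

Transformers

License: Apache-2.0 #next-tokens-extraction #few-shot-adaptation #IE-free-rider

Downloads: 17

Released: February 16, 2025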

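Model Overview

Cuckoo is an information extraction model that makes predictions with the next tokens extraction (NTE) paradigm and can extract many kinds of information from text.

Model Features

Next tokens extraction paradigm

Unlike conventional approaches, Cuckoo predicts by tagging the next tokens within the context, imitating how large language models predict.

Self-enhancement

It can use any text resource to enhance itself, in particular data curated for large language models.

Efficient adaptation

It is well suited to few-shot adaptation to specific tasks and performs well across a range of information extraction tasks.

Multiple versions

Pre-trained checkpoints are available in base, Instruct, Rainbow, and Super Rainbow versions to suit different needs.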

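Model Capabilities

Entity recognition

Relation extraction

Knowledge question answering

Long-context understanding

Few-shot adaptation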

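Use Cases

Information extraction

Basic entity and relation understanding

Extract entities such as person names and places from text, along with the relations between them.

Example output: ['Tom', 'Jack'] (persons), ['Paris'] (city)

Long-context understanding

Extract key information and relations from long, complex text.

Example output: ['Ludwig van Beethoven'] (person), ['composer and pianist'] (job)

Knowledge question answering

Answer simple knowledge questions based on the text.

Example output: ['green'] (color of grass), ['blue'] (color of the sea)

Customized applications

Few-shot adaptation

Adapt quickly to domain-specific information extraction tasks from a small number of examples.

F1 of about 80 on CoNLL2003 after 5-shot adaptation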

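🚀 Cuckoo 🐦

Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models. Instead of retrieving the next token from the vocabulary, Cuckoo predicts it by tagging the next tokens in the given input context. The model can use any text resource to improve itself, and in particular can take advantage of data curated for LLMs.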
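To make "tagging the next tokens in the context" concrete, here is a minimal toy sketch. It is not the authors' conversion script: the function name `make_nte_instance`, the whitespace tokenization, and the exact matching rule are assumptions made for illustration. The idea shown is that the tokens that would come next are not generated from a vocabulary but located and tagged (B/I/O) where they already appear in the preceding context.

```python
def make_nte_instance(tokens, start, length):
    """Build one toy next-tokens-extraction instance (hypothetical helper).

    tokens[:start] is the context and tokens[start:start + length] are the
    "next tokens". Every occurrence of those next tokens inside the context
    is tagged B (first token) / I (following tokens); everything else is O.
    Returns None when the next tokens never occur in the context.
    """
    if length < 1 or start + length > len(tokens):
        raise ValueError("span must be non-empty and lie inside tokens")
    context = tokens[:start]
    target = tokens[start:start + length]
    tags = ["O"] * len(context)
    found = False
    for i in range(len(context) - length + 1):
        if context[i:i + length] == target:
            tags[i] = "B"
            for j in range(i + 1, i + length):
                tags[j] = "I"
            found = True
    return {"words": context, "ner": tags} if found else None


tokens = "Tom and Jack went to Paris . The person who went with Jack is Tom".split()
# The next token after "... with Jack is" is "Tom", which already occurs in the context.
instance = make_nte_instance(tokens, start=len(tokens) - 1, length=1)
print(list(zip(instance["words"], instance["ner"])))
```

A language model would answer this prompt by emitting "Tom" from its vocabulary; a tagger trained on instances like this one marks the earlier "Tom" in the input instead.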

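🚀 Quick Start

We currently open-source Cuckoo checkpoints pre-trained on the following data:

100M next tokens extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
Cuckoo-C4 + 2.6M NTE instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD and DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
Cuckoo-C4-Rainbow + multiple named entity recognition (NER) datasets, WizardLM, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)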

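✨ Key Features

Imitates the LLM prediction paradigm: follows the next-token prediction paradigm of large language models, but predicts by tagging the next tokens in the input context.
Efficient use of data: can use any text resource to enhance itself, in particular data curated for large language models.
Adapts to many scenarios: performs well across a variety of information extraction tasks and supports few-shot adaptation.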

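📊 Performance

The table below compares Cuckoo with other models on different datasets. CoNLL2003, BioNLP2004, MIT-Restaurant, and MIT-Movie are named entity recognition (NER) datasets, CoNLL2004 and ADE are relation extraction (RE) datasets, and SQuAD, SQuAD-V2, and DROP are question answering (QA) datasets.

| Model | CoNLL2003 | BioNLP2004 | MIT-Restaurant | MIT-Movie | Avg. (NER) | CoNLL2004 | ADE | Avg. (RE) | SQuAD | SQuAD-V2 | DROP | Avg. (QA) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-C4-TuluV3 | 50.24 | 39.76 | 58.91 | 56.33 | 50.56 | 47.14 | 45.66 | 46.40 | 39.80 | 53.81 | 31.00 | 41.54 |
| RoBERTa | 33.75 | 32.91 | 62.15 | 58.32 | 46.80 | 34.16 | 2.15 | 18.15 | 31.86 | 48.55 | 9.16 | 29.86 |
| MRQA | 72.45 | 55.93 | 68.68 | 66.26 | 65.83 | 66.23 | 67.44 | 66.84 | 80.07 | 66.22 | 54.46 | 66.92 |
| MultiNERD | 66.78 | 54.62 | 64.16 | 66.30 | 60.59 | 57.52 | 45.10 | 51.31 | 42.85 | 50.99 | 30.12 | 41.32 |
| NuNER | 74.15 | 56.36 | 68.57 | 64.88 | 65.99 | 65.12 | 63.71 | 64.42 | 61.60 | 52.67 | 37.37 | 50.55 |
| MetaIE | 71.33 | 55.63 | 70.08 | 65.23 | 65.57 | 64.81 | 64.40 | 64.61 | 74.59 | 62.54 | 30.73 | 55.95 |
| Cuckoo 🐦🛠️ | 73.60 | 57.00 | 67.63 | 67.12 | 66.34 | 69.57 | 71.70 | 70.63 | 77.47 | 64.06 | 54.25 | 65.26 |
| └─ Only Pre-train 🐦 | 72.46 | 55.87 | 66.87 | 67.23 | 65.61 | 68.14 | 69.39 | 68.77 | 75.64 | 63.36 | 52.81 | 63.94 |
| └─ Only Post-train | 72.80 | 56.10 | 66.02 | 67.10 | 65.51 | 68.66 | 69.75 | 69.21 | 77.05 | 62.39 | 54.80 | 64.75 |
| Rainbow Cuckoo 🌈🐦🛠️ | 79.94 | 58.39 | 70.30 | 67.00 | 68.91 | 70.47 | 76.05 | 73.26 | 86.57 | 69.41 | 64.64 | 73.54 |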
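The three Avg. columns in the table appear to be plain means over the dataset columns to their left. The short check below recomputes them for the Rainbow Cuckoo row only. The grouping of columns and the group names (`NER`, `RE`, `QA`) are assumptions based on the column order, not something the table states.

```python
# Scores copied from the "Rainbow Cuckoo" row of the table.
rainbow = {
    "CoNLL2003": 79.94, "BioNLP2004": 58.39, "MIT-Restaurant": 70.30, "MIT-Movie": 67.00,
    "CoNLL2004": 70.47, "ADE": 76.05,
    "SQuAD": 86.57, "SQuAD-V2": 69.41, "DROP": 64.64,
}

# Assumed grouping: each Avg. column covers the datasets listed before it.
groups = {
    "NER": ["CoNLL2003", "BioNLP2004", "MIT-Restaurant", "MIT-Movie"],
    "RE": ["CoNLL2004", "ADE"],
    "QA": ["SQuAD", "SQuAD-V2", "DROP"],
}


def group_average(scores, names):
    """Mean of the selected scores, rounded to two decimals like the table."""
    return round(sum(scores[name] for name in names) / len(names), 2)


averages = {group: group_average(rainbow, names) for group, names in groups.items()}
print(averages)
```

For this row the recomputed values match the reported 68.91, 73.26, and 73.54.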

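💻 Usage Examples

Basic usage

We recommend the strongest checkpoint, Super Rainbow Cuckoo (Cuckoo-C4-Super-Rainbow), for zero-shot extraction. The steps are as follows.

Load the model and tokenizer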

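from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

# spaCy is only used to split the input text into words before tokenization.
nlp = spacy.load("en_core_web_sm")

# Use the first GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)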

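Define the next tokens extraction function

def next_tokens_extraction(text):

    def find_sequences(lst):
        # Collect (start, end) index pairs: label 0 opens a span, following 1s extend it.
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end+1))
            else:
                i += 1
        return sequences

    # Re-join the spaCy tokens with single spaces so punctuation is split from words.
    text = " ".join([token.text for token in nlp(text)])

    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Inference only: skip gradient tracking and keep the best label per token.
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1).tolist()

    # Decode each tagged token span back into text.
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]

    return predictions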
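The span-decoding step inside `next_tokens_extraction` can be tried without downloading the model. The sketch below reuses the same scanning logic on a hand-written list of label ids. The word-level tokens and the use of `2` for "outside" are illustrative assumptions; the function above only relies on `0` opening a span and `1` continuing it.

```python
def find_sequences(tag_ids):
    """Return (start, end) pairs, end exclusive, for runs of 0 followed by 1s."""
    spans = []
    i = 0
    while i < len(tag_ids):
        if tag_ids[i] == 0:          # 0 opens a new span
            start = i
            i += 1
            while i < len(tag_ids) and tag_ids[i] == 1:  # 1 extends it
                i += 1
            spans.append((start, i))
        else:                        # anything else is skipped
            i += 1
    return spans


# Toy input: words stand in for model tokens, labels are written by hand.
tokens = ["User", ":", "Tom", "and", "Jack", "Smith", "went", "to", "Paris"]
tag_ids = [2, 2, 0, 2, 0, 1, 2, 2, 2]

spans = find_sequences(tag_ids)
extracted = [" ".join(tokens[start:end]) for start, end in spans]
print(spans)      # [(2, 3), (4, 6)]
print(extracted)  # ['Tom', 'Jack Smith']
```

Note that a `1` with no preceding `0` is ignored, so a continuation label on its own never produces a prediction.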

Call the function for extraction

The examples below cover different scenarios.

Basic entity and relation understanding

text = "Tom and Jack went to their trip in Paris."

for question in [
    "What is the person mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Where does George live in?",
]:
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)

Longer-context processing

passage = f'''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)

Knowledge question answering

for obj in ["grass", "sea", "fire", "night"]:
    text = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(text)
    print(obj, predictions)
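The three examples above share two prompt layouts. As a convenience, they can be wrapped in small helpers; the function names `build_question_prompt` and `build_choice_prompt` are hypothetical and not part of the released code, while the string layouts are copied from the examples.

```python
def build_question_prompt(context, question):
    """Prompt layout used in the entity/relation and long-context examples."""
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"


def build_choice_prompt(choices, question):
    """Prompt layout used in the knowledge question answering example."""
    choice_block = "\n".join(choices)
    return (
        f"User:\n\nChoices:\n{choice_block}.\n\n"
        f"Question: {question}\n\nAssistant:\n\nAnswer:"
    )


prompt = build_question_prompt(
    "Tom and Jack went to their trip in Paris.",
    "What is the city mentioned here?",
)
choice_prompt = build_choice_prompt(
    ["red", "blue", "green"],
    "What is the color of the grass?",
)
print(prompt)
print(choice_prompt)
```

Either string can then be passed to `next_tokens_extraction` in place of the inline f-strings.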

🎯 Few-shot Adaptation

Cuckoo 🐦 is good at few-shot adaptation to your own tasks. Two examples:

CoNLL2003: run bash run_downstream.sh conll2003.5shot KomeijiForce/Cuckoo-C4-Rainbow to get a fine-tuned model in models/cuckoo-conll2003.5shot. Then benchmark it with python eval_conll2003.py; the F1 score should be around 80.
Machine reading comprehension (SQuAD): run bash run_downstream.sh squad.32shot KomeijiForce/Cuckoo-C4-Rainbow to get a fine-tuned model in models/cuckoo-squad.32shot. Then benchmark it with python eval_squad.py; the F1 score should be around 88.

To fine-tune on your own task, create a JSON Lines file in which every line has the form {"words": [...], "ner": [...]}, for example:

{"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}

Once your downstream dataset is ready, save it as my_downstream.json and run bash run_downstream.sh my_downstream KomeijiForce/Cuckoo-C4-Rainbow. The adapted Cuckoo will be in models/cuckoo-my_downstream.
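A minimal sketch for producing and checking lines in this format is shown below. The helper names `make_record` and `validate_record` are hypothetical. The layout (sentence words followed by a label prompt such as "Person" ":") is inferred from the single example above, and the rule that "I" must follow "B" or "I" is the usual BIO convention, assumed here rather than confirmed by the training scripts.

```python
import json

ALLOWED_TAGS = {"B", "I", "O"}


def make_record(sentence_words, span_start, span_end, label_prompt):
    """Tag sentence_words[span_start:span_end] and append the label prompt words."""
    if not 0 <= span_start < span_end <= len(sentence_words):
        raise ValueError("span must be non-empty and inside the sentence")
    words = list(sentence_words) + list(label_prompt)
    tags = ["O"] * len(words)
    tags[span_start] = "B"
    for i in range(span_start + 1, span_end):
        tags[i] = "I"
    return {"words": words, "ner": tags}


def validate_record(record):
    """Return a list of problems; an empty list means the record looks fine."""
    words, tags = record.get("words"), record.get("ner")
    if not isinstance(words, list) or not isinstance(tags, list):
        return ["'words' and 'ner' must both be lists"]
    problems = []
    if len(words) != len(tags):
        problems.append(f"{len(words)} words but {len(tags)} tags")
    previous = "O"
    for idx, tag in enumerate(tags):
        if tag not in ALLOWED_TAGS:
            problems.append(f"unknown tag {tag!r} at position {idx}")
        elif tag == "I" and previous == "O":
            problems.append(f"'I' at position {idx} does not follow 'B' or 'I'")
        previous = tag
    return problems


record = make_record(["I", "am", "John", "Smith", "."], 2, 4, ["Person", ":"])
line = json.dumps(record)  # one line of the JSON Lines file
print(line)
print(validate_record(record))
```

Writing one such `line` per example to my_downstream.json, each terminated by a newline, gives a file in the shape described above.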

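🪽 Train Your Own Cuckoo

The script nte_data_collection.py converts text into NTE instances. It uses C4 as the example, and the converted output can be inspected in cuckoo.c4.example.json. The script is easy to adapt to other resources such as entities, queries, and questions, so you can convert your own data into NTE format to train your own Cuckoo. Run run_cuckoo.sh to try an example pre-training run: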

python run_ner.py \
  --model_name_or_path roberta-large \
  --train_file cuckoo.c4.example.json \
  --output_dir models/cuckoo-c4-example \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --save_steps 1000 \
  --learning_rate 0.00001 \
  --do_train \
  --overwrite_output_dir

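You will get an example Cuckoo in models/cuckoo-c4-example. It may not perform well if the pre-training data is too small. You can adjust the hyperparameters in nte_data_collection.py or modify the conversion logic to fit your own resources and improve pre-training performance.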
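When judging whether the pre-training data is "too small", it helps to translate the flags in the command above into instance counts. The sketch below assumes run_ner.py follows Hugging Face Trainer semantics, where one optimizer step consumes per-device batch size × gradient accumulation steps × number of devices instances, and save_steps counts optimizer steps. The device count is an assumption, since the source does not state the hardware.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Instances consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def instances_per_checkpoint(save_steps, per_device_batch, grad_accum_steps, num_devices=1):
    """Instances seen between two saved checkpoints."""
    return save_steps * effective_batch_size(per_device_batch, grad_accum_steps, num_devices)


# Values taken from the example command; num_devices=1 is assumed.
batch = effective_batch_size(per_device_batch=4, grad_accum_steps=16)
per_save = instances_per_checkpoint(save_steps=1000, per_device_batch=4, grad_accum_steps=16)
print(batch)     # 64
print(per_save)  # 64000
```

Under these assumptions, an example file with fewer than 64,000 instances would finish its single epoch before the first intermediate checkpoint is written.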

🐾 Citation

@article{DBLP:journals/corr/abs-2502-11275,
  author       = {Letian Peng and
                  Zilong Wang and
                  Feng Yao and
                  Jingbo Shang},
  title        = {Cuckoo: An {IE} Free Rider Hatched by Massive Nutrition in {LLM}'s Nest},
  journal      = {CoRR},
  volume       = {abs/2502.11275},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2502.11275},
  doi          = {10.48550/arXiv.2502.11275},
  eprinttype   = {arXiv},
  eprint       = {2502.11275},
  timestamp    = {Mon, 17 Feb 2025 19:32:20 +0000},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2502-11275.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}