Cuckoo-C4-Super-Rainbow: an open-source information extraction model that strengthens itself with any text resource

Cuckoo C4 Super Rainbow

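Developed by KomeijiForce

Cuckoo is a 300M-parameter information extraction model that imitates the next-token prediction paradigm of large language models, which lets it use all kinds of text resources to strengthen itself.

Large language model

Transformers

License: Apache-2.0 #IE free rider #few-shot adaptation #next-token prediction

Downloads: 159

Released: 2/16/2025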

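Model overview

Cuckoo is a small but efficient information extraction model. It predicts the next tokens by tagging them in the given input context rather than retrieving them from the vocabulary. This makes it particularly good at strengthening itself with data prepared for large language models.

Model features

Self-enhancement

It can use any text resource to strengthen itself, and is especially good at exploiting data prepared for large language models.

Few-shot adaptation

It adapts well from few examples and reaches good performance with only a small amount of labeled data.

Multi-task handling

It handles multiple information extraction tasks, including entity recognition and relation extraction.

Model capabilities

Entity recognition

Relation extraction

Next-token prediction

Information extraction

Few-shot learning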

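Use cases

Text understanding

Person and place recognition

Identify the people and places mentioned in a text

Example output: What is the person mentioned here? ['Tom', 'Jack']

Event understanding

Understand the events and activities described in a text

Example output: What do Tom and Jack go to Paris for? ['trip']

Knowledge question answering

Attribute lookup

Answer simple questions about the attributes of an object

Example output: Grass ['green']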

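🚀 Cuckoo

Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models. Instead of retrieving the next tokens from the vocabulary, it predicts them by tagging them in the given input context. As a result it can use any text resource to improve itself, and in particular can learn efficiently from data curated for large language models.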
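To make the tagging idea concrete, here is a minimal sketch that does not call the model. It assumes the B/I/O labels used in the fine-tuning data format described later in this document (B for the first extracted word, I for a continuation, O for everything else). The helper name `spans_from_bio` is hypothetical and is not part of the Cuckoo code base.

```python
from typing import List

def spans_from_bio(words: List[str], tags: List[str]) -> List[str]:
    """Collect the word spans marked by B/I tags; O words are skipped."""
    spans: List[str] = []
    current: List[str] = []
    for word, tag in zip(words, tags):
        if tag == "B":                 # a new span starts here
            if current:
                spans.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:   # continue the span that is open
            current.append(word)
        else:                          # "O" (or a stray "I") closes any open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# The answer is selected from the input itself, not generated from a vocabulary.
words = ["Tom", "and", "Jack", "went", "to", "Paris", ".", "Person", ":"]
tags = ["B", "O", "B", "O", "O", "O", "O", "O", "O"]
print(spans_from_bio(words, tags))  # ['Tom', 'Jack']
```

Because every answer is a span of the input, the output can never contain text that is absent from the context.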

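🚀 Quick start

Try next tokens extraction in a few lines

We recommend the strongest checkpoint, Cuckoo-C4-Super-Rainbow, for zero-shot extraction.

1️⃣ First, load the model and the tokenizer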

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

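# spaCy is only used to split the input into tokens before tagging.
nlp = spacy.load("en_core_web_sm")

# Use the first GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)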

2️⃣ Define the next tokens extraction function

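def next_tokens_extraction(text):

    def find_sequences(lst):
        # Collect half-open (start, end) index pairs: label 0 opens a span
        # and consecutive label 1 entries extend it.
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end+1))
            else:
                i += 1
        return sequences

    # Re-join the spaCy tokens with single spaces so punctuation is split off.
    text = " ".join([token.text for token in nlp(text)])

    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Inference only, so gradients are not needed.
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1).tolist()

    # Decode each tagged span of input token ids back into text.
    predictions = [tokenizer.decode(inputs.input_ids[0, start:end]).strip() for start, end in find_sequences(tag_predictions)]

    return predictions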
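The span-finding step can be checked on its own, without loading the model. The sketch below is a standalone copy of the inner `find_sequences` helper with the same behavior. Reading label id 0 as "begin", 1 as "inside", and any other id as "outside" is inferred from the code above, not from the model configuration.

```python
from typing import List, Sequence, Tuple

def find_sequences(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) pairs for each run that starts with
    label 0 and continues through consecutive 1 labels."""
    sequences = []
    i = 0
    while i < len(labels):
        if labels[i] == 0:
            start = i
            i += 1
            while i < len(labels) and labels[i] == 1:
                i += 1
            sequences.append((start, i))
        else:
            i += 1
    return sequences

print(find_sequences([2, 0, 1, 1, 2, 0, 2]))  # [(1, 4), (5, 6)]
print(find_sequences([0, 0, 1]))              # [(0, 1), (1, 3)]
print(find_sequences([1, 2, 1]))              # []
```

Note that a 1 label with no preceding 0 is ignored, and two adjacent 0 labels produce two separate spans.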

3️⃣ Call the function to extract!

Basic usage

text = "Tom and Jack went to their trip in Paris."

for question in [
    "What is the person mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Where does George live in?",
]:
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)

You will get output similar to the following:

What is the person mentioned here? ['Tom', 'Jack']
What is the city mentioned here? ['Paris']
Who goes with Tom together? ['Jack']
What do Tom and Jack go to Paris for? ['trip']
Where does George live in? []

Here, [] means Cuckoo thinks there are no next tokens to extract.
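The prompt layout used in the loop above can be factored into a small helper so that every question is formatted identically. This is a sketch: `build_prompt`, `ask_all`, and the stand-in extractor `no_answer` are hypothetical names introduced for illustration. In real use you would pass `next_tokens_extraction` as the `extract` argument.

```python
from typing import Callable, Dict, List

def build_prompt(context: str, question: str) -> str:
    """Format one context/question pair the way the quick-start loop does."""
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"

def ask_all(context: str, questions: List[str],
            extract: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Run every question against the same context and collect the answers."""
    return {q: extract(build_prompt(context, q)) for q in questions}

# Stand-in extractor so this sketch runs without the model (hypothetical):
# it always returns an empty list, the "nothing to extract" case.
def no_answer(prompt: str) -> List[str]:
    return []

results = ask_all("Tom and Jack went to their trip in Paris.",
                  ["Where does George live in?"], no_answer)
print(results)  # {'Where does George live in?': []}
```

An empty list is a normal result, so downstream code should treat it as "no answer in this context" and not as an error.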

Advanced usage

passage = f'''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)

You will get output similar to the following:

What are the people mentioned here? ['Ludwig van Beethoven', 'Joseph Haydn', 'Wolfgang Amadeus Mozart']
What is the job of Beethoven? ['composer and pianist']
How famous is Beethoven? ['one of the most revered figures in the history of Western music']
When did Beethoven's middle period showed an individual development? ['1802']

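Few-shot adaptation

Cuckoo adapts well to your own tasks from a handful of labeled examples. Two examples:

CoNLL2003: run bash run_downstream.sh conll2003.5shot KomeijiForce/Cuckoo-C4-Rainbow to get a fine-tuned model in models/cuckoo-conll2003.5shot. Then benchmark it with python eval_conll2003.py; the F1 score should be around 80.
Machine reading comprehension (SQuAD): run bash run_downstream.sh squad.32shot KomeijiForce/Cuckoo-C4-Rainbow to get a fine-tuned model in models/cuckoo-squad.32shot. Then benchmark it with python eval_squad.py; the F1 score should be around 88.

To fine-tune on your own task, create a JSON Lines file in which each line has the form {"words": [...], "ner": [...]}, for example: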

{"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}

This means "John Smith" is to be predicted as the next tokens.
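A line in this format can be produced from a word list and the position of the answer span. The sketch below reproduces the example line above. `make_instance` is a hypothetical helper name, and the half-open `start`/`end` word indices are a convention chosen for this sketch.

```python
import json
from typing import Dict, List

def make_instance(words: List[str], start: int, end: int) -> Dict[str, List[str]]:
    """Tag words[start:end] as the span to extract (B, then I); all else is O."""
    if not 0 <= start < end <= len(words):
        raise ValueError("span must lie inside the word list")
    ner = ["O"] * len(words)
    ner[start] = "B"
    for i in range(start + 1, end):
        ner[i] = "I"
    return {"words": words, "ner": ner}

# Context words followed by the prompt words "Person" and ":".
words = ["I", "am", "John", "Smith", ".", "Person", ":"]
line = json.dumps(make_instance(words, start=2, end=4))
print(line)
# {"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}
```

Writing one such line per training instance to a file gives the JSON Lines input described above.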

The following prompts are a starting point for building your own downstream dataset:

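Type	User input	Assistant response
Entity	User: [Context] Question: What is the [Label] mentioned?	Assistant: Answer: [Label] is
Relation (Kill)	User: [Context] Question: Who does [Entity] kill?	Assistant: Answer: [Entity] kills
Relation (Live)	User: [Context] Question: Where does [Entity] live in?	Assistant: Answer: [Entity] lives in
Relation (Work)	User: [Context] Question: Who does [Entity] work for?	Assistant: Answer: [Entity] works for
Relation (Located)	User: [Context] Question: Where is [Entity] located in?	Assistant: Answer: [Entity] is located in
Relation (Based)	User: [Context] Question: Where is [Entity] based in?	Assistant: Answer: [Entity] is based in
Relation (Adverse)	User: [Context] Question: What is the adverse effect of [Entity]?	Assistant: Answer: The adverse effect of [Entity] is
Query	User: [Context] Question: [Question]	Assistant: Answer:
Instruction (Entity)	User: [Context] Question: What is the [Label] mentioned? ([Instruction])	Assistant: Answer: [Label] is
Instruction (Query)	User: [Context] Question: [Question] ([Instruction])	Assistant: Answer: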
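The templates above can be filled in programmatically. The sketch below covers a few rows of the table. The `TEMPLATES` dictionary and the `render` function are hypothetical, the English wording is a translation of the table, and the single-space separators are an assumption (the quick-start prompt uses blank lines between parts).

```python
from typing import Dict, Tuple

# (question template, answer-prefix template) for a few rows of the table.
TEMPLATES: Dict[str, Tuple[str, str]] = {
    "entity": ("What is the {label} mentioned?", "{label} is"),
    "relation_live": ("Where does {entity} live in?", "{entity} lives in"),
    "relation_work": ("Who does {entity} work for?", "{entity} works for"),
    "query": ("{question}", ""),
}

def render(kind: str, context: str, **slots: str) -> Tuple[str, str]:
    """Return the (user input, assistant response prefix) pair for one row."""
    question_t, answer_t = TEMPLATES[kind]
    user = f"User: {context} Question: {question_t.format(**slots)}"
    # rstrip() drops the trailing space when the answer prefix is empty.
    assistant = f"Assistant: Answer: {answer_t.format(**slots)}".rstrip()
    return user, assistant

user, assistant = render("entity", "Tom and Jack went to Paris.", label="person")
print(user)       # User: Tom and Jack went to Paris. Question: What is the person mentioned?
print(assistant)  # Assistant: Answer: person is
```

The words that follow the assistant prefix in the context are the ones to tag as B/I in the training instance.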

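Once you have built your downstream dataset, save it as my_downstream.json and run bash run_downstream.sh my_downstream KomeijiForce/Cuckoo-C4-Rainbow. The adapted Cuckoo model will be in models/cuckoo-my_downstream.

Train your own Cuckoo

The file nte_data_collection.py contains our script for converting text into NTE instances. It uses C4 as the example, and the converted output can be inspected in cuckoo.c4.example.json. The script is easy to adapt to other resources such as entities, queries, and questions, so you can convert your own data to the NTE format and train your own Cuckoo. Run the run_cuckoo.sh script for an example pre-training run: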

python run_ner.py \
  --model_name_or_path roberta-large \
  --train_file cuckoo.c4.example.json \
  --output_dir models/cuckoo-c4-example \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --save_steps 1000 \
  --learning_rate 0.00001 \
  --do_train \
  --overwrite_output_dir

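You will get an example Cuckoo model in models/cuckoo-c4-example. It may not perform well if it is pre-trained on too little data. You can adjust the hyperparameters in nte_data_collection.py or modify the conversion logic for better pre-training performance.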
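When adjusting the training command, it helps to keep the effective batch size in mind: with gradient accumulation, the optimizer steps once per `per_device_train_batch_size × gradient_accumulation_steps` examples on each device. The short calculation below uses the values from the command above. The single-device setting is an assumption, since the document does not state the hardware.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_device * accumulation_steps * num_devices

# Values taken from the run_ner.py command; num_devices=1 is an assumption.
batch = effective_batch_size(per_device=4, accumulation_steps=16, num_devices=1)
print(batch)  # 64
```

If memory forces a smaller per-device batch, raising the accumulation steps by the same factor keeps this number unchanged.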

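✨ Key features

A new information extraction paradigm: Cuckoo imitates the next-token prediction paradigm of large language models and predicts the next tokens by tagging them in the given input context, which differs substantially from conventional IE pre-training.
Efficient use of data: it can improve itself with any text resource, and in particular learns efficiently from data curated for large language models.
Adaptable across scenarios: it supports zero-shot extraction and few-shot adaptation, so it adapts quickly to different information extraction tasks.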

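📚 Documentation

Performance

The table below compares Cuckoo with other models on named entity recognition, relation extraction, and reading comprehension benchmarks: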

Model	CoNLL2003	BioNLP2004	MIT-Restaurant	MIT-Movie	Avg. (NER)	CoNLL2004	ADE	Avg. (RE)	SQuAD	SQuAD-V2	DROP	Avg. (MRC)
OPT-C4-TuluV3	50.24	39.76	58.91	56.33	50.56	47.14	45.66	46.40	39.80	53.81	31.00	41.54
RoBERTa	33.75	32.91	62.15	58.32	46.80	34.16	2.15	18.15	31.86	48.55	9.16	29.86
MRQA	72.45	55.93	68.68	66.26	65.83	66.23	67.44	66.84	80.07	66.22	54.46	66.92
MultiNERD	66.78	54.62	64.16	66.30	60.59	57.52	45.10	51.31	42.85	50.99	30.12	41.32
NuNER	74.15	56.36	68.57	64.88	65.99	65.12	63.71	64.42	61.60	52.67	37.37	50.55
MetaIE	71.33	55.63	70.08	65.23	65.57	64.81	64.40	64.61	74.59	62.54	30.73	55.95
Cuckoo 🐦🛠️	73.60	57.00	67.63	67.12	66.34	69.57	71.70	70.63	77.47	64.06	54.25	65.26
└─ Only Pre-train 🐦	72.46	55.87	66.87	67.23	65.61	68.14	69.39	68.77	75.64	63.36	52.81	63.94
└─ Only Post-train	72.80	56.10	66.02	67.10	65.51	68.66	69.75	69.21	77.05	62.39	54.80	64.75
Rainbow Cuckoo 🌈🐦🛠️	79.94	58.39	70.30	67.00	68.91	70.47	76.05	73.26	86.57	69.41	64.64	73.54

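Available pre-trained models

The currently open-sourced Cuckoo checkpoints are pre-trained on:

100M next tokens extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
Cuckoo-C4 + 2.6M next tokens extraction (NTE) instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD, DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
Cuckoo-C4-Rainbow + multiple NER datasets, WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)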
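Each checkpoint builds on the previous one, so the full training data of a checkpoint is the union of everything added along the chain. The sketch below encodes the list above as plain data and walks that chain. The `CHECKPOINTS` structure and the `training_data` function are hypothetical, introduced only to illustrate the lineage.

```python
from typing import Dict, List, Optional, Tuple

# checkpoint name -> (checkpoint it builds on, data added at this stage)
CHECKPOINTS: Dict[str, Tuple[Optional[str], List[str]]] = {
    "Cuckoo-C4": (None, ["100M NTE instances converted from C4"]),
    "Cuckoo-C4-Instruct": ("Cuckoo-C4", ["2.6M NTE instances converted from TuluV3"]),
    "Cuckoo-C4-Rainbow": ("Cuckoo-C4-Instruct",
                          ["MultiNERD", "MetaIE", "NuNER", "MRQA (excluding SQuAD, DROP)"]),
    "Cuckoo-C4-Super-Rainbow": ("Cuckoo-C4-Rainbow",
                                ["multiple NER datasets", "WizardLM dataset",
                                 "multiple-choice QA datasets", "MMLU", "SQuAD",
                                 "DROP", "MNLI", "SNLI"]),
}

def training_data(name: str) -> List[str]:
    """List all data sources a checkpoint has seen, oldest stage first."""
    base, added = CHECKPOINTS[name]
    inherited = training_data(base) if base else []
    return inherited + added

print(training_data("Cuckoo-C4-Instruct"))
# ['100M NTE instances converted from C4', '2.6M NTE instances converted from TuluV3']
```

This makes it easy to check, for example, that SQuAD and DROP only enter at the Super-Rainbow stage, which matters when using those datasets for evaluation.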

📄 License

This project is released under the Apache-2.0 license.

🐾 Citation

@article{DBLP:journals/corr/abs-2502-11275,
  author       = {Letian Peng and
                  Zilong Wang and
                  Feng Yao and
                  Jingbo Shang},
  title        = {Cuckoo: An {IE} Free Rider Hatched by Massive Nutrition in {LLM}'s Nest},
  journal      = {CoRR},
  volume       = {abs/2502.11275},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2502.11275},
  doi          = {10.48550/arXiv.2502.11275},
  eprinttype   = {arXiv},
  eprint       = {2502.11275},
  timestamp    = {Mon, 17 Feb 2025 19:32:20 +0000},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2502-11275.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}