Open-source Cuckoo-C4-Instruct model: information extraction as efficient as LLM next-token prediction

Cuckoo C4 Instruct

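Developed by KomeijiForce

Super Rainbow Cuckoo is a small information extraction model built on the next tokens extraction (NTE) paradigm. It extracts information efficiently by imitating the way large language models predict tokens.

Question answering

Transformers

License: MIT #information-extraction #few-shot-adaptation #question-answering

Downloads: 17

Released: February 16, 2025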

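Model overview

Cuckoo is a small information extraction (IE) model with 300M parameters that applies the next-token prediction paradigm to information extraction. Instead of retrieving the next tokens from a vocabulary, Cuckoo predicts them by tagging them in the given context, which lets it improve itself with any kind of text resource.

Model features

Next tokens extraction paradigm

Imitates the next-token prediction of large language models and extracts information by tagging the next tokens in the context

Self-improvement

Can use any text resource to improve itself, in particular data curated for large language models

Efficient adaptation

Adapts well in few-shot settings and can be tuned quickly to a specific task

Multi-task integration

Integrates datasets from several information extraction tasks, including NER and QA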

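Model capabilities

Named entity recognition

Relation extraction

Question answering

Information extraction

Few-shot learning

Use cases

Knowledge extraction

Entity recognition

Identify named entities in text, such as person and location names

F1 of 88.38 on CoNLL2003

Relation extraction

Identify relations between entities, such as place of residence or employer

Question answering

Reading comprehension

Extract the answer to a question from a given text

F1 of 89.54 on SQuAD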

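🚀 Cuckoo 🐦

The Cuckoo family is a series of extractive question answering models aimed at information extraction tasks.

🚀 Quick start

The Cuckoo models are the extractive question answering models described in the paper Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest.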

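Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models. Instead of retrieving from the vocabulary, Cuckoo predicts the next tokens by tagging them in the given input context, as illustrated in the figure below.

[Figure: Cuckoo]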

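Cuckoo differs substantially from previous IE pre-training because it can use any text resource to improve itself, especially the data carefully curated for large language models!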
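To make the tagging idea concrete, here is a minimal sketch of how a piece of text could be turned into one NTE instance. It assumes a pre-split word list and uses the {"words": [...], "ner": [...]} record layout with B/I/O tags that this page shows for fine-tuning data. The function name `make_nte_instance` is hypothetical, and the actual conversion in nte_data_collection.py may work differently.

```python
def make_nte_instance(words, next_words):
    """Tag each occurrence of `next_words` inside `words` with B/I; all other words get O."""
    tags = ["O"] * len(words)
    n = len(next_words)
    i = 0
    while n > 0 and i <= len(words) - n:
        if words[i:i + n] == next_words:
            tags[i] = "B"                      # first token of the extracted span
            for k in range(i + 1, i + n):
                tags[k] = "I"                  # remaining tokens of the span
            i += n                             # skip past the match so spans never overlap
        else:
            i += 1
    return {"words": list(words), "ner": tags}


# The context ends with the prompt "Person :" and the next tokens are "John Smith".
instance = make_nte_instance(
    ["I", "am", "John", "Smith", ".", "Person", ":"],
    ["John", "Smith"],
)
print(instance["ner"])  # ['O', 'O', 'B', 'I', 'O', 'O', 'O']
```

Because the target is a span that already appears in the context, any text whose continuation repeats an earlier phrase can in principle supply a training instance without manual annotation.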

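Currently, we open-source checkpoints of Cuckoo models pre-trained on:

100M next tokens extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
Cuckoo-C4 + 2.6M next tokens extraction (NTE) instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD, DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
Cuckoo-C4-Rainbow + multiple named entity recognition (NER) datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)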

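✨ Key features

Performance demonstration 🚀

Begin your journey with Cuckoo and see how efficiently it adapts to all kinds of IE tasks! The tables below report few-shot results on NER, relation extraction, and QA benchmarks.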

Model	CoNLL2003	BioNLP2004	MIT-Restaurant	MIT-Movie	Avg. (NER)	CoNLL2004	ADE	Avg. (RE)	SQuAD	SQuAD-V2	DROP	Avg. (QA)
OPT-C4-TuluV3	50.24	39.76	58.91	56.33	50.56	47.14	45.66	46.40	39.80	53.81	31.00	41.54
RoBERTa	33.75	32.91	62.15	58.32	46.80	34.16	2.15	18.15	31.86	48.55	9.16	29.86
MRQA	72.45	55.93	68.68	66.26	65.83	66.23	67.44	66.84	80.07	66.22	54.46	66.92
MultiNERD	66.78	54.62	64.16	66.30	60.59	57.52	45.10	51.31	42.85	50.99	30.12	41.32
NuNER	74.15	56.36	68.57	64.88	65.99	65.12	63.71	64.42	61.60	52.67	37.37	50.55
MetaIE	71.33	55.63	70.08	65.23	65.57	64.81	64.40	64.61	74.59	62.54	30.73	55.95
Cuckoo 🐦🛠️	73.60	57.00	67.63	67.12	66.34	69.57	71.70	70.63	77.47	64.06	54.25	65.26
└─ Only pre-train 🐦	72.46	55.87	66.87	67.23	65.61	68.14	69.39	68.77	75.64	63.36	52.81	63.94
└─ Only post-train	72.80	56.10	66.02	67.10	65.51	68.66	69.75	69.21	77.05	62.39	54.80	64.75
Rainbow Cuckoo 🌈🐦🛠️	79.94	58.39	70.30	67.00	68.91	70.47	76.05	73.26	86.57	69.41	64.64	73.54

(Super Rainbow Cuckoo 🦸🌈🐦🛠️ uses training sets, except for CoNLL2004 and ADE, to boost its performance)

Model	CoNLL2003	BioNLP2004	MIT-Restaurant	MIT-Movie	Avg. (NER)	CoNLL2004	ADE	Avg. (RE)	SQuAD	SQuAD-V2	DROP	Avg. (QA)
Super Rainbow Cuckoo 🦸🌈🐦🛠️	88.38	68.33	76.79	69.39	75.22	72.96	80.06	76.51	89.54	74.52	74.89	79.65

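💻 Usage examples

Quick experience with Cuckoo in next tokens extraction ⚡

We recommend the strongest model, Super Rainbow Cuckoo 🦸🌈🐦🛠️, for zero-shot extraction. The following example can be run directly from case_next_tokens_extraction.py.

Basic usage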

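# Load the tagger model and tokenizer first
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

# spaCy is only used to split the text into words before tokenization
nlp = spacy.load("en_core_web_sm")

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)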

# Define the next tokens extraction function
def next_tokens_extraction(text):

    def find_sequences(lst):
        # Collect (start, end) token ranges: label 0 opens a span, label 1 continues it
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end+1))
            else:
                i += 1
        return sequences

    # Re-join the spaCy tokens with single spaces so words and punctuation are separated
    text = " ".join([token.text for token in nlp(text)])

    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Inference only, so no gradients are needed
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1).tolist()

    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]

    return predictions
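The span-decoding step can be tried without downloading the model. The sketch below repeats the same logic on a hand-written tag sequence. The function name `decode_spans` and the toy tags are illustrative assumptions. The label convention (0 opens a span, 1 continues it, anything else is outside) is inferred from the checks in `find_sequences`, not from the model's configuration.

```python
def decode_spans(tag_ids, begin_id=0, inside_id=1):
    """Return (start, end) ranges, end exclusive, for every begin/inside run in `tag_ids`."""
    spans = []
    i = 0
    while i < len(tag_ids):
        if tag_ids[i] == begin_id:
            start = i
            i += 1
            # Extend the span while the following tags continue it
            while i < len(tag_ids) and tag_ids[i] == inside_id:
                i += 1
            spans.append((start, i))
        else:
            i += 1
    return spans


# Toy example: hand-written tags, not real model output
tokens = ["Ludwig", "van", "Beethoven", "was", "a", "German", "composer"]
tag_ids = [0, 1, 1, 2, 2, 2, 0]
spans = decode_spans(tag_ids)
print([" ".join(tokens[s:e]) for s, e in spans])  # ['Ludwig van Beethoven', 'composer']
```

An inside tag that does not follow a begin tag is ignored, which matches the behavior of the function above.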

Advanced usage

# Call the function to extract!
# Case 1: Basic entity and relation understanding
text = "Tom and Jack went to their trip in Paris."

for question in [
    "What is the person mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Where does George live in?",
]:
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)

# Case 2: Longer context
passage = f'''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)

# Case 3: Knowledge quiz
for obj in ["grass", "sea", "fire", "night"]:
    text = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(text)
    print(obj, predictions)

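Few-shot adaptation 🎯

Cuckoo 🐦 is an expert at few-shot adaptation to your own tasks. Taking CoNLL2003 as an example, run bash run_downstream.sh conll2003.5shot KomeijiForce/Cuckoo-C4-Rainbow to get a fine-tuned model in models/cuckoo-conll2003.5shot. You can then benchmark the model with python eval_conll2003.py, which should report an F1 of around 80.

You can also train an adaptation to machine reading comprehension (SQuAD). Run bash run_downstream.sh squad.32shot KomeijiForce/Cuckoo-C4-Rainbow to get a fine-tuned model in models/cuckoo-squad.32shot. You can then benchmark the model with python eval_squad.py, which should report an F1 of around 88.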
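For readers unfamiliar with the metric, the sketch below shows a token-overlap F1 of the kind commonly reported for SQuAD-style answers. It is an illustration under simplifying assumptions (lowercasing and whitespace splitting only) and is not necessarily what eval_squad.py or eval_conll2003.py compute. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("German composer", "German composer and pianist")
print(round(score, 4))  # 0.6667
```

A partially correct span earns partial credit here, whereas entity-level F1 for NER usually counts only exact span matches.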

To fine-tune on your own task, create a JSON Lines file in which each line has the form {"words": [...], "ner": [...]}, for example:

{"words": ["I", "am", "John", "Smith", ".", "Person", ":"], "ner": ["O", "O", "B", "I", "O", "O", "O"]}

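This indicates that "John Smith" should be predicted as the next tokens: "B" marks the first token of the span, "I" marks its continuation, and "O" marks everything else.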
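Before training, it can help to check each record and write the file programmatically. The following is a minimal sketch using only the standard library. The helper names `validate_record` and `write_jsonl` are hypothetical, and the rule that "I" must follow "B" or "I" is the usual BIO convention, assumed here rather than confirmed by the training script.

```python
import json
import os
import tempfile

ALLOWED_TAGS = {"B", "I", "O"}


def validate_record(record):
    """Raise ValueError if the record does not look like a well-formed words/ner pair."""
    words, tags = record["words"], record["ner"]
    if len(words) != len(tags):
        raise ValueError("words and ner must have the same length")
    unknown = set(tags) - ALLOWED_TAGS
    if unknown:
        raise ValueError(f"unexpected tags: {sorted(unknown)}")
    for i, tag in enumerate(tags):
        # Assumed BIO rule: a span cannot start with "I"
        if tag == "I" and (i == 0 or tags[i - 1] == "O"):
            raise ValueError(f"'I' at position {i} does not follow 'B' or 'I'")
    return record


def write_jsonl(records, path):
    """Write one validated JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(validate_record(record), ensure_ascii=False) + "\n")


records = [
    {"words": ["I", "am", "John", "Smith", ".", "Person", ":"],
     "ner": ["O", "O", "B", "I", "O", "O", "O"]},
]
# A temporary directory keeps the example free of side effects
path = os.path.join(tempfile.mkdtemp(), "my_downstream.json")
write_jsonl(records, path)

with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded == records)  # True
```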

Here are some prompt templates to get you started:

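Type	User input	Assistant response
Entity	User: [Context] Question: What is the [Label] mentioned?	Assistant: Answer: The [Label] is
Relation (Kill)	User: [Context] Question: Who does [Entity] kill?	Assistant: Answer: [Entity] kills
Relation (Live)	User: [Context] Question: Where does [Entity] live in?	Assistant: Answer: [Entity] lives in
Relation (Work)	User: [Context] Question: Who does [Entity] work for?	Assistant: Answer: [Entity] works for
Relation (Located)	User: [Context] Question: Where is [Entity] located in?	Assistant: Answer: [Entity] is located in
Relation (Based)	User: [Context] Question: Where is [Entity] based in?	Assistant: Answer: [Entity] is based in
Relation (Adverse)	User: [Context] Question: What is the adverse effect of [Entity]?	Assistant: Answer: The adverse effect of [Entity] is
Query	User: [Context] Question: [Question]	Assistant: Answer:
Instruction (Entity)	User: [Context] Question: What is the [Label] mentioned? ([Instruction])	Assistant: Answer: The [Label] is
Instruction (Query)	User: [Context] Question: [Question] ([Instruction])	Assistant: Answer: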
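The templates can be filled in with a small helper. The sketch below covers the entity row and the "live" relation row, using English phrasings of those rows. The function names are hypothetical, and the whitespace between turns follows the usage examples earlier on this page ("User:\n\n...\n\nAssistant:"), which is an assumption about how the templates should be laid out.

```python
def build_prompt(context, question, answer_prefix=None, instruction=None):
    """Assemble a User/Assistant prompt; `answer_prefix` starts the assistant's answer."""
    if instruction:
        question = f"{question} ({instruction})"
    prompt = f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"
    if answer_prefix is not None:
        # An empty prefix yields a bare "Answer:", as in the Query template
        prompt += "\n\nAnswer:" + (f" {answer_prefix}" if answer_prefix else "")
    return prompt


def entity_prompt(context, label, instruction=None):
    return build_prompt(context, f"What is the {label} mentioned?",
                        f"The {label} is", instruction)


def live_relation_prompt(context, entity):
    return build_prompt(context, f"Where does {entity} live in?", f"{entity} lives in")


print(entity_prompt("Tom lives in Paris.", "city"))
print(live_relation_prompt("Tom lives in Paris.", "Tom"))
```

Ending the prompt with the start of the answer sentence means the tokens to extract are the natural continuation of the text, which is the setting the tagger is trained for.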

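After building your own downstream dataset, save it as my_downstream.json and run bash run_downstream.sh my_downstream KomeijiForce/Cuckoo-C4-Rainbow. You will find the adapted Cuckoo model in models/cuckoo-my_downstream.

Fly your own Cuckoo 🪽

The script that converts text into NTE instances is included in nte_data_collection.py. It uses C4 as an example, and the converted results can be inspected in cuckoo.c4.example.json. The script is designed to be easy to adapt to other resources such as entities, queries, and questions, so you can convert your own data to NTE and fly your own Cuckoo! Run the run_cuckoo.sh script to try an example pre-training: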

python run_ner.py \
  --model_name_or_path roberta-large \
  --train_file cuckoo.c4.example.json \
  --output_dir models/cuckoo-c4-example \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --save_steps 1000 \
  --learning_rate 0.00001 \
  --do_train \
  --overwrite_output_dir

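You will get an example Cuckoo model in models/cuckoo-c4-example. It may not perform well if you pre-train on too little data. You can adjust the hyperparameters in nte_data_collection.py, or modify the conversion to fit your own resources, to get better pre-training performance.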
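As a rough guide to how much data the example command consumes, the arithmetic below derives the effective batch size and the number of examples seen between checkpoints from the flags shown above. The single-device setting (`num_devices = 1`) is an assumption; scale it to your own hardware.

```python
# Values copied from the example command
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
save_steps = 1000

# Assumption: training runs on a single device
num_devices = 1

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Examples processed before the first intermediate checkpoint is saved
examples_per_checkpoint = effective_batch_size * save_steps

print(effective_batch_size, examples_per_checkpoint)  # 64 64000
```

Under these assumptions, a one-epoch run on fewer than 64,000 instances would likely finish before reaching its first intermediate checkpoint.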

📚 Documentation

🐾 Citation

@article{DBLP:journals/corr/abs-2502-11275,
  author       = {Letian Peng and
                  Zilong Wang and
                  Feng Yao and
                  Jingbo Shang},
  title        = {Cuckoo: An {IE} Free Rider Hatched by Massive Nutrition in {LLM}'s Nest},
  journal      = {CoRR},
  volume       = {abs/2502.11275},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2502.11275},
  doi          = {10.48550/arXiv.2502.11275},
  eprinttype   = {arXiv},
  eprint       = {2502.11275},
  timestamp    = {Mon, 17 Feb 2025 19:32:20 +0000},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2502-11275.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}