Cuckoo-C4: an open-source information extraction model, small in size and efficient at extraction

Cuckoo C4

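Developed by KomeijiForce

Cuckoo is a small (300M-parameter) information extraction model that performs extraction efficiently by imitating the next-token prediction paradigm of large language models.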

Large Language Model

Transformers

License: MIT  #InformationExtraction #SmallAndEfficient #InstructionEnhanced

Downloads: 15

Released: 2/16/2025

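Model Overview

Cuckoo performs information extraction through a next-token prediction mechanism. It can improve itself using any kind of text resource and is especially good at absorbing data curated for large language models.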

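Model Features

Next-token prediction paradigm

Uses a prediction mechanism similar to that of large language models, extracting information by tagging the target tokens in the context.

Efficient data use

Can absorb any kind of text resource to improve itself, including data curated for large language models.

Multiple variants

Available in four variants (base, instruction-enhanced, Rainbow, and Super Rainbow) to suit different needs.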

Model Capabilities

Named entity recognition

Relation extraction

Question answering

Text understanding

Knowledge extraction

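Use Cases

Information extraction

Entity recognition

Identifies entities such as people, locations, and organizations in text.

79.94 F1 on CoNLL2003 (Rainbow Cuckoo)

Relation extraction

Identifies relations between entities.

70.47 F1 on CoNLL2004 (Rainbow Cuckoo)

Question answering

Reading comprehension

Answers questions based on the content of a text.

86.57 F1 on SQuAD (Rainbow Cuckoo)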

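🚀 Cuckoo 🐦

Cuckoo is a small (300M-parameter) information extraction (IE) model that imitates the next-token prediction paradigm of large language models. Instead of retrieving the next tokens from the vocabulary, it predicts them by tagging them in the given input context. This repository contains the model from the paper Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest.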

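Cuckoo differs substantially from previous IE pre-trained models because it can use any text resource to improve itself, especially data curated for large language models.

We have open-sourced Cuckoo checkpoints pre-trained on the following data:

100M next tokens extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
Cuckoo-C4 + 2.6M NTE instances converted from the supervised fine-tuning dataset TuluV3. (Cuckoo-C4-Instruct 🐦🛠️)
Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD and DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
Cuckoo-C4-Rainbow + multiple named entity recognition (NER) datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)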

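✨ Key Features

Next-token prediction paradigm: imitates the next-token prediction of large language models, but predicts the next tokens by tagging them in the input context instead of retrieving them from the vocabulary.
Efficient data use: can use any text resource to improve itself, in particular data curated for large language models.
Adaptable across tasks: performs well on a range of IE tasks such as entity recognition, relation understanding, and question answering.
Small model size: with only 300M parameters, it needs relatively little compute and runs inference quickly.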

🚀 Quick Start

Performance 🚀

Explore Cuckoo and see how efficiently it adapts to a variety of information extraction tasks.

Model	CoNLL2003	BioNLP2004	MIT-Restaurant	MIT-Movie	Avg. (NER)	CoNLL2004	ADE	Avg. (RE)	SQuAD	SQuAD-V2	DROP	Avg. (QA)
OPT - C4 - TuluV3	50.24	39.76	58.91	56.33	50.56	47.14	45.66	46.40	39.80	53.81	31.00	41.54
RoBERTa	33.75	32.91	62.15	58.32	46.80	34.16	2.15	18.15	31.86	48.55	9.16	29.86
MRQA	72.45	55.93	68.68	66.26	65.83	66.23	67.44	66.84	80.07	66.22	54.46	66.92
MultiNERD	66.78	54.62	64.16	66.30	60.59	57.52	45.10	51.31	42.85	50.99	30.12	41.32
NuNER	74.15	56.36	68.57	64.88	65.99	65.12	63.71	64.42	61.60	52.67	37.37	50.55
MetaIE	71.33	55.63	70.08	65.23	65.57	64.81	64.40	64.61	74.59	62.54	30.73	55.95
Cuckoo 🐦🛠️	73.60	57.00	67.63	67.12	66.34	69.57	71.70	70.63	77.47	64.06	54.25	65.26
└─ Only Pre-train 🐦	72.46	55.87	66.87	67.23	65.61	68.14	69.39	68.77	75.64	63.36	52.81	63.94
└─ Only Post-train	72.80	56.10	66.02	67.10	65.51	68.66	69.75	69.21	77.05	62.39	54.80	64.75
Rainbow Cuckoo 🌈🐦🛠️	79.94	58.39	70.30	67.00	68.91	70.47	76.05	73.26	86.57	69.41	64.64	73.54
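
The scores in the table are F1 values. As a reminder of how such a score is computed, here is a minimal sketch of a set-based F1 between predicted and gold extractions. The function name `span_f1`, the exact-match criterion, and the convention for two empty sets are assumptions made for illustration; the evaluation protocol behind the numbers above is not specified in this document.

```python
def span_f1(predicted, gold):
    """F1 between two collections of extracted strings, using exact match."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        # Assumed convention: nothing to extract and nothing extracted counts as perfect
        return 1.0
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)  # share of predictions that are correct
    recall = true_pos / len(gold_set)     # share of gold items that were found
    return 2 * precision * recall / (precision + recall)


# One correct prediction out of two, one gold item: precision 0.5, recall 1.0
print(round(span_f1(["Tom", "Jack"], ["Tom"]), 4))
```

Benchmark scripts usually aggregate counts over a whole dataset before computing F1, so this per-example version is only meant to show the formula.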

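Try Cuckoo's next tokens extraction ⚡

We recommend the strongest variant, Super Rainbow Cuckoo 🦸🌈🐦🛠️, for zero-shot extraction.

1️⃣ First, load the model and the tokenizer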

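from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

# spaCy is used only to tokenize the input text; install the English pipeline first:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Use the first GPU when available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)

2️⃣ Define the next tokens extraction function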

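def next_tokens_extraction(text):

    def find_sequences(lst):
        # Collect (start, end) index pairs for each span: a B tag (0)
        # followed by any number of I tags (1); end is exclusive
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end+1))
            else:
                i += 1
        return sequences

    # Normalize spacing by re-joining the spaCy tokens with single spaces
    text = " ".join([token.text for token in nlp(text)])

    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1).tolist()

    # Decode each tagged span of input ids back into text
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]

    return predictions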
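
To make the decoding step concrete, here is a minimal standalone sketch of the same span-finding logic applied to a plain list of label ids, with no model involved. It uses the label ids 0 = B, 1 = I, 2 = O, which match the `id2label` mapping in config.json; `decode_spans` and the toy token list are hypothetical names used only for illustration.

```python
B, I, O = 0, 1, 2  # label ids, matching id2label in config.json


def decode_spans(labels):
    """Return (start, end) pairs, end exclusive, for every B tag followed by I tags."""
    spans = []
    i = 0
    while i < len(labels):
        if labels[i] == B:
            start = i
            i += 1
            # Extend the span across consecutive I tags
            while i < len(labels) and labels[i] == I:
                i += 1
            spans.append((start, i))
        else:
            # O tags, and I tags with no preceding B, are skipped
            i += 1
    return spans


tokens = ["Tom", "and", "Jack", "went", "to", "New", "York"]
labels = [B, O, B, O, O, B, I]
for start, end in decode_spans(labels):
    print(" ".join(tokens[start:end]))
```

With these toy labels the loop prints three spans: Tom, Jack, and New York. An I tag that does not follow a B tag is ignored, which is also how `find_sequences` behaves.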

3️⃣ Call the function to extract!

Case 1: Basic entity and relation understanding

text = "Tom and Jack went to their trip in Paris."

for question in [
    "What are the people mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Which city does George live in?",
]:
    # Build the prompt in a separate variable so `text` is not overwritten between questions
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)

You should get results similar to the following:

What are the people mentioned here? ['Tom', 'Jack']
What is the city mentioned here? ['Paris']
Who goes with Tom together? ['Jack']
What do Tom and Jack go to Paris for? ['trip']
Which city does George live in? []

Here, [] means that Cuckoo finds no next tokens to extract.

Case 2: Longer context
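
The prompt layout used in this case can be wrapped in a small helper, so that the source text and the formatted prompt stay in separate variables and several questions can be asked about the same text. This is a minimal sketch; `build_prompt` is a hypothetical name, and the template is copied from the example above.

```python
def build_prompt(context, question):
    """Format a context and a question in the User / Question / Assistant layout."""
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"


context = "Tom and Jack went to their trip in Paris."
for question in ["What is the city mentioned here?", "Who goes with Tom together?"]:
    prompt = build_prompt(context, question)
    print(repr(prompt))
```

Each prompt can then be passed to `next_tokens_extraction` as in the example.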

passage = f'''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)

You should get results similar to the following:

What are the people mentioned here? ['Ludwig van Beethoven', 'Joseph Haydn', 'Wolfgang Amadeus Mozart']
What is the job of Beethoven? ['composer and pianist']
How famous is Beethoven? ['one of the most revered figures in the history of Western music']
When did Beethoven's middle period showed an individual development? ['1802']

Case 3: Knowledge quiz

for obj in ["grass", "sea", "fire", "night"]:
    text = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(text)
    print(obj, predictions)

You should get results similar to the following:

grass ['green']
sea ['blue']
fire ['red']
night []

This shows that Cuckoo does not simply extract plausible spans; it draws on knowledge to understand the context.
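
Because the model can only tag tokens that appear in its input, the candidate answers have to be listed inside the prompt. A minimal sketch of a helper that builds this multiple-choice layout is shown here; `build_choice_prompt` is a hypothetical name, and the template (including the period after the last choice) is copied from Case 3.

```python
def build_choice_prompt(choices, question):
    """Format candidate answers and a question in the multiple-choice layout of Case 3."""
    # One choice per line, with a period closing the list as in the example
    choice_block = "\n".join(choices) + "."
    return f"User:\n\nChoices:\n{choice_block}\n\nQuestion: {question}\n\nAssistant:\n\nAnswer:"


prompt = build_choice_prompt(["red", "blue", "green"], "What is the color of the grass?")
print(prompt)
```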

📚 Detailed Documentation

File Information

The repository contains the following files:

special_tokens_map.json

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer_config.json

{
  "add_prefix_space": true,
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "50264": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "errors": "replace",
  "mask_token": "<mask>",
  "max_length": 512,
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "stride": 0,
  "tokenizer_class": "RobertaTokenizer",
  "trim_offsets": true,
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "<unk>"
}
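
The tokenizer configuration above sets `model_max_length` to 512, so a prompt longer than that cannot be processed in one pass. One common workaround, which is not described in the source, is to split a long passage into overlapping windows and run extraction on each window. The sketch below splits on whitespace-separated words as a rough stand-in for tokens; subword token counts are usually higher, and the question also takes up part of the budget, so `max_words` should stay well below 512. `chunk_words` is a hypothetical helper.

```python
def chunk_words(words, max_words, overlap=0):
    """Split a list of words into windows of at most max_words, sharing `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + max_words])
        # Stop once a window reaches the end, to avoid a redundant trailing window
        if start + max_words >= len(words):
            break
    return chunks


passage = "He is one of the most revered figures in the history of Western music"
for window in chunk_words(passage.split(), max_words=6, overlap=2):
    print(" ".join(window))
```

The overlap reduces the chance that an answer span is cut in half at a window boundary; predictions from different windows would still need to be merged and deduplicated.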

merges.txt

Content: "The file is larger than 50 KB and is too long to display."

vocab.json

Content: "The file is larger than 50 KB and is too long to display."

config.json

{
  "_name_or_path": "models/ptr-large-c4-stage9",
  "architectures": [
    "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "finetuning_task": "ner",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "B",
    "1": "I",
    "2": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "B": 0,
    "I": 1,
    "O": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.45.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}
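
The values in config.json are enough for a rough parameter count. The sketch below assumes a standard RoBERTa encoder layout (four attention projections and a two-layer feed-forward block per layer, each followed by a LayerNorm), no pooling layer, and a single linear classifier over the three B/I/O labels. The helper names are hypothetical, and the result is an estimate under those assumptions rather than an official figure.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Parameters of a LayerNorm: scale plus shift."""
    return 2 * size


def count_params(cfg, num_labels):
    h = cfg["hidden_size"]
    ff = cfg["intermediate_size"]
    # Word, position, and token-type embeddings, followed by a LayerNorm
    embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]
                  + cfg["type_vocab_size"]) * h + layer_norm(h)
    # Query, key, value, and output projections, followed by a LayerNorm
    attention = 4 * linear(h, h) + layer_norm(h)
    # Up and down projections of the feed-forward block, followed by a LayerNorm
    feed_forward = linear(h, ff) + linear(ff, h) + layer_norm(h)
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    classifier = linear(h, num_labels)
    return embeddings + encoder + classifier


cfg = {
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "num_hidden_layers": 24,
    "vocab_size": 50265,
    "max_position_embeddings": 514,
    "type_vocab_size": 1,
}
total = count_params(cfg, num_labels=3)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at roughly 0.35B parameters, which is the same order of magnitude as the 300M figure quoted in the description.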