Cuckoo-C4 Open-source Information Extraction Model - Small Size, Big Impact, Efficient Information Extraction

Cuckoo C4

Developed by KomeijiForce

Cuckoo is a small (300M-parameter) information extraction model that extracts information efficiently by mimicking the next-word prediction paradigm of large language models.

Large Language Model

Transformers

Open Source License: MIT #Information Extraction #Small Parameter Efficiency #Instruction Enhancement

Downloads 15

Release Time: 2/16/2025

Model Overview

The Cuckoo model performs information extraction through a next-word prediction mechanism: rather than generating tokens from a vocabulary, it tags the answer tokens inside the input context. It can use a wide range of text resources for self-enhancement and is particularly good at absorbing data curated for large language models.

Model Features

Next-word Prediction Paradigm

Adopts a prediction mechanism similar to large language models, extracting information by identifying target tokens in context

Efficient Data Utilization

Capable of absorbing various text resources for self-enhancement, including data optimized for large language models

Multi-version Adaptation

Offers four versions: Basic, Instruction-enhanced, Rainbow, and Super Rainbow, catering to different needs

Model Capabilities

Named Entity Recognition

Relation Extraction

Question Answering

Text Understanding

Knowledge Extraction

Use Cases

Information Extraction

Entity Recognition

Identify entities such as people, places, and organizations from text

Achieved 79.94 F1 score on CoNLL2003

Relation Extraction

Identify relationships between entities

Achieved 70.47 F1 score on CoNLL2004

Question Answering

Reading Comprehension

Answer questions based on text content

Achieved 86.57 F1 score on SQuAD

🚀 Cuckoo 🐦

This repository hosts the model from the paper Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest. Cuckoo is a compact (300M) information extraction (IE) model that mimics the next token prediction approach of large language models. Instead of retrieving from the vocabulary, it predicts the next tokens by tagging them within the given input context, as shown below:

[Figure: cuckoo]
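
To make the tagging idea concrete, here is a minimal sketch of how per-token labels can be turned into extracted spans. It assumes the B/I/O label ids listed in this repository's config.json (B = 0, I = 1, O = 2). The function name decode_spans and the hand-written tags are illustrative assumptions, not model output.

```python
# Toy illustration of extraction-by-tagging: every input token receives a
# label, and the answer spans are read off the label sequence.
B, I, O = 0, 1, 2  # label ids as listed in config.json (id2label)


def decode_spans(tokens, tags):
    """Join tokens into spans: B opens a span, I extends it, O closes it."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == B:
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == I and current:
            current.append(token)
        else:
            # O, or an I with no open span: close whatever is open.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hand-written tags for illustration only (hypothetical, not model output).
tokens = ["Joseph", "Haydn", "and", "Wolfgang", "Amadeus", "Mozart"]
tags = [B, I, O, B, I, I]
print(decode_spans(tokens, tags))  # ['Joseph Haydn', 'Wolfgang Amadeus Mozart']
```

Because the answer is tagged inside the input, the output can only ever be text that appears in the prompt, and an empty list is a valid result when nothing matches.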

Cuckoo stands out from previous IE pre-training methods as it can leverage any text resource for self-enhancement, especially by capitalizing on data curated for LLMs!

Currently, we are open-sourcing checkpoints of Cuckoos pre-trained on the following:

100M next tokens extraction (NTE) instances converted from C4. ([Cuckoo-C4](https://huggingface.co/KomeijiForce/Cuckoo-C4) 🐦)
Cuckoo-C4 + 2.6M next tokens extraction (NTE) instances converted from a supervised fine-tuning dataset, TuluV3. ([Cuckoo-C4-Instruct](https://huggingface.co/KomeijiForce/Cuckoo-C4-Instruct) 🐦🛠️)
Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD, DROP). ([Cuckoo-C4-Rainbow](https://huggingface.co/KomeijiForce/Cuckoo-C4-Rainbow) 🌈🐦🛠️)
Cuckoo-C4-Rainbow + Multiple NER Datasets, WizardLM Dataset, Multiple Choice QA Datasets, MMLU, SQuAD, DROP, MNLI, SNLI. ([Cuckoo-C4-Super-Rainbow](https://huggingface.co/KomeijiForce/Cuckoo-C4-Super-Rainbow) 🦸🌈🐦🛠️)
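
When switching between these checkpoints in code, a small lookup table keeps the repository IDs in one place. The sketch below is only a convenience: the short variant names and the helper resolve_checkpoint are assumptions made for illustration, while the repository IDs follow the links above.

```python
# Hypothetical short names mapped to the Hugging Face repository IDs linked above.
CHECKPOINTS = {
    "base": "KomeijiForce/Cuckoo-C4",
    "instruct": "KomeijiForce/Cuckoo-C4-Instruct",
    "rainbow": "KomeijiForce/Cuckoo-C4-Rainbow",
    "super-rainbow": "KomeijiForce/Cuckoo-C4-Super-Rainbow",
}


def resolve_checkpoint(variant: str) -> str:
    """Return the repository ID for a variant name, or raise a clear error."""
    try:
        return CHECKPOINTS[variant]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINTS))
        raise ValueError(f"Unknown variant {variant!r}; expected one of: {known}") from None


path = resolve_checkpoint("super-rainbow")
print(path)  # KomeijiForce/Cuckoo-C4-Super-Rainbow
```

The resulting path can be passed to from_pretrained exactly as in the usage examples.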

🚀 Quick Start

✨ Features

Cuckoo is a small-sized information extraction model that imitates the next token prediction paradigm of large language models.
It can use any text resource for self-enhancement, especially data curated for LLMs.
It provides high-performance results on various IE tasks.

📦 Installation

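The original README gives no installation steps. The usage examples below import the transformers, torch, and spacy packages and load the spaCy English model en_core_web_sm, so all of these must be available.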

💻 Usage Examples

Basic Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

nlp = spacy.load("en_core_web_sm")

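# Use the first GPU when available, otherwise fall back to CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"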
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)

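def next_tokens_extraction(text):

    def find_sequences(lst):
        # Collect (start, end) index pairs: label 0 ("B") opens a span
        # and each following label 1 ("I") extends it.
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end+1))
            else:
                i += 1
        return sequences

    # Pre-tokenize with spaCy so that words and punctuation are space-separated.
    text = " ".join([token.text for token in nlp(text)])

    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Inference only, so skip gradient tracking; convert label ids to a plain list.
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1).tolist()

    # Decode each tagged span of input ids back into text.
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]

    return predictions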

text = "Tom and Jack went to their trip in Paris."

for question in [
    "What are the people mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Which city does George live in?",
]:
    # Build the prompt in its own variable so that `text` is not overwritten between iterations.
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)

You will get results like:

What are the people mentioned here? ['Tom', 'Jack']
What is the city mentioned here? ['Paris']
Who goes with Tom together? ['Jack']
What do Tom and Jack go to Paris for? ['trip']
Which city does George live in? []
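
The prompts in the example above all follow one fixed layout: the context, then the question, then an open Assistant turn. A small helper makes that template explicit and avoids accidentally nesting prompts inside each other. This is a minimal sketch; the name build_prompt is an assumption made for illustration and is not part of the original code.

```python
def build_prompt(context: str, question: str) -> str:
    """Format a context/question pair in the layout used by the examples above."""
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"


context = "Tom and Jack went to their trip in Paris."
questions = [
    "What are the people mentioned here?",
    "What is the city mentioned here?",
]

# The context is never modified, so every prompt wraps the original sentence.
prompts = [build_prompt(context, question) for question in questions]
print(prompts[0])
```

Each resulting prompt can be passed directly to next_tokens_extraction.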

Advanced Usage

passage = f'''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)

You will get results like:

What are the people mentioned here? ['Ludwig van Beethoven', 'Joseph Haydn', 'Wolfgang Amadeus Mozart']
What is the job of Beethoven? ['composer and pianist']
How famous is Beethoven? ['one of the most revered figures in the history of Western music']
When did Beethoven's middle period showed an individual development? ['1802']

for obj in ["grass", "sea", "fire", "night"]:
    text = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(text)
    print(obj, predictions)

You will get results like:

grass ['green']
sea ['blue']
fire ['red']
night []
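
The multiple-choice example works because the candidate answers are placed in the prompt itself, so the model can tag one of them as the answer. The sketch below rebuilds that prompt from a list of choices. The helper name build_choice_prompt is hypothetical, and the trailing period after the last choice simply mirrors the example above.

```python
def build_choice_prompt(choices, question):
    """Place the choices in the prompt so one of them can be tagged as the answer."""
    # One choice per line, with a period after the last one as in the example above.
    choice_block = "\n".join(choices) + "."
    return (
        f"User:\n\nChoices:\n{choice_block}\n\n"
        f"Question: {question}\n\nAssistant:\n\nAnswer:"
    )


prompt = build_choice_prompt(
    ["red", "blue", "green"],
    "What is the color of the grass?",
)
print(prompt)
```

An empty result, as for "night" above, means that no choice was tagged.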

📚 Documentation

The repository contains the following file information:

Property	Details
Filename: special_tokens_map.json	{ "bos_token": { "content": "<s>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "cls_token": { "content": "<s>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "eos_token": { "content": "</s>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "mask_token": { "content": "<mask>", "lstrip": true, "normalized": false, "rstrip": false, "single_word": false }, "pad_token": { "content": "<pad>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "sep_token": { "content": "</s>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "unk_token": { "content": "<unk>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false } }
Filename: tokenizer_config.json	{ "add_prefix_space": true, "added_tokens_decoder": { "0": { "content": "<s>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false, "special": true }, "1": { "content": "<pad>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false, "special": true }, "2": { "content": "</s>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false, "special": true }, "3": { "content": "<unk>", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false, "special": true }, "50264": { "content": "<mask>", "lstrip": true, "normalized": false, "rstrip": false, "single_word": false, "special": true } }, "bos_token": "<s>", "clean_up_tokenization_spaces": false, "cls_token": "<s>", "eos_token": "</s>", "errors": "replace", "mask_token": "<mask>", "max_length": 512, "model_max_length": 512, "pad_token": "<pad>", "sep_token": "</s>", "stride": 0, "tokenizer_class": "RobertaTokenizer", "trim_offsets": true, "truncation_side": "right", "truncation_strategy": "longest_first", "unk_token": "<unk>" }
Filename: merges.txt	"Content of the file is larger than 50 KB, too long to display."
Filename: vocab.json	"Content of the file is larger than 50 KB, too long to display."
Filename: config.json	{ "_name_or_path": "models/ptr-large-c4-stage9", "architectures": [ "RobertaForTokenClassification" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "classifier_dropout": null, "eos_token_id": 2, "finetuning_task": "ner", "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "id2label": { "0": "B", "1": "I", "2": "O" }, "initializer_range": 0.02, "intermediate_size": 4096, "label2id": { "B": 0, "I": 1, "O": 2 }, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "roberta", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 1, "position_embedding_type": "absolute", "torch_dtype": "float32", "transformers_version": "4.45.2", "type_vocab_size": 1, "use_cache": true, "vocab_size": 50265 }
Filename: tokenizer.json	"Content of the file is larger than 50 KB, too long to display."
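
The values in config.json are enough for a rough size check. The sketch below is a back-of-the-envelope parameter estimate, assuming the standard RoBERTa encoder layout (token, position, and type embeddings; per layer, four attention projections, a two-layer feed-forward block, and two layer norms) plus a linear classifier over the three B/I/O labels. It is an estimate derived from the config, not a count read from the checkpoint, and the helper names are illustrative.

```python
# Values copied from config.json above.
config = {
    "vocab_size": 50265,
    "max_position_embeddings": 514,
    "type_vocab_size": 1,
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "num_hidden_layers": 24,
}
num_labels = 3  # B, I, O


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a layer norm."""
    return 2 * n


h = config["hidden_size"]
ff = config["intermediate_size"]

# Token, position, and type embeddings, followed by one layer norm.
embeddings = (
    config["vocab_size"] + config["max_position_embeddings"] + config["type_vocab_size"]
) * h + layer_norm(h)

# Query, key, value, and output projections; feed-forward block; two layer norms.
per_layer = 4 * linear(h, h) + layer_norm(h) + linear(h, ff) + linear(ff, h) + layer_norm(h)
encoder = per_layer * config["num_hidden_layers"]

classifier = linear(h, num_labels)
total = embeddings + encoder + classifier

print(f"encoder layers: {encoder / 1e6:.1f}M, total: {total / 1e6:.1f}M")
```

Under these assumptions the encoder layers account for roughly 302M parameters and the full model for roughly 354M, which is in line with the rounded "300M" figure quoted for Cuckoo.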

📄 License

The project is licensed under the MIT license.
