Cuckoo
This repository hosts the model from the paper "Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest". Cuckoo is a compact (300M) information extraction (IE) model that mimics the next-token prediction approach of large language models. Instead of retrieving the next tokens from the vocabulary, it predicts them by tagging them within the given input context.
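To make the tagging idea concrete, here is a toy sketch of how per-token tags turn into extracted spans. The tokens and tags are hand-written for illustration and are not model output; the label ids (B = 0, I = 1, O = 2) follow the config.json listed later in this README, and `decode_spans` is a hypothetical helper name.

```python
# Toy illustration only: the tags below are hand-written, not model output.
B, I, O = 0, 1, 2  # label ids as listed in this repository's config.json


def decode_spans(tokens, tags):
    """Collect each B-tagged token plus the I-tagged tokens that follow it."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == B:
            # A new span starts; close the previous one if it is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == I and current:
            # Continue the span that is currently open.
            current.append(token)
        else:
            # O tag (or a stray I with no open span): close any open span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["Tom", "and", "Jack", "went", "to", "New", "York", "."]
tags = [O, O, O, O, O, B, I, O]
print(decode_spans(tokens, tags))
```

The answer is always a span copied from the input, which is why a question with no answer in the context yields an empty list.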
Cuckoo differs from previous IE pre-training methods in that it can use any text resource to improve itself, in particular data curated for LLMs.
We currently open-source Cuckoo checkpoints pre-trained on the following data:
- 100M next tokens extraction (NTE) instances converted from C4. ([Cuckoo-C4](https://huggingface.co/KomeijiForce/Cuckoo-C4))
- Cuckoo-C4 + 2.6M NTE instances converted from a supervised fine-tuning dataset, TuluV3. ([Cuckoo-C4-Instruct](https://huggingface.co/KomeijiForce/Cuckoo-C4-Instruct))
- Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD, DROP). ([Cuckoo-C4-Rainbow](https://huggingface.co/KomeijiForce/Cuckoo-C4-Rainbow))
- Cuckoo-C4-Rainbow + multiple NER datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. ([Cuckoo-C4-Super-Rainbow](https://huggingface.co/KomeijiForce/Cuckoo-C4-Super-Rainbow))
Quick Start
Features
- Cuckoo is a small information extraction model that imitates the next-token prediction paradigm of large language models.
- It can use any text resource for self-enhancement, especially data curated for LLMs.
- It targets a range of IE tasks, including entity extraction, extractive question answering, and multiple-choice questions (see the usage examples).
Installation
The original README lists no installation steps. The usage examples below import transformers, torch, and spacy and load the spaCy en_core_web_sm pipeline, so these must be installed first.
Usage Examples
Basic Usage
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

# spaCy is used only to pre-tokenize the input text.
nlp = spacy.load("en_core_web_sm")

# Use the first GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)

def next_tokens_extraction(text):

    def find_sequences(lst):
        # Tag ids follow config.json: 0 = B (span start), 1 = I (span continuation), 2 = O (outside).
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end + 1))
            else:
                i += 1
        return sequences

    # Re-join the spaCy tokens with single spaces before sub-word tokenization.
    text = " ".join([token.text for token in nlp(text)])
    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Inference only, so gradients are not needed.
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1).tolist()
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]
    return predictions

text = "Tom and Jack went to their trip in Paris."

for question in [
    "What are the people mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Which city does George live in?",
]:
    # Build a fresh prompt per question; reassigning `text` here would nest earlier prompts inside later ones.
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)
You will get results like:
What are the people mentioned here? ['Tom', 'Jack']
What is the city mentioned here? ['Paris']
Who goes with Tom together? ['Jack']
What do Tom and Jack go to Paris for? ['trip']
Which city does George live in? []
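The prompt layout used above can be wrapped in a small helper so that the context string is never modified between questions. This is a minimal sketch: `build_prompt`, `ask_all`, and `echo_extract` are hypothetical names, not part of the Cuckoo repository. The extractor is passed in as a callable so the sketch runs without loading the model; in practice you would pass `next_tokens_extraction`.

```python
from typing import Callable, Dict, List


def build_prompt(context: str, question: str) -> str:
    """Format a context/question pair in the layout used by the basic example."""
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"


def ask_all(
    context: str,
    questions: List[str],
    extract: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Run every question against the same, unmodified context."""
    return {q: extract(build_prompt(context, q)) for q in questions}


def echo_extract(prompt: str) -> List[str]:
    # Stand-in for next_tokens_extraction: returns the question line of the prompt.
    return [line for line in prompt.split("\n") if line.startswith("Question: ")]


answers = ask_all(
    "Tom and Jack went to their trip in Paris.",
    ["What is the city mentioned here?"],
    echo_extract,
)
print(answers)
```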
Advanced Usage
passage = '''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    prompt = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)
You will get results like:
What are the people mentioned here? ['Ludwig van Beethoven', 'Joseph Haydn', 'Wolfgang Amadeus Mozart']
What is the job of Beethoven? ['composer and pianist']
How famous is Beethoven? ['one of the most revered figures in the history of Western music']
When did Beethoven's middle period showed an individual development? ['1802']
for obj in ["grass", "sea", "fire", "night"]:
    prompt = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(prompt)
    print(obj, predictions)
You will get results like:
grass ['green']
sea ['blue']
fire ['red']
night []
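The multiple-choice prompt above can be built from a list of choices, and the extracted spans can be checked against that list before use. The sketch below reproduces the prompt layout of the example, including the period after the last choice. `build_choice_prompt` and `pick_choice` are hypothetical helper names. Stripping a trailing period from predictions is an assumption, made in case the model tags the period along with the last choice.

```python
from typing import List, Optional


def build_choice_prompt(choices: List[str], question: str) -> str:
    """List the choices one per line, ending the last one with a period."""
    choice_block = "\n".join(choices) + "."
    return (
        f"User:\n\nChoices:\n{choice_block}\n\n"
        f"Question: {question}\n\nAssistant:\n\nAnswer:"
    )


def pick_choice(predictions: List[str], choices: List[str]) -> Optional[str]:
    """Return the first prediction that matches a choice, or None if none does."""
    allowed = {c.lower(): c for c in choices}
    for prediction in predictions:
        # Assumption: tolerate surrounding whitespace and a trailing period.
        key = prediction.strip().strip(".").lower()
        if key in allowed:
            return allowed[key]
    return None


choices = ["red", "blue", "green"]
prompt = build_choice_prompt(choices, "What is the color of the grass?")
print(prompt)
print(pick_choice(["green"], choices))
print(pick_choice([], choices))
```

Returning `None` for an empty prediction list keeps the "no suitable choice" case, as with `night` above, explicit for the caller.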
Documentation
The repository contains the following file information:
File | Contents |
---|---|
special_tokens_map.json | { "bos_token": { "content": " … (truncated) |
tokenizer_config.json | { "add_prefix_space": true, "added_tokens_decoder": { "0": { "content": " … (truncated) |
merges.txt | Larger than 50 KB; too long to display. |
vocab.json | Larger than 50 KB; too long to display. |
config.json | { "_name_or_path": "models/ptr-large-c4-stage9", "architectures": [ "RobertaForTokenClassification" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "classifier_dropout": null, "eos_token_id": 2, "finetuning_task": "ner", "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "id2label": { "0": "B", "1": "I", "2": "O" }, "initializer_range": 0.02, "intermediate_size": 4096, "label2id": { "B": 0, "I": 1, "O": 2 }, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "roberta", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 1, "position_embedding_type": "absolute", "torch_dtype": "float32", "transformers_version": "4.45.2", "type_vocab_size": 1, "use_cache": true, "vocab_size": 50265 } |
tokenizer.json | Larger than 50 KB; too long to display. |
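The `id2label` and `label2id` entries in config.json are what give the integer tag ids in the usage code their meaning. As a small sketch, the excerpt below copies only those two fields from the config shown in the table and maps a hand-written sequence of tag ids to label names. `tag_names` is a hypothetical helper and the example ids are not model output.

```python
import json

# Excerpt copied from the config.json shown in the table; all other fields omitted.
config_excerpt = """
{
  "id2label": {"0": "B", "1": "I", "2": "O"},
  "label2id": {"B": 0, "I": 1, "O": 2}
}
"""

config = json.loads(config_excerpt)

# JSON object keys are strings, so convert them to ints to index by tag id.
id2label = {int(k): v for k, v in config["id2label"].items()}


def tag_names(tag_ids):
    """Translate integer tag ids into their B/I/O label names."""
    return [id2label[i] for i in tag_ids]


print(tag_names([2, 0, 1, 2]))  # ['O', 'B', 'I', 'O']
```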
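Technical Details
The original README gives no further technical details. According to the config.json listed above, the checkpoint is a RobertaForTokenClassification model with 24 hidden layers, a hidden size of 1024, 16 attention heads, and 514 maximum position embeddings; each token is tagged B (0), I (1), or O (2).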
License
The project is licensed under the MIT license.

