Cuckoo-C4: Open-Source Information Extraction Model - Small in Size, Efficient at Information Extraction


Cuckoo C4

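Developed by KomeijiForce

Cuckoo is a small (300M-parameter) information extraction model that imitates the next-token prediction paradigm of large language models to extract information efficiently.

Large Language Model

Transformers

Open-source license: MIT #InformationExtraction #SmallParameterEfficiency #InstructionAugmentation

Downloads: 15

Release date: 2/16/2025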

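Model Overview

Cuckoo performs information extraction with a next-token prediction mechanism. It can use a wide range of text resources to enhance itself and is particularly good at absorbing data curated for large language models.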

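Model Features

Next-token prediction paradigm

Uses a prediction mechanism similar to that of large language models, extracting information by tagging the target tokens in the context

Efficient use of data

Can absorb a variety of text resources, including data curated for large language models, to enhance itself

Multiple versions

Available in four versions (Base, Instruct, Rainbow, and Super Rainbow) to suit different needs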

Model Capabilities

Named entity recognition

Relation extraction

Question answering

Text understanding

Knowledge extraction

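Use Cases

Information Extraction

Named entity recognition

Identifies entities such as people, locations, and organizations in text

Achieves an F1 score of 79.94 on CoNLL2003 (Rainbow Cuckoo)

Relation extraction

Identifies relations between entities

Achieves an F1 score of 70.47 on CoNLL2004 (Rainbow Cuckoo)

Question Answering

Reading comprehension

Answers questions based on the content of a text

Achieves an F1 score of 86.57 on SQuAD (Rainbow Cuckoo)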

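🚀 Cuckoo 🐦 [Github]

This repository contains the model from the paper Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest.

Cuckoo is a small (300M) information extraction (IE) model that imitates the next-token prediction paradigm of large language models. Instead of retrieving the next tokens from the vocabulary, Cuckoo predicts them by tagging tokens in the given input context, as shown in the figure below.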

[Figure: cuckoo]
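To make the tagging idea concrete, here is a toy sketch. It is not the authors' data pipeline: the helper name `bio_tags_for_span`, the whitespace tokenization, and the example sentence are assumptions made for illustration. It marks a target span inside the context with B (begin) and I (inside) tags and labels everything else O, which is the kind of per-token labeling a tagger can predict in place of a vocabulary lookup.

```python
def bio_tags_for_span(tokens, span_tokens):
    """Tag every occurrence of span_tokens in tokens with B/I; all else is O."""
    tags = ["O"] * len(tokens)
    n = len(span_tokens)
    if n == 0:
        # Nothing to extract: every token stays O.
        return tags
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == span_tokens:
            tags[i] = "B"
            for j in range(i + 1, i + n):
                tags[j] = "I"
    return tags


# Hypothetical example: whitespace tokens stand in for real subword tokens.
context = "Tom and Jack went to New York .".split()
tags = bio_tags_for_span(context, ["New", "York"])
for token, tag in zip(context, tags):
    print(f"{token}\t{tag}")
```

An all-O tag sequence corresponds to the case where there is nothing in the context to extract.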

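Cuckoo differs substantially from previous IE pre-training: it can use any text resource to enhance itself, and in particular it can take a free ride on data curated for LLMs!

We currently open-source checkpoints of Cuckoo pre-trained on the following data:

100M next tokens extraction (NTE) instances converted from C4. (Cuckoo-C4 🐦)
Cuckoo-C4 + 2.6M next tokens extraction (NTE) instances converted from TuluV3, a supervised fine-tuning dataset. (Cuckoo-C4-Instruct 🐦🛠️)
Cuckoo-C4-Instruct + MultiNERD, MetaIE, NuNER, MRQA (excluding SQuAD and DROP). (Cuckoo-C4-Rainbow 🌈🐦🛠️)
Cuckoo-C4-Rainbow + multiple NER datasets, the WizardLM dataset, multiple-choice QA datasets, MMLU, SQuAD, DROP, MNLI, SNLI. (Cuckoo-C4-Super-Rainbow 🦸🌈🐦🛠️)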
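When switching between the four checkpoints, a small lookup table can keep the model IDs in one place. The sketch below is hypothetical: only `KomeijiForce/Cuckoo-C4-Super-Rainbow` appears in the usage code on this page, and the other three IDs are assumed to follow the same naming pattern, so verify them before relying on them. The names `CUCKOO_CHECKPOINTS` and `checkpoint_path` are made up for this example.

```python
# Hypothetical lookup table. Only the Super-Rainbow ID is confirmed by the
# usage code on this page; the other three are assumed from the naming pattern.
CUCKOO_CHECKPOINTS = {
    "base": "KomeijiForce/Cuckoo-C4",
    "instruct": "KomeijiForce/Cuckoo-C4-Instruct",
    "rainbow": "KomeijiForce/Cuckoo-C4-Rainbow",
    "super-rainbow": "KomeijiForce/Cuckoo-C4-Super-Rainbow",
}


def checkpoint_path(variant):
    """Return the model ID for a variant name, or raise a readable error."""
    try:
        return CUCKOO_CHECKPOINTS[variant]
    except KeyError:
        known = ", ".join(sorted(CUCKOO_CHECKPOINTS))
        raise ValueError(f"Unknown variant {variant!r}; expected one of: {known}") from None


print(checkpoint_path("super-rainbow"))
```

The returned string can be passed to `from_pretrained` in place of a hard-coded path.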

✨ Key Features

🚀 Performance Demonstration

Try Cuckoo to see how efficiently it adapts to all kinds of IE tasks!

Model	CoNLL2003	BioNLP2004	MIT-Restaurant	MIT-Movie	Avg. (NER)	CoNLL2004	ADE	Avg. (RE)	SQuAD	SQuAD-V2	DROP	Avg. (QA)
OPT-C4-TuluV3	50.24	39.76	58.91	56.33	50.56	47.14	45.66	46.40	39.80	53.81	31.00	41.54
RoBERTa	33.75	32.91	62.15	58.32	46.80	34.16	2.15	18.15	31.86	48.55	9.16	29.86
MRQA	72.45	55.93	68.68	66.26	65.83	66.23	67.44	66.84	80.07	66.22	54.46	66.92
MultiNERD	66.78	54.62	64.16	66.30	60.59	57.52	45.10	51.31	42.85	50.99	30.12	41.32
NuNER	74.15	56.36	68.57	64.88	65.99	65.12	63.71	64.42	61.60	52.67	37.37	50.55
MetaIE	71.33	55.63	70.08	65.23	65.57	64.81	64.40	64.61	74.59	62.54	30.73	55.95
Cuckoo 🐦🛠️	73.60	57.00	67.63	67.12	66.34	69.57	71.70	70.63	77.47	64.06	54.25	65.26
└─ Pre-training only 🐦	72.46	55.87	66.87	67.23	65.61	68.14	69.39	68.77	75.64	63.36	52.81	63.94
└─ Post-training only	72.80	56.10	66.02	67.10	65.51	68.66	69.75	69.21	77.05	62.39	54.80	64.75
Rainbow Cuckoo 🌈🐦🛠️	79.94	58.39	70.30	67.00	68.91	70.47	76.05	73.26	86.57	69.41	64.64	73.54
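As a reading aid for the three average columns, the short check below recomputes them for the Rainbow Cuckoo row. It assumes the first average covers the four NER datasets (CoNLL2003, BioNLP2004, MIT-Restaurant, MIT-Movie), the second covers the two relation extraction datasets (CoNLL2004, ADE), and the third covers the three QA datasets (SQuAD, SQuAD-V2, DROP). The dictionary layout and variable names are illustrative only.

```python
# Scores copied from the Rainbow Cuckoo row of the table above.
# The grouping into task families is an assumption about the column layout.
rainbow_cuckoo = {
    "NER": [79.94, 58.39, 70.30, 67.00],  # CoNLL2003, BioNLP2004, MIT-Restaurant, MIT-Movie
    "RE": [70.47, 76.05],                 # CoNLL2004, ADE
    "QA": [86.57, 69.41, 64.64],          # SQuAD, SQuAD-V2, DROP
}

# Unweighted mean per task family, rounded to two decimals like the table.
averages = {
    task: round(sum(scores) / len(scores), 2)
    for task, scores in rainbow_cuckoo.items()
}
print(averages)  # {'NER': 68.91, 'RE': 73.26, 'QA': 73.54}
```

Under that grouping, the recomputed values match the averages reported for this row.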

📦 Installation

This project provides pre-trained models. Follow the steps below to use them.

First, install the required libraries.

pip install transformers torch spacy
python -m spacy download en_core_web_sm

💻 Usage Examples

Basic Usage

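from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import spacy

# spaCy is used only to pre-tokenize the input text.
nlp = spacy.load("en_core_web_sm")

# Use the first GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/Cuckoo-C4-Super-Rainbow"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)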

Advanced Usage

Defining the next tokens extraction function

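def next_tokens_extraction(text):

    def find_sequences(lst):
        # Group each B tag (0) and the I tags (1) that follow it into a
        # half-open (start, end) token span; O tags (2) are skipped.
        sequences = []
        i = 0
        while i < len(lst):
            if lst[i] == 0:
                start = i
                end = i
                i += 1
                while i < len(lst) and lst[i] == 1:
                    end = i
                    i += 1
                sequences.append((start, end + 1))
            else:
                i += 1
        return sequences

    # Pre-tokenize with spaCy and re-join the tokens with single spaces.
    text = " ".join([token.text for token in nlp(text)])

    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Inference only, so gradients are not needed.
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1).tolist()

    # Decode each predicted span back into text.
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]

    return predictions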
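The span grouping inside `next_tokens_extraction` can be checked without loading the model. The sketch below reimplements the same logic as a standalone function, assuming the label IDs 0 = B, 1 = I, 2 = O listed under `id2label` in config.json. The function name `decode_bio_spans`, the tokens, and the tag sequence are made up for illustration.

```python
B, I, O = 0, 1, 2  # label IDs as listed under id2label in config.json


def decode_bio_spans(tag_ids):
    """Return half-open (start, end) spans: one B tag plus the I tags after it."""
    spans = []
    i = 0
    while i < len(tag_ids):
        if tag_ids[i] == B:
            start = i
            i += 1
            # Extend the span over the run of I tags that follows the B tag.
            while i < len(tag_ids) and tag_ids[i] == I:
                i += 1
            spans.append((start, i))
        else:
            # O tags and I tags without a preceding B do not start a span.
            i += 1
    return spans


# Hypothetical tagger output for a whitespace-tokenized sentence.
tokens = ["Tom", "and", "Jack", "went", "to", "New", "York", "."]
tag_ids = [B, O, B, O, O, B, I, O]
spans = decode_bio_spans(tag_ids)
print(spans)                                        # [(0, 1), (2, 3), (5, 7)]
print([" ".join(tokens[s:e]) for s, e in spans])    # ['Tom', 'Jack', 'New York']
```

Note that an I tag with no B tag before it is ignored, so a sequence such as `[O, I, I]` yields no spans.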

Calling the extraction function

Case 1: Basic entity and relation understanding

text = "Tom and Jack went to their trip in Paris."

for question in [
    "What are the people mentioned here?",
    "What is the city mentioned here?",
    "Who goes with Tom together?",
    "What do Tom and Jack go to Paris for?",
    "Which city does George live in?",
]:
    # Build the prompt in a separate variable so that `text` is not
    # overwritten and nested into the next iteration's prompt.
    prompt = f"User:\n\n{text}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(prompt)
    print(question, predictions)

Running this produces the following output:

What are the people mentioned here? ['Tom', 'Jack']
What is the city mentioned here? ['Paris']
Who goes with Tom together? ['Jack']
What do Tom and Jack go to Paris for? ['trip']
Which city does George live in? []

Here, [] indicates that Cuckoo determined there are no next tokens to extract.

Case 2: Longer context
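The prompt template from Case 1 can be wrapped in a small helper so that the context is never modified between questions. This is only a convenience sketch: `build_prompt` is a hypothetical name, and the template string is copied from the loop above.

```python
def build_prompt(context, question):
    """Format a context and a question with the template used in Case 1."""
    return f"User:\n\n{context}\n\nQuestion: {question}\n\nAssistant:"


context = "Tom and Jack went to their trip in Paris."
questions = [
    "What is the city mentioned here?",
    "Which city does George live in?",
]

# Each prompt is built from the unchanged context, so prompts never nest.
prompts = [build_prompt(context, q) for q in questions]
for prompt in prompts:
    print(repr(prompt))
```

Each resulting string can then be passed to `next_tokens_extraction`.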

passage = f'''Ludwig van Beethoven (17 December 1770 – 26 March 1827) was a German composer and pianist. He is one of the most revered figures in the history of Western music; his works rank among the most performed of the classical music repertoire and span the transition from the Classical period to the Romantic era in classical music. His early period, during which he forged his craft, is typically considered to have lasted until 1802. From 1802 to around 1812, his middle period showed an individual development from the styles of Joseph Haydn and Wolfgang Amadeus Mozart, and is sometimes characterised as heroic. During this time, Beethoven began to grow increasingly deaf. In his late period, from 1812 to 1827, he extended his innovations in musical form and expression.'''

for question in [
    "What are the people mentioned here?",
    "What is the job of Beethoven?",
    "How famous is Beethoven?",
    "When did Beethoven's middle period showed an individual development?",
]:
    text = f"User:\n\n{passage}\n\nQuestion: {question}\n\nAssistant:"
    predictions = next_tokens_extraction(text)
    print(question, predictions)

Running this produces the following output:

What are the people mentioned here? ['Ludwig van Beethoven', 'Joseph Haydn', 'Wolfgang Amadeus Mozart']
What is the job of Beethoven? ['composer and pianist']
How famous is Beethoven? ['one of the most revered figures in the history of Western music']
When did Beethoven's middle period showed an individual development? ['1802']

Case 3: Knowledge quiz

for obj in ["grass", "sea", "fire", "night"]:
    text = f"User:\n\nChoices:\nred\nblue\ngreen.\n\nQuestion: What is the color of the {obj}?\n\nAssistant:\n\nAnswer:"
    predictions = next_tokens_extraction(text)
    print(obj, predictions)

Running this produces the following output:

grass ['green']
sea ['blue']
fire ['red']
night []

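This shows that Cuckoo does not simply extract any plausible span; it has the knowledge needed to understand the context. For "night", none of the listed colors fits, so it extracts nothing.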
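Because Cuckoo extracts its answer from the input context, the candidate answers have to appear in the prompt itself, which is why Case 3 lists the choices before the question. The helper below is a hypothetical sketch (`build_choice_prompt` is not part of the repository) that reproduces the Case 3 template, including the period after the last choice.

```python
def build_choice_prompt(choices, question):
    """Format candidate answers and a question with the template used in Case 3."""
    choice_block = "\n".join(choices)
    return (
        f"User:\n\nChoices:\n{choice_block}.\n\n"
        f"Question: {question}\n\nAssistant:\n\nAnswer:"
    )


prompt = build_choice_prompt(
    ["red", "blue", "green"],
    "What is the color of the grass?",
)
print(prompt)
```

The resulting string matches the one built inline in the Case 3 loop for `obj = "grass"` and can be passed to `next_tokens_extraction` in the same way.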

📚 Documentation

File Information

The repository contains the following files.

File name	Contents
special_tokens_map.json	{ "bos_token": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "cls_token": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "eos_token": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "mask_token": { "content": "", "lstrip": true, "normalized": false, "rstrip": false, "single_word": false }, "pad_token": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "sep_token": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false }, "unk_token": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false } }
tokenizer_config.json	{ "add_prefix_space": true, "added_tokens_decoder": { "0": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false, "special": true }, "1": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false, "special": true }, "2": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false, "special": true }, "3": { "content": "", "lstrip": false, "normalized": true, "rstrip": false, "single_word": false, "special": true }, "50264": { "content": "", "lstrip": true, "normalized": false, "rstrip": false, "single_word": false, "special": true } }, "bos_token": "", "clean_up_tokenization_spaces": false, "cls_token": "", "eos_token": "", "errors": "replace", "mask_token": "", "max_length": 512, "model_max_length": 512, "pad_token": "", "sep_token": "", "stride": 0, "tokenizer_class": "RobertaTokenizer", "trim_offsets": true, "truncation_side": "right", "truncation_strategy": "longest_first", "unk_token": "" }
merges.txt	"The file content exceeds 50 KB and is too long to display."
vocab.json	"The file content exceeds 50 KB and is too long to display."
config.json	{ "_name_or_path": "models/ptr-large-c4-stage9", "architectures": [ "RobertaForTokenClassification" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "classifier_dropout": null, "eos_token_id": 2, "finetuning_task": "ner", "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "id2label": { "0": "B", "1": "I", "2": "O" }, "initializer_range": 0.02, "intermediate_size": 4096, "label2id": { "B": 0, "I": 1, "O": 2 }, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "roberta", "num_attention_heads": 16, "num_hidden_layers": 24, "pad_token_id": 1, "position_embedding_type": "absolute", "torch_dtype": "float32", "transformers_version": "4.45.2", "type_vocab_size": 1, "use_cache": true, "vocab_size": 50265 }
tokenizer.json	"The file content exceeds 50 KB and is too long to display."