NuExtract-2.0-8B open-source multimodal model: supports image and text input for free, multilingual information extraction

NuExtract 2.0 8B

Developed by NuMind

NuExtract 2.0 is a family of multimodal models trained for structured information extraction. It accepts text and image input and handles multiple languages.

Multimodal fusion

Transformers

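Open-source license: MIT #multimodal-information-extraction #structured-data-generation #multilingual-support

Downloads: 328

Released: 5/6/2025

Model overview

A structured information extraction model fine-tuned from Qwen2.5-VL-7B-Instruct. It extracts structured data in a specified format from text or images.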

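Model features

Multimodal support

Accepts both text and image input, so structured information can be extracted from several kinds of data source.

Template-driven

The output structure is defined with a JSON template, which adapts flexibly to different extraction needs.

In-context learning

Providing examples (in-context learning) improves extraction accuracy in complex scenarios.

Type system

Supports a rich set of data types (string, number, date, enum, and others).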

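Model capabilities

Text information extraction

Image content analysis

Multilingual processing

Structured data generation

Automatic template generation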

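Use cases

Document processing

Contract information extraction

Extracts key clauses, dates, and signatory information from legal contracts.

Outputs structured JSON data.

Invoice recognition

Extracts vendor, amount, date, and similar fields from scanned invoices.

Automatically generates data that financial systems can read.

Retail

Product label recognition

Extracts price, specifications, and similar information from product images.

Generates standardized product database entries.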

🚀 NuExtract 2.0 8B by NuMind

NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual.

API / Platform | Blog | Discord

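✨ Key features

A model family specialized for structured information extraction tasks.
Supports multimodal input (text and images).
Multilingual, so information can be extracted from documents in many languages.

📦 Installation

This section shows how to load NuExtract with the transformers library.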

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "numind/NuExtract-2.0-2B"
# model_name = "numind/NuExtract-2.0-8B"

model = AutoModelForVision2Seq.from_pretrained(model_name, 
                                               trust_remote_code=True, 
                                               torch_dtype=torch.bfloat16,
                                               attn_implementation="flash_attention_2",
                                               device_map="auto")
processor = AutoProcessor.from_pretrained(model_name, 
                                          trust_remote_code=True, 
                                          padding_side='left',
                                          use_fast=True)

# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_name, min_pixels=min_pixels, max_pixels=max_pixels)
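
The usage examples in this document call process_all_vision_info, but its definition does not appear here. The following is a minimal sketch of such a helper, not the authors' implementation. It assumes the qwen-vl-utils package is installed and that its fetch_image function is used to load each image entry. The helper name collect_image_parts is hypothetical. Images are gathered in prompt order, with in-context example images before the images of the main input.

```python
def _image_parts(message_list, examples):
    """Yield image dicts for one prompt: example images first, then message images."""
    for example in examples or []:
        example_input = example.get("input")
        if isinstance(example_input, dict) and example_input.get("type") == "image":
            yield example_input
    for message in message_list:
        content = message.get("content")
        if isinstance(content, list):  # plain-string content carries no images
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image":
                    yield part


def collect_image_parts(messages, examples=None):
    """Flatten image dicts in prompt order for one conversation or a batch (list of lists)."""
    is_batch = bool(messages) and isinstance(messages[0], list)
    if not is_batch:
        return list(_image_parts(messages, examples))
    per_item_examples = examples or [None] * len(messages)
    parts = []
    for message_list, item_examples in zip(messages, per_item_examples):
        parts.extend(_image_parts(message_list, item_examples))
    return parts


def process_all_vision_info(messages, examples=None):
    """Load every referenced image; return None when the prompt is text-only."""
    # Assumption: qwen-vl-utils is installed. Imported lazily so that
    # collect_image_parts can be used without it.
    from qwen_vl_utils import fetch_image

    images = [fetch_image(part) for part in collect_image_parts(messages, examples)]
    return images or None
```

Returning None for text-only prompts lets the same processor call be used for both text and image inputs.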

💻 Usage examples

Basic usage

A basic example that extracts names from a text document.

template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."

# prepare the user message content
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template, # template is specified here
    tokenize=False,
    add_generation_prompt=True,
)

print(text)
"""<|im_start|>user
# Template:
{"names": ["string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|> 
<|im_start|>assistant"""

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# we choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)
# ['{"names": ["John", "Mary", "James"]}']
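
The decoded output is a JSON string, so downstream code usually needs to parse it. A minimal sketch follows; the helper name parse_extraction is hypothetical, and returning None for malformed output is an assumption about how a caller might want to handle failures.

```python
import json


def parse_extraction(raw):
    """Parse one decoded model output; return None if it is not valid JSON."""
    try:
        return json.loads(raw.strip())
    except json.JSONDecodeError:
        return None


# Same value as the output printed in the basic example
output_text = ['{"names": ["John", "Mary", "James"]}']
result = parse_extraction(output_text[0])
print(result["names"])
# ['John', 'Mary', 'James']
```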

Advanced usage

Using in-context examples

template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
    }
]

messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    examples=examples, # examples provided here
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages, examples)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# we choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']

Using image inputs

template = """{"store": "verbatim-string"}"""
document = {"type": "image", "image": "file://1.jpg"}

messages = [{"role": "user", "content": [document]}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"store": "Trader Joe\'s"}']

Using batch inference

inputs = [
    # image input with no ICL examples
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
    },
    # image input with 1 ICL example
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": {"type": "image", "image": "file://1.jpg"},
                "output": """{"store_name": "Trader Joe's"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
    },
    # text input with ICL example
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"]}"""
            }
        ],
    },
]

# messages should be a list of lists for batch processing
messages = [
    [
        {
            "role": "user",
            "content": [x['document']],
        }
    ]
    for x in inputs
]

# apply chat template to each example individually
texts = [
    processor.tokenizer.apply_chat_template(
        messages[i],  # Now this is a list containing one message
        template=x['template'],
        examples=x.get('examples', None),
        tokenize=False, 
        add_generation_prompt=True)
    for i, x in enumerate(inputs)
]

image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Batch Inference
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for y in output_texts:
    print(y)
# {"store_name": "WAL-MART"}
# {"store_name": "Walmart"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"]}

Using template generation

# Example: converting XML into a NuExtract template
xml_template = """<SportResult>
    <Date></Date>
    <Sport></Sport>
    <Venue></Venue>
    <HomeTeam></HomeTeam>
    <AwayTeam></AwayTeam>
    <HomeScore></HomeScore>
    <AwayScore></AwayScore>
    <TopScorer></TopScorer>
</SportResult>"""

messages = [
        {
            "role": "user",
            "content": [{"type": "text", "text": xml_template}],
        }
    ]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
# {
#     "Date": "date-time",
#     "Sport": "verbatim-string",
#     "Venue": "verbatim-string",
#     "HomeTeam": "verbatim-string",
#     "AwayTeam": "verbatim-string",
#     "HomeScore": "integer",
#     "AwayScore": "integer",
#     "TopScorer": "verbatim-string"
# }

# Example: generating a template from a natural-language description
description = "I would like to extract important details from the contract."

messages = [
        {
            "role": "user",
            "content": [{"type": "text", "text": description}],
        }
    ]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
# {
#     "Contract": {
#         "Title": "verbatim-string",
#         "Description": "verbatim-string",
#         "Terms": [
#             {
#                 "Term": "verbatim-string",
#                 "Description": "verbatim-string"
#             }
#         ],
#         "Date": "date-time",
#         "Signatory": "verbatim-string"
#     }
# }

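📚 Documentation

Model overview

To use the model, provide an input text or image together with a JSON template describing the information to extract. The template must be a JSON object that specifies field names and their expected types.

The supported types are:

verbatim-string - instructs the model to extract text that appears verbatim in the input.
string - a generic string field that may involve paraphrasing or abstraction.
integer - a whole number.
number - a whole or decimal number.
date-time - a date in ISO format.
Array of any of the above types (e.g. ["string"])
enum - a choice from a set of possible answers (represented in the template as an array of options, e.g. ["yes", "no", "maybe"])
multi-label - an enum that can have multiple answers (represented in the template as a double-wrapped array, e.g. [["A", "B", "C"]])

If the model cannot identify relevant information for a field, it returns null, or [] for arrays and multi-labels.

The following is an example template: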

{
  "first_name": "verbatim-string",
  "last_name": "verbatim-string",
  "description": "string",
  "age": "integer",
  "gpa": "number",
  "birth_date": "date-time",
  "nationality": ["France", "England", "Japan", "USA", "China"],
  "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}

An example output:

{
  "first_name": "Susan",
  "last_name": "Smith",
  "description": "A student studying computer science.",
  "age": 20,
  "gpa": 3.7,
  "birth_date": "2005-03-01",
  "nationality": "England",
  "languages_spoken": ["English", "French"]
}
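
The type rules above can be checked mechanically after extraction. The sketch below is illustrative only and is not part of NuExtract. The function name matches is hypothetical, and treating a one-element array holding a type name as "array of that type" (rather than a one-option enum) is an assumption made to resolve that ambiguity. ISO dates are checked with datetime.fromisoformat, which may be stricter or looser than what the model actually emits.

```python
from datetime import datetime

SCALAR_TYPES = {"verbatim-string", "string", "integer", "number", "date-time"}


def _is_iso_date(value):
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def matches(spec, value):
    """Return True if `value` is consistent with the template fragment `spec`."""
    if isinstance(spec, dict):  # nested object: check every templated field
        return isinstance(value, dict) and all(
            matches(sub, value.get(key)) for key, sub in spec.items()
        )
    if isinstance(spec, list):
        if len(spec) == 1 and isinstance(spec[0], list):  # multi-label
            return isinstance(value, list) and all(v in spec[0] for v in value)
        if len(spec) == 1 and (isinstance(spec[0], dict) or spec[0] in SCALAR_TYPES):
            # array of a scalar type or of objects
            return isinstance(value, list) and all(matches(spec[0], v) for v in value)
        return value is None or value in spec  # enum
    if value is None:  # a scalar field the model could not fill
        return True
    if spec in ("verbatim-string", "string"):
        return isinstance(value, str)
    if spec == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if spec == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if spec == "date-time":
        return isinstance(value, str) and _is_iso_date(value)
    return False


template = {
    "first_name": "verbatim-string",
    "age": "integer",
    "birth_date": "date-time",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]],
}
output = {
    "first_name": "Susan",
    "age": 20,
    "birth_date": "2005-03-01",
    "nationality": "England",
    "languages_spoken": ["English", "French"],
}
print(matches(template, output))
# True
```

A check like this is useful as a guard before writing results to a database, since it catches outputs such as an age returned as the string "20".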

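🔧 Technical details

Benchmark

Performance was measured on a collection of about 1,000 diverse extraction examples containing both text and image inputs.

Fine-tuning

A fine-tuning tutorial notebook is available in the cookbooks folder of the GitHub repository.

vLLM deployment

To serve an OpenAI-compatible API, run the following command: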

vllm serve numind/NuExtract-2.0-8B --trust_remote_code --limit-mm-per-prompt image=6 --chat-template-content-format openai

If you run into memory issues, set --max-model-len accordingly.

To send requests to the model:

import json
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user", 
            "content": [{"type": "text", "text": "Yesterday I went shopping at Bunnings"}],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
        },
    }
)
print("Chat response:", chat_response)

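For image inputs, structure the request as shown below. Make sure the images in "content" are ordered as they appear in the prompt (that is, any in-context example images come before the main input).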

import base64

def encode_image(image_path):
    """
    Encode the image file to base64 string
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("0.jpg")
base64_image2 = encode_image("1.jpg")

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user", 
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}, # input image; images of any ICL examples would be listed before it
            ]
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
        },
    }
)
print("Chat response:", chat_response)
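
The ordering rule described above (in-context example images first, main input last) can be captured in a small helper. This is a sketch under stated assumptions: the helper names image_part and ordered_image_content are hypothetical, and it only builds the "content" list. How the expected outputs of the in-context examples are passed to the server is not shown in this document and is not covered here.

```python
import base64


def image_part(image_bytes, mime_type="image/jpeg"):
    """Wrap raw image bytes as an OpenAI-style image_url content part (base64 data URL)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}


def ordered_image_content(example_images, input_image):
    """Return content parts with in-context example images first and the main input last."""
    return [image_part(data) for data in example_images] + [image_part(input_image)]


# Placeholder bytes stand in for real image files read with open(path, "rb").read()
content = ordered_image_content([b"example-bytes"], b"input-bytes")
print(len(content))
# 2
```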

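📄 License

This project is licensed under the MIT license.

Important notes

NuExtract-2.0-2B is based on Qwen2-VL rather than Qwen2.5-VL because the smallest Qwen2.5-VL model (3B) has a more restrictive, non-commercial license. NuExtract-2.0-2B is therefore included as a small model option that can be used commercially.
When using NuExtract, we recommend setting the temperature to 0 or very close to it. Some inference frameworks, such as Ollama, default to 0.7, which is not well suited to many extraction tasks.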
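
As an illustration of the temperature advice, the sketch below builds a request body for Ollama's /api/chat endpoint with the temperature set explicitly through the options field. The model tag "nuextract-local" is a hypothetical placeholder, and how the extraction template reaches the model depends on how the model was packaged for Ollama, which this document does not describe.

```python
import json


def ollama_chat_payload(model, prompt, temperature=0.0):
    """Build a request body for Ollama's /api/chat with an explicit temperature."""
    return {
        "model": model,  # hypothetical local model tag
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": temperature},  # overrides the framework default
    }


payload = ollama_chat_payload("nuextract-local", "Yesterday I went shopping at Bunnings")
print(json.dumps(payload["options"]))
# {"temperature": 0.0}
```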