🚀 NuExtract 2.0 8B by NuMind
NuExtract 2.0 is a family of models designed specifically for structured information extraction. The models support multimodal (text and image) inputs, are multilingual, and come in several sizes based on pre-trained models from the QwenVL family.
API / Platform   |   Blog   |   Discord
🚀 Quick Start
To get started with NuExtract 2.0, follow the steps and examples in this README. The key things to understand are how to prepare input templates and how to process the different input types (text or images).
✨ Features
- Multimodal Support: Supports both text and image inputs, enabling more comprehensive information extraction.
- Multilingual: Can handle information extraction tasks in multiple languages.
- Multiple Model Sizes: Offers different model sizes (2B, 4B, 8B) to meet various resource and performance requirements.
- Flexible Template Definition: Allows users to define custom JSON templates for information extraction.
📦 Installation
The original README doesn't provide detailed installation steps, but you can load the model with the `transformers` library. Here is an example of loading the model:
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "numind/NuExtract-2.0-2B"
# model_name = "numind/NuExtract-2.0-8B"

model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; omit if unavailable
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
    padding_side='left',
    use_fast=True,
)
```
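All of the usage examples below rely on a `process_all_vision_info` helper that gathers image inputs from the chat messages and from any in-context examples. Its definition is not reproduced in this README; the following is a minimal sketch, assuming the `qwen-vl-utils` package from the QwenVL ecosystem is installed (`pip install qwen-vl-utils`):

```python
from qwen_vl_utils import fetch_image, process_vision_info

def process_all_vision_info(messages, examples=None):
    """Collect image inputs from chat messages and optional in-context examples.

    Minimal sketch, not the official implementation: `messages` may be a single
    conversation or a batch (a list of conversations); `examples` may be a flat
    list of example dicts or a per-conversation list of lists.
    """
    # Images referenced in the user messages (video inputs are ignored here)
    image_inputs, _ = process_vision_info(messages)
    image_inputs = list(image_inputs or [])

    # Flatten examples into a single list of example dicts
    flat_examples = []
    for e in examples or []:
        if e is None:
            continue
        flat_examples.extend(e if isinstance(e, list) else [e])

    # Load images that appear as in-context example inputs
    for example in flat_examples:
        ex_input = example.get("input")
        if isinstance(ex_input, dict) and ex_input.get("type") == "image":
            image_inputs.append(fetch_image(ex_input))

    return image_inputs or None
```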
💻 Usage Examples
Basic Usage
To perform a basic extraction of names from a text document:
```python
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."

# prepare the user message content
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,  # template is specified here
    tokenize=False,
    add_generation_prompt=True,
)
print(text)
"""<|im_start|>user
# Template:
{"names": ["string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|>
<|im_start|>assistant"""

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# we choose greedy decoding here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["John", "Mary", "James"]}']
```
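The returned value is a JSON string, so you can parse it with the standard library:

```python
import json

# Parse the model output into a Python object
result = json.loads(output_text[0])
print(result["names"])  # ['John', 'Mary', 'James']
```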
Advanced Usage
In-Context Examples
Providing in-context examples can help the model better understand the task; the model will also mimic the formatting used in the example outputs. Here is an example:
```python
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
    }
]

messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    examples=examples,  # examples provided here
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages, examples)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# greedy decoding works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']
```
Image Inputs
If you want to give image inputs to NuExtract:
```python
template = """{"store": "verbatim-string"}"""
document = {"type": "image", "image": "file://1.jpg"}

messages = [{"role": "user", "content": [document]}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"store": "Trader Joe\'s"}']
```
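The image content dict follows the QwenVL message convention, so sources other than local `file://` paths should also work (support depends on the image-loading helper you use):

```python
# Alternative image sources (illustrative; support depends on the vision-loading helper)
document = {"type": "image", "image": "https://example.com/receipt.jpg"}  # remote URL
# document = {"type": "image", "image": "data:image/jpeg;base64,..."}     # base64-encoded image
```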
Batch Inference
NuExtract 2.0 also supports batched requests that mix text and image inputs, with or without in-context examples:
```python
inputs = [
    # image input with no ICL examples
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
    },
    # image input with 1 ICL example
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": {"type": "image", "image": "file://1.jpg"},
                "output": """{"store_name": "Trader Joe's"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
    },
    # text input with 1 ICL example
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"]}"""
            }
        ],
    },
]

# messages should be a list of lists for batch processing
messages = [
    [
        {
            "role": "user",
            "content": [x['document']],
        }
    ]
    for x in inputs
]

# apply the chat template to each example individually
texts = [
    processor.tokenizer.apply_chat_template(
        messages[i],  # a list containing one message
        template=x['template'],
        examples=x.get('examples', None),
        tokenize=False,
        add_generation_prompt=True,
    )
    for i, x in enumerate(inputs)
]

image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
inputs = processor(  # note: this rebinds `inputs` from the request list to the tensor batch
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Batch inference
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for y in output_texts:
    print(y)
# {"store_name": "WAL-MART"}
# {"store_name": "Walmart"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"]}
```
Template Generation
You can use NuExtract 2.0 to generate templates from existing schema files or natural language descriptions.
Convert XML into a NuExtract template
```python
xml_template = """<SportResult>
<Date></Date>
<Sport></Sport>
<Venue></Venue>
<HomeTeam></HomeTeam>
<AwayTeam></AwayTeam>
<HomeScore></HomeScore>
<AwayScore></AwayScore>
<TopScorer></TopScorer>
</SportResult>"""

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": xml_template}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
#   "Date": "date-time",
#   "Sport": "verbatim-string",
#   "Venue": "verbatim-string",
#   "HomeTeam": "verbatim-string",
#   "AwayTeam": "verbatim-string",
#   "HomeScore": "integer",
#   "AwayScore": "integer",
#   "TopScorer": "verbatim-string"
# }
```
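The generated template is plain text, so it can be fed straight back into the extraction workflow shown earlier. A sketch (the document text here is made up for illustration):

```python
# Reuse the generated template for extraction (hypothetical document text)
template = output_text[0]
document = "On Saturday, Arsenal beat Chelsea 2-1 at the Emirates Stadium; Saka was the top scorer."
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages, template=template, tokenize=False, add_generation_prompt=True
)
# ... then tokenize and generate exactly as in the Basic Usage example
```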
Generate a template from a natural language description
```python
description = "I would like to extract important details from the contract."

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": description}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
#   "Contract": {
#     "Title": "verbatim-string",
#     "Description": "verbatim-string",
#     "Terms": [
#       {
#         "Term": "verbatim-string",
#         "Description": "verbatim-string"
#       }
#     ],
#     "Date": "date-time",
#     "Signatory": "verbatim-string"
#   }
# }
```
📚 Documentation
Model Overview
The NuExtract 2.0 family of models is designed for structured information extraction. It supports multimodal inputs (text and images) and is multilingual. The models are based on pre-trained models from the QwenVL family, and different sizes are available to meet various needs.
Input Template
To use the model, you need to provide an input text/image and a JSON template describing the information you need to extract. The template should be a JSON object, specifying field names and their expected type.
Supported Types
- `verbatim-string`: extract text that is present verbatim in the input.
- `string`: a generic string field that can incorporate paraphrasing/abstraction.
- `integer`: a whole number.
- `number`: a whole or decimal number.
- `date-time`: an ISO-formatted date.
- Array of any of the above types (e.g. `["string"]`).
- `enum`: a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
- `multi-label`: an enum that can have multiple answers (represented in the template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).

If the model does not identify relevant information for a field, it will return `null` or `[]` (for arrays and multi-labels).
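For illustration, a hypothetical template combining several of these types might look like this (field names are invented, not from the original README):

```python
# Hypothetical invoice template combining several supported types
template = """{
    "invoice_id": "verbatim-string",
    "issue_date": "date-time",
    "total_amount": "number",
    "item_count": "integer",
    "summary": "string",
    "payment_status": ["paid", "unpaid", "partial"],
    "tags": [["urgent", "recurring", "disputed"]],
    "line_items": [{"description": "string", "quantity": "integer"}]
}"""
```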
🔧 Technical Details
NuExtract 2.0 builds on the `transformers` library, using the `AutoProcessor` and `AutoModelForVision2Seq` classes to load the processor and model. It supports multimodal inputs by handling both text and image data; the `process_all_vision_info` helper (sketched in the Installation section above) handles the loading and preprocessing of image inputs.
📄 License
The `NuExtract-2.0-2B` and `NuExtract-2.0-8B` models are released under the MIT license. The `NuExtract-2.0-4B` model is under the Qwen Research License.
| Property | Details |
|---|---|
| Model Type | Structured information extraction model |
| Training Data | Not provided in the original README |
⚠️ Important Note
We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7, which is not well suited to many extraction tasks.
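For example, when serving the model through Ollama's HTTP API, you can override the default temperature per request. A sketch, where the model tag "nuextract" is hypothetical (use whatever name you registered):

```python
import requests

# Hypothetical local Ollama call; "nuextract" is an assumed tag, not an official model name.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nuextract",
        # Prompt follows the "# Template: ... # Context: ..." layout shown in Basic Usage
        "prompt": '# Template:\n{"names": ["string"]}\n# Context:\nJohn went to the restaurant with Mary.',
        "options": {"temperature": 0},  # override Ollama's 0.7 default
        "stream": False,
    },
)
print(response.json()["response"])
```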
💡 Usage Tip
When using NuExtract, providing multiple in-context examples usually leads to better results.