🚀 NuExtract 2.0 8B by NuMind
NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual. We provide several sizes, all based on pre-trained models from the QwenVL series.
🚀 Quick Start
Model Selection
We provide model versions of different sizes; details are as follows:
Model Size | Model Name | Base Model | License | Huggingface Link |
---|---|---|---|---|
2B | NuExtract-2.0-2B | Qwen2-VL-2B-Instruct | MIT | NuExtract-2.0-2B |
4B | NuExtract-2.0-4B | Qwen2.5-VL-3B-Instruct | Qwen Research License | NuExtract-2.0-4B |
8B | NuExtract-2.0-8B | Qwen2.5-VL-7B-Instruct | MIT | NuExtract-2.0-8B |
⚠️ Important Note
NuExtract-2.0-2B is based on Qwen2-VL rather than Qwen2.5-VL, because the smallest Qwen2.5-VL model (3B) comes with a more restrictive, non-commercial license. We therefore offer NuExtract-2.0-2B as the small-model option that can be used commercially.
✨ Key Features
- Multimodal support: accepts both text and image inputs.
- Multilingual: capable of handling multiple languages.
- Multiple output types: supports a variety of output types, such as strings, integers, and dates.
📦 Installation
Before use, make sure the `transformers` library is installed. The examples below also rely on `torch`, `qwen-vl-utils` (for image handling), `accelerate` (for `device_map="auto"`), and `flash-attn` (for the flash-attention option), so install those as needed. The following example code loads the model with `transformers`:
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
model_name = "numind/NuExtract-2.0-2B"
# model_name = "numind/NuExtract-2.0-8B"
model = AutoModelForVision2Seq.from_pretrained(model_name,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto")
processor = AutoProcessor.from_pretrained(model_name,
trust_remote_code=True,
padding_side='left',
use_fast=True)
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_name, min_pixels=min_pixels, max_pixels=max_pixels)
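The snippet above assumes `flash-attn` is installed. If it is not available in your environment, a minimal fallback sketch (our suggestion, assuming the standard SDPA backend is acceptable for your use case) is to load the model without the flash-attention flag:
# Fallback if flash-attn is not installed: use the default (SDPA) attention backend instead
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # or omit this argument to use the library default
    device_map="auto",
)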
💻 Usage Examples
Basic Usage
Below is a basic example of extracting names from a text:
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
# prepare the user message content
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
messages,
template=template, # template is specified here
tokenize=False,
add_generation_prompt=True,
)
print(text)
"""<|im_start|>user
# Template:
{"names": ["string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|>
<|im_start|>assistant"""
image_inputs = process_all_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
padding=True,
return_tensors="pt",
).to("cuda")
# we choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
# Inference: Generation of the output
generated_ids = model.generate(
**inputs,
**generation_config
)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["John", "Mary", "James"]}']
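The decoded output is a plain JSON string, so if you want to work with the extracted fields programmatically you can parse it directly (a minimal sketch using the result shown above):
import json

result = json.loads(output_text[0])  # parse the model output into a Python dict
print(result["names"])               # ['John', 'Mary', 'James']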
Advanced Usage
In-Context Examples
Sometimes the model may not perform as expected because the task is challenging or somewhat ambiguous. In such cases, providing in-context examples can help the model better understand the task. Here is an example that uses in-context examples:
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
examples = [
{
"input": "Stephen is the manager at Susan's store.",
"output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
}
]
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
messages,
template=template,
examples=examples, # examples provided here
tokenize=False,
add_generation_prompt=True,
)
image_inputs = process_all_vision_info(messages, examples)
inputs = processor(
text=[text],
images=image_inputs,
padding=True,
return_tensors="pt",
).to("cuda")
# we choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
# Inference: Generation of the output
generated_ids = model.generate(
**inputs,
**generation_config
)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']
Image Inputs
To give the model image input, simply specify the message content as a dictionary describing the desired image file instead of a string. Here is an example using an image input:
template = """{"store": "verbatim-string"}"""
document = {"type": "image", "image": "file://1.jpg"}
messages = [{"role": "user", "content": [document]}]
text = processor.tokenizer.apply_chat_template(
messages,
template=template,
tokenize=False,
add_generation_prompt=True,
)
image_inputs = process_all_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
padding=True,
return_tensors="pt",
).to("cuda")
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
# Inference: Generation of the output
generated_ids = model.generate(
**inputs,
**generation_config
)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"store": "Trader Joe\'s"}']
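The `file://` URI above points to a local image file. If your images come from elsewhere, `qwen-vl-utils` (which backs `process_all_vision_info` below) generally also resolves http(s) URLs and already-loaded PIL images; the URL and filename here are illustrative placeholders:
from PIL import Image

# Alternative ways to reference an image (placeholders; adapt to your data source)
document = {"type": "image", "image": "https://example.com/receipt.jpg"}   # remote URL
# document = {"type": "image", "image": Image.open("receipt.jpg")}         # pre-loaded PIL image
messages = [{"role": "user", "content": [document]}]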
Batch Inference
Here is a batch inference example:
inputs = [
# image input with no ICL examples
{
"document": {"type": "image", "image": "file://0.jpg"},
"template": """{"store_name": "verbatim-string"}""",
},
# image input with 1 ICL example
{
"document": {"type": "image", "image": "file://0.jpg"},
"template": """{"store_name": "verbatim-string"}""",
"examples": [
{
"input": {"type": "image", "image": "file://1.jpg"},
"output": """{"store_name": "Trader Joe's"}""",
}
],
},
# text input with no ICL examples
{
"document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
"template": """{"names": ["string"]}""",
},
# text input with ICL example
{
"document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
"template": """{"names": ["string"]}""",
"examples": [
{
"input": "Stephen is the manager at Susan's store.",
"output": """{"names": ["STEPHEN", "SUSAN"]}"""
}
],
},
]
# messages should be a list of lists for batch processing
messages = [
[
{
"role": "user",
"content": [x['document']],
}
]
for x in inputs
]
# apply chat template to each example individually
texts = [
processor.tokenizer.apply_chat_template(
messages[i], # Now this is a list containing one message
template=x['template'],
examples=x.get('examples', None),
tokenize=False,
add_generation_prompt=True)
for i, x in enumerate(inputs)
]
image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
inputs = processor(
text=texts,
images=image_inputs,
padding=True,
return_tensors="pt",
).to("cuda")
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
# Batch Inference
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for y in output_texts:
print(y)
# {"store_name": "WAL-MART"}
# {"store_name": "Walmart"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"]}
Template Generation
If you want to convert an existing schema file in another format (such as XML or YAML) into a NuExtract template, or generate a template from a natural-language description, the NuExtract 2.0 models can generate one for you automatically. Here are two examples:
# convert XML into a NuExtract template
xml_template = """<SportResult>
<Date></Date>
<Sport></Sport>
<Venue></Venue>
<HomeTeam></HomeTeam>
<AwayTeam></AwayTeam>
<HomeScore></HomeScore>
<AwayScore></AwayScore>
<TopScorer></TopScorer>
</SportResult>"""
messages = [
{
"role": "user",
"content": [{"type": "text", "text": xml_template}],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True,
)
image_inputs = process_all_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
padding=True,
return_tensors="pt",
).to("cuda")
generated_ids = model.generate(
**inputs,
**generation_config
)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
# "Date": "date-time",
# "Sport": "verbatim-string",
# "Venue": "verbatim-string",
# "HomeTeam": "verbatim-string",
# "AwayTeam": "verbatim-string",
# "HomeScore": "integer",
# "AwayScore": "integer",
# "TopScorer": "verbatim-string"
# }
# generate a template from natural language description
description = "I would like to extract important details from the contract."
messages = [
{
"role": "user",
"content": [{"type": "text", "text": description}],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True,
)
image_inputs = process_all_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
padding=True,
return_tensors="pt",
).to("cuda")
generated_ids = model.generate(
**inputs,
**generation_config
)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
# "Contract": {
# "Title": "verbatim-string",
# "Description": "verbatim-string",
# "Terms": [
# {
# "Term": "verbatim-string",
# "Description": "verbatim-string"
# }
# ],
# "Date": "date-time",
# "Signatory": "verbatim-string"
# }
# }
📚 Detailed Documentation
Supported Template Types
When using the model, you provide an input text/image together with a JSON template describing the information you want to extract. The template should be a JSON object specifying field names and their expected types. Supported types include:
- `verbatim-string` - instructs the model to extract text that appears verbatim in the input.
- `string` - a generic string field that can contain paraphrased/abstractive content.
- `integer` - a whole number.
- `number` - a whole or decimal number.
- `date-time` - an ISO-formatted date.
- An array of any of the above types (e.g. `["string"]`).
- `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
- `multi-label` - an enum that can have multiple possible answers (represented in the template as a doubly-wrapped array, e.g. `[["A", "B", "C"]]`).

If the model does not identify relevant information for a field, it will return `null` or `[]` (for arrays and multi-label).
Here is an example template:
{
"first_name": "verbatim-string",
"last_name": "verbatim-string",
"description": "string",
"age": "integer",
"gpa": "number",
"birth_date": "date-time",
"nationality": ["France", "England", "Japan", "USA", "China"],
"languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}
An example output:
{
"first_name": "Susan",
"last_name": "Smith",
"description": "A student studying computer science.",
"age": 20,
"gpa": 3.7,
"birth_date": "2005-03-01",
"nationality": "England",
"languages_spoken": ["English", "French"]
}
💡 Usage Tips
We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks (such as Ollama) use a default temperature of 0.7, which is not well suited to many extraction tasks.
🔧 Technical Details
Image Input Processing Function
The following function handles loading of image input data:
def process_all_vision_info(messages, examples=None):
"""
Process vision information from both messages and in-context examples, supporting batch processing.
Args:
messages: List of message dictionaries (single input) OR list of message lists (batch input)
examples: Optional list of example dictionaries (single input) OR list of example lists (batch)
Returns:
A flat list of all images in the correct order:
- For single input: example images followed by message images
- For batch input: interleaved as (item1 examples, item1 input, item2 examples, item2 input, etc.)
- Returns None if no images were found
"""
from qwen_vl_utils import process_vision_info, fetch_image
# Helper function to extract images from examples
def extract_example_images(example_item):
if not example_item:
return []
# Handle both list of examples and single example
examples_to_process = example_item if isinstance(example_item, list) else [example_item]
images = []
for example in examples_to_process:
if isinstance(example.get('input'), dict) and example['input'].get('type') == 'image':
images.append(fetch_image(example['input']))
return images
# Normalize inputs to always be batched format
is_batch = messages and isinstance(messages[0], list)
messages_batch = messages if is_batch else [messages]
is_batch_examples = examples and isinstance(examples, list) and (isinstance(examples[0], list) or examples[0] is None)
examples_batch = examples if is_batch_examples else ([examples] if examples is not None else None)
# Ensure examples batch matches messages batch if provided
if examples and len(examples_batch) != len(messages_batch):
if not is_batch and len(examples_batch) == 1:
# Single example set for a single input is fine
pass
else:
raise ValueError("Examples batch length must match messages batch length")
# Process all inputs, maintaining correct order
all_images = []
for i, message_group in enumerate(messages_batch):
# Get example images for this input
if examples and i < len(examples_batch):
input_example_images = extract_example_images(examples_batch[i])
all_images.extend(input_example_images)
# Get message images for this input
input_message_images = process_vision_info(message_group)[0] or []
all_images.extend(input_message_images)
return all_images if all_images else None
📄 License
This project is released under the MIT License.
Fine-Tuning and Deployment
Fine-Tuning
You can find fine-tuning tutorial notebooks in the cookbooks folder of the GitHub repository.
vLLM Deployment
Run the following command to serve an OpenAI-compatible API:
vllm serve numind/NuExtract-2.0-8B --trust_remote_code --limit-mm-per-prompt image=6 --chat-template-content-format openai
If you run into memory issues, set `--max-model-len` accordingly (for example, `--max-model-len 8192`; the right value depends on your hardware and inputs).
Here is example code for sending a request to the model:
import json
from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
model="numind/NuExtract-2.0-8B",
temperature=0,
messages=[
{
"role": "user",
"content": [{"type": "text", "text": "Yesterday I went shopping at Bunnings"}],
},
],
extra_body={
"chat_template_kwargs": {
"template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
},
}
)
print("Chat response:", chat_response)
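The extraction result comes back as the assistant message content, which is again a JSON string, so it can be read back like this (a small sketch; the exact value depends on the model output):
prediction = json.loads(chat_response.choices[0].message.content)  # parse the assistant reply (json imported above)
print(prediction)  # e.g. {"store": "Bunnings"}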
For image inputs, the request structure is shown below. Make sure the images in `"content"` are ordered as they appear in the prompt (i.e. any in-context example images before the main input).
import base64
def encode_image(image_path):
"""
Encode the image file to base64 string
"""
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
base64_image = encode_image("0.jpg")
base64_image2 = encode_image("1.jpg")
chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}, # first ICL example image
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image2}"}}, # main input image
            ],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            # as in the text-only request above, the template is passed through chat_template_kwargs;
            # in-context example outputs would additionally go through the "examples" kwarg accepted by
            # the chat template (analogous to the tokenizer-based examples earlier; exact format may vary)
            "template": json.dumps(json.loads("""{\"store_name\": \"verbatim-string\"}"""), indent=4),
        },
    }
)
print("Chat response:", chat_response)