# 🚀 NuExtract 2.0 4B

NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual. We provide several versions of different sizes, all based on pre-trained models from the QwenVL family.
## 🚀 Quick Start

To use the model, provide an input text or image along with a JSON template describing the information you want to extract. The template should be a JSON object specifying field names and their expected types.

Supported types include:
- `verbatim-string` - instructs the model to extract text that appears verbatim in the input.
- `string` - a generic string field that can incorporate paraphrasing or abstraction.
- `integer` - a whole number.
- `number` - a whole or decimal number.
- `date-time` - an ISO-formatted date.
- An array of any of the above types (e.g. `["string"]`).
- `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
- `multi-label` - an enum that can have multiple possible answers (represented in the template as a doubly wrapped array, e.g. `[["A", "B", "C"]]`).

If the model does not identify relevant information for a field, it returns `null` or `[]` (for arrays and multi-labels).
An example template:

```json
{
    "first_name": "verbatim-string",
    "last_name": "verbatim-string",
    "description": "string",
    "age": "integer",
    "gpa": "number",
    "birth_date": "date-time",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}
```
An example output:

```json
{
    "first_name": "Susan",
    "last_name": "Smith",
    "description": "A student studying computer science.",
    "age": 20,
    "gpa": 3.7,
    "birth_date": "2005-03-01",
    "nationality": "England",
    "languages_spoken": ["English", "French"]
}
```
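To make the type rules above concrete, here is a minimal, hypothetical validator (not part of NuExtract or `transformers`) showing how each template type maps onto a Python check of the extraction result:

```python
# Hypothetical helper: check an extraction result against a NuExtract-style template.
LEAF_CHECKS = {
    "verbatim-string": lambda v: isinstance(v, str),
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "date-time": lambda v: isinstance(v, str),  # ISO-formatted date string
}

def matches(value, spec):
    """Return True if an extracted value conforms to a template spec."""
    if value is None:
        return True  # missing information comes back as null
    if isinstance(spec, str):  # leaf type, e.g. "integer"
        return LEAF_CHECKS[spec](value)
    if isinstance(spec, dict):  # nested object: check every field
        return isinstance(value, dict) and all(
            matches(value.get(k), s) for k, s in spec.items()
        )
    if isinstance(spec, list):
        if len(spec) == 1 and isinstance(spec[0], list):
            # multi-label, e.g. [["A", "B", "C"]]: a list of allowed choices
            return isinstance(value, list) and all(v in spec[0] for v in value)
        if len(spec) == 1 and (
            isinstance(spec[0], dict)
            or (isinstance(spec[0], str) and spec[0] in LEAF_CHECKS)
        ):
            # typed array, e.g. ["string"] or an array of objects
            return isinstance(value, list) and all(matches(v, spec[0]) for v in value)
        # enum, e.g. ["yes", "no", "maybe"]: exactly one choice
        return value in spec
    return False
```

For instance, `matches({"age": 20}, {"age": "integer"})` returns `True`, while a string `"20"` in that field would fail the check.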
## ⚠️ Important Note

We recommend using NuExtract with a temperature at, or very close to, 0. Some inference frameworks (e.g. Ollama) default to a temperature of 0.7, which is not well suited to many extraction tasks.
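For example, when calling a model through Ollama's REST API, the sampling temperature can be pinned per request via the `options` field (the model name and prompt here are purely illustrative):

```json
{
  "model": "nuextract",
  "prompt": "...",
  "options": { "temperature": 0 }
}
```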
## ✨ Key Features

- Multimodal: supports both text and image inputs.
- Multilingual: handles multiple languages.
- Multiple model sizes: available in 2B, 4B, and 8B variants.
## 📦 Installation

Load the model with the `transformers` library:

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "numind/NuExtract-2.0-2B"
# model_name = "numind/NuExtract-2.0-8B"

model = AutoModelForVision2Seq.from_pretrained(model_name,
                                               trust_remote_code=True,
                                               torch_dtype=torch.bfloat16,
                                               attn_implementation="flash_attention_2",
                                               device_map="auto")
processor = AutoProcessor.from_pretrained(model_name,
                                          trust_remote_code=True,
                                          padding_side='left',
                                          use_fast=True)

# You can set min_pixels and max_pixels according to your needs, e.g. a token
# range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_name, min_pixels=min_pixels, max_pixels=max_pixels)
```
## 💻 Usage Examples

### Basic Usage

For example, to extract names from a text document:

```python
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."

# prepare the user message content
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,  # template is specified here
    tokenize=False,
    add_generation_prompt=True,
)
print(text)
"""<|im_start|>user
# Template:
{"names": ["string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|>
<|im_start|>assistant"""

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# we choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# inference: generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["John", "Mary", "James"]}']
```
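The model returns the extraction as a JSON string, so a typical downstream step (our own sketch, not part of the library) is to parse it back into a Python object, guarding against malformed output:

```python
import json

def parse_extraction(output_text):
    """Parse a NuExtract output string into a Python object.

    Returns None on malformed JSON (rare at temperature 0, but worth
    guarding against in a pipeline).
    """
    try:
        return json.loads(output_text)
    except json.JSONDecodeError:
        return None

result = parse_extraction('{"names": ["John", "Mary", "James"]}')
print(result["names"])
# ['John', 'Mary', 'James']
```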
### Advanced Usage

#### In-Context Examples

Sometimes the model may not perform as well as we want because the task is challenging or involves some degree of ambiguity. Alternatively, we may want the model to follow specific formatting, or simply to give it more help. In cases like this, providing "in-context examples" can help NuExtract better understand the task.

```python
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
    }
]

messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    examples=examples,  # examples provided here
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages, examples)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# we choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# inference: generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']
```
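If you already have a few gold annotations, the `examples` list can be built mechanically from them (a small sketch; `gold_pairs` is hypothetical data, with the dashes mirroring the formatting we want the model to reproduce):

```python
import json

# hypothetical gold annotations: (input text, expected extraction) pairs
gold_pairs = [
    ("Stephen is the manager at Susan's store.", {"names": ["-STEPHEN-", "-SUSAN-"]}),
]

# serialize each label to a JSON string, matching the expected "output" format
examples = [
    {"input": text, "output": json.dumps(label)}
    for text, label in gold_pairs
]
print(examples[0]["output"])
# {"names": ["-STEPHEN-", "-SUSAN-"]}
```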
#### Image Inputs

To give image inputs to NuExtract, provide a dictionary specifying the desired image file as the message content instead of a string (e.g. `{"type": "image", "image": "file://image.jpg"}`).

```python
template = """{"store": "verbatim-string"}"""
document = {"type": "image", "image": "file://1.jpg"}

messages = [{"role": "user", "content": [document]}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# inference: generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"store": "Trader Joe\'s"}']
```
#### Batch Inference

```python
inputs = [
    # image input with no ICL examples
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
    },
    # image input with 1 ICL example
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": {"type": "image", "image": "file://1.jpg"},
                "output": """{"store_name": "Trader Joe's"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
    },
    # text input with ICL example
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"]}"""
            }
        ],
    },
]

# messages should be a list of lists for batch processing
messages = [
    [
        {
            "role": "user",
            "content": [x['document']],
        }
    ]
    for x in inputs
]

# apply the chat template to each example individually
texts = [
    processor.tokenizer.apply_chat_template(
        messages[i],  # now this is a list containing one message
        template=x['template'],
        examples=x.get('examples', None),
        tokenize=False,
        add_generation_prompt=True)
    for i, x in enumerate(inputs)
]

image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# batch inference
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for y in output_texts:
    print(y)
# {"store_name": "WAL-MART"}
# {"store_name": "Walmart"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"]}
```
#### Template Generation

If you want to convert schema files you already have in other formats (e.g. XML, YAML, etc.), or start from an example, NuExtract 2.0 models can automatically generate a template for you.

```python
# convert XML into a NuExtract template
xml_template = """<SportResult>
    <Date></Date>
    <Sport></Sport>
    <Venue></Venue>
    <HomeTeam></HomeTeam>
    <AwayTeam></AwayTeam>
    <HomeScore></HomeScore>
    <AwayScore></AwayScore>
    <TopScorer></TopScorer>
</SportResult>"""

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": xml_template}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
#     "Date": "date-time",
#     "Sport": "verbatim-string",
#     "Venue": "verbatim-string",
#     "HomeTeam": "verbatim-string",
#     "AwayTeam": "verbatim-string",
#     "HomeScore": "integer",
#     "AwayScore": "integer",
#     "TopScorer": "verbatim-string"
# }
```

```python
# generate a template from a natural-language description
description = "I would like to extract important details from the contract."

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": description}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
#     "Contract": {
#         "Title": "verbatim-string",
#         "Description": "verbatim-string",
#         "Terms": [
#             {
#                 "Term": "verbatim-string",
#                 "Description": "verbatim-string"
#             }
#         ],
#         "Date": "date-time",
#         "Signatory": "verbatim-string"
#     }
# }
```
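The generated template is itself just JSON text, so it can be parsed, checked, and fed straight back in as the `template` argument of a subsequent extraction call. A small sketch of that round trip (the `generated` string here abbreviates the model output shown above):

```python
import json

# abbreviated output text from a template-generation call
generated = """{
    "Date": "date-time",
    "Sport": "verbatim-string"
}"""

# parse to verify it is valid JSON, then re-serialize so the result can be
# passed as the `template` argument of the next extraction request
template = json.dumps(json.loads(generated), indent=4)
print(template)
```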
## 📚 Documentation

### Model Versions

| Model Size | Model Name | Base Model | License | Huggingface Link |
|---|---|---|---|---|
| 2B | NuExtract-2.0-2B | Qwen2-VL-2B-Instruct | MIT | NuExtract-2.0-2B |
| 4B | NuExtract-2.0-4B | Qwen2.5-VL-3B-Instruct | Qwen Research License | NuExtract-2.0-4B |
| 8B | NuExtract-2.0-8B | Qwen2.5-VL-7B-Instruct | MIT | NuExtract-2.0-8B |

### Benchmarks

Performance on a collection of ~1,000 diverse extraction examples containing both text and image inputs.
## 🔧 Technical Details

### Fine-Tuning

You can find a fine-tuning tutorial notebook in the `cookbooks` folder of the GitHub repository.

### vLLM Deployment

Run the following command to serve an OpenAI-compatible API:

```bash
vllm serve numind/NuExtract-2.0-8B --trust_remote_code --limit-mm-per-prompt image=6 --chat-template-content-format openai
```

If you encounter memory issues, set `--max-model-len` accordingly.
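For instance (8192 here is an arbitrary illustrative value; choose one that fits your GPU memory and your longest documents):

```bash
vllm serve numind/NuExtract-2.0-8B --trust_remote_code --limit-mm-per-prompt image=6 \
    --chat-template-content-format openai --max-model-len 8192
```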
Example code for sending requests to the model:

```python
import json
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [{"type": "text", "text": "Yesterday I went shopping at Bunnings"}],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
        },
    }
)
print("Chat response:", chat_response)
```
For image inputs, structure the request as shown below. Make sure to order the images in `"content"` as they appear in the prompt (i.e. any in-context examples before the main input).

```python
import base64

def encode_image(image_path):
    """
    Encode an image file as a base64 string.
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("0.jpg")
base64_image2 = encode_image("1.jpg")

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},  # first ICL example image
                # further images...
            ]
        }
    ]
)
```
## 📄 License

| Model Name | License |
|---|---|
| NuExtract-2.0-2B | MIT |
| NuExtract-2.0-4B | Qwen Research License |
| NuExtract-2.0-8B | MIT |

Note that `NuExtract-2.0-2B` is based on Qwen2-VL rather than Qwen2.5-VL, because the smallest Qwen2.5-VL model (3B) comes with a more restrictive, non-commercial license. We therefore offer `NuExtract-2.0-2B` as a small-model option that can be used commercially.








