NuExtract-2.0-2B: Open-Source Multimodal, Multilingual Model for Structured Information Extraction, Free to Deploy

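NuExtract 2.0 2B

Developed by numind

NuExtract 2.0 is a family of multimodal, multilingual models trained specifically for structured information extraction, built on pretrained models from the QwenVL series.

Multimodal fusion

Transformers

License: MIT #Multimodal information extraction #Structured data generation #Multilingual support

Downloads: 113

Released: 5/28/2025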

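Model overview

NuExtract 2.0 extracts structured information from text or images. A JSON template specifies the fields and their types, which makes the model suitable for many information extraction scenarios.

Model features

Multimodal support

Extracts information from both text and image inputs

Multilingual capability

Handles input in multiple languages

Template-driven

The fields and types to extract are defined flexibly through a JSON template

In-context example support

Examples can be supplied to guide the model toward a specific output format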

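Model capabilities

Text information extraction

Image information extraction

Multilingual processing

Batch inference

Template generation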

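Use cases

Document processing

Name extraction

Extract all person names from a text document

Accurately identifies and returns every person name in the document

Contract information extraction

Extract key clauses and dates from contract documents

Outputs the key contract information in structured form

Image analysis

Store sign recognition

Identify the store name from a storefront photo

Accurately extracts the store name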

🚀 NuExtract 2.0 2B by NuMind

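NuExtract 2.0 is a family of models trained specifically for structured information extraction. It accepts multimodal input and handles multiple languages. We provide several model sizes, all based on pretrained models from the QwenVL series.

🚀 Quick start

To use the model, provide an input text or image together with a JSON template that describes the information to extract. The template should be a JSON object that specifies the field names and their expected types.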
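
As a small illustration of the template described above, the sketch below builds one from a Python dict and serializes it with `json.dumps`. The field names (`store`, `visit_date`, `items`) are hypothetical and chosen only for this example; the type strings are the ones listed in the supported types section of this document.

```python
import json

# Hypothetical fields for illustration; only the type strings
# ("verbatim-string", "date-time", "string") come from the documented types.
template_dict = {
    "store": "verbatim-string",
    "visit_date": "date-time",
    "items": ["string"],  # an array of strings
}

# The examples in this document pass the template as a JSON string.
template = json.dumps(template_dict, indent=4)
print(template)
```

Writing the template as a dict first and serializing it avoids hand-escaping quotes inside a string literal.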

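✨ Key features

Multimodal support: accepts text and image inputs.
Multilingual capability: handles multiple languages.
Multiple versions available: models come in several sizes, such as 2B, 4B, and 8B.

📦 Installation

The source documentation does not give installation steps. The usage examples assume that `model`, `processor`, and the helper `process_all_vision_info` have already been set up; the only imports they show are `json`, `openai`, and `base64` in the vLLM examples.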
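
Since the setup is not shown, here is a minimal sketch of how `model` and `processor` might be created with Hugging Face Transformers. Everything in it is an assumption rather than a documented procedure: the packages (`torch`, `transformers`, and `accelerate` for `device_map="auto"`), the repository id `numind/NuExtract-2.0-2B` (inferred from the naming in the vLLM command later in this document), and the loading arguments. Depending on the installed `transformers` version, the model class may be exposed under a different name. The helper `process_all_vision_info` is not part of `transformers` and is not reproduced here.

```python
def load_nuextract(model_name: str = "numind/NuExtract-2.0-2B"):
    """Return (model, processor) for a NuExtract 2.0 checkpoint (assumed setup)."""
    # Imported lazily so that defining this function does not require
    # the heavy dependencies to be installed.
    import torch
    from transformers import AutoModelForVision2Seq, AutoProcessor

    model = AutoModelForVision2Seq.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # requires the accelerate package
    )
    processor = AutoProcessor.from_pretrained(
        model_name,
        trust_remote_code=True,
        padding_side="left",  # left padding is the usual choice for batched generation
        use_fast=True,
    )
    return model, processor


# Usage (downloads the checkpoint on first call):
# model, processor = load_nuextract()
```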

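💻 Usage examples

Basic usage

The following basic example extracts names from a text document:

# NOTE: `model`, `processor`, and `process_all_vision_info` are assumed to be defined already; this section does not show their setup
template = """{"names": ["string"]}"""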
document = "John went to the restaurant with Mary. James went to the cinema."

# prepare the user message content
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template, # template is specified here
    tokenize=False,
    add_generation_prompt=True,
)

print(text)
"""<|im_start|>user
# Template:
{"names": ["string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|> 
<|im_start|>assistant"""

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# we choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)
# ['{"names": ["John", "Mary", "James"]}']

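Advanced usage

In-context examples

Sometimes the model underperforms because the task is challenging or ambiguous. At other times we may want the model to follow a specific format. In both cases, providing "in-context examples" can help NuExtract understand the task better.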

template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
    }
]

messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    examples=examples, # examples provided here
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages, examples)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# we choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']
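
The in-context examples above are plain dicts with an `input` and a JSON-string `output`. As a hedged convenience sketch, the hypothetical helper `make_example` below (not part of NuExtract or `transformers`) builds such an entry from a Python dict, so that the output string is always valid JSON.

```python
import json


def make_example(input_text: str, output: dict) -> dict:
    """Build one in-context example in the {"input", "output"} shape used above."""
    # ensure_ascii=False keeps non-ASCII text readable in multilingual examples.
    return {"input": input_text, "output": json.dumps(output, ensure_ascii=False)}


examples = [
    make_example(
        "Stephen is the manager at Susan's store.",
        {"names": ["-STEPHEN-", "-SUSAN-"]},
    ),
]
print(examples[0]["output"])
# {"names": ["-STEPHEN-", "-SUSAN-"]}
```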

Image input

To give NuExtract an image input, pass a dictionary that specifies the image file as the message content instead of a string.

template = """{"store": "verbatim-string"}"""
document = {"type": "image", "image": "file://1.jpg"}

messages = [{"role": "user", "content": [document]}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: Generation of the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"store": "Trader Joe\'s"}']
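
The text and image examples differ only in the shape of the message content: a plain string for text, and a list holding an image dictionary for images. The sketch below captures that difference in one hypothetical helper, `make_user_messages`, which is an illustration and not part of any library.

```python
def make_user_messages(text=None, image_path=None):
    """Build the `messages` list for either a text or an image document."""
    if (text is None) == (image_path is None):
        raise ValueError("provide exactly one of text or image_path")
    if image_path is not None:
        # Image documents are passed as a list containing one image dict.
        content = [{"type": "image", "image": f"file://{image_path}"}]
    else:
        # Text documents are passed as a plain string.
        content = text
    return [{"role": "user", "content": content}]


print(make_user_messages(text="John went to the restaurant with Mary."))
print(make_user_messages(image_path="1.jpg"))
```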

Batch inference

inputs = [
    # image input with no ICL examples
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
    },
    # image input with 1 ICL example
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": {"type": "image", "image": "file://1.jpg"},
                "output": """{"store_name": "Trader Joe's"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
    },
    # text input with ICL example
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"]}"""
            }
        ],
    },
]

# messages should be a list of lists for batch processing
messages = [
    [
        {
            "role": "user",
            "content": [x['document']],
        }
    ]
    for x in inputs
]

# apply chat template to each example individually
texts = [
    processor.tokenizer.apply_chat_template(
        messages[i],  # Now this is a list containing one message
        template=x['template'],
        examples=x.get('examples', None),
        tokenize=False, 
        add_generation_prompt=True)
    for i, x in enumerate(inputs)
]

image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Batch Inference
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for y in output_texts:
    print(y)
# {"store_name": "WAL-MART"}
# {"store_name": "Walmart"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"]}

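Template generation

If you want to convert an existing schema in another format (XML, YAML, etc.) into a NuExtract template, or start from an example, NuExtract 2.0 models can generate the template for you automatically.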

xml_template = """<SportResult>
    <Date></Date>
    <Sport></Sport>
    <Venue></Venue>
    <HomeTeam></HomeTeam>
    <AwayTeam></AwayTeam>
    <HomeScore></HomeScore>
    <AwayScore></AwayScore>
    <TopScorer></TopScorer>
</SportResult>"""

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": xml_template}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
# {
#     "Date": "date-time",
#     "Sport": "verbatim-string",
#     "Venue": "verbatim-string",
#     "HomeTeam": "verbatim-string",
#     "AwayTeam": "verbatim-string",
#     "HomeScore": "integer",
#     "AwayScore": "integer",
#     "TopScorer": "verbatim-string"
# }

Generating a template from a natural-language description:

description = "I would like to extract important details from the contract."

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": description}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
# {
#     "Contract": {
#         "Title": "verbatim-string",
#         "Description": "verbatim-string",
#         "Terms": [
#             {
#                 "Term": "verbatim-string",
#                 "Description": "verbatim-string"
#             }
#         ],
#         "Date": "date-time",
#         "Signatory": "verbatim-string"
#     }
# }

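📚 Detailed documentation

Supported types

The supported types are:

verbatim-string: instructs the model to extract text that appears verbatim in the input.
string: a generic string field that may contain paraphrased or abstracted content.
integer: a whole number.
number: a whole or decimal number.
date-time: a date in ISO format.
An array of any of the above types (for example ["string"])
enum: a choice from a set of possible answers (written in the template as an array of options, for example ["yes", "no", "maybe"]).
multi-label: an enum that can have several answers (written in the template as a double-wrapped array, for example [["A", "B", "C"]]).

If the model finds no relevant information for a field, it returns null, or [] for arrays and multi-labels.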
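
To make the notation above concrete, the following sketch walks a template and reports which kind each field is. The function `field_kinds` is a hypothetical helper written for this illustration, not part of NuExtract. It assumes that a one-element array holding a type name means "array of that type", while any other array of strings is an enum.

```python
import json

SCALAR_TYPES = {"verbatim-string", "string", "integer", "number", "date-time"}


def field_kinds(template: dict, prefix: str = "") -> dict:
    """Map each field path in a template to the kind of value it describes."""
    kinds = {}
    for key, spec in template.items():
        path = f"{prefix}{key}"
        if isinstance(spec, dict):
            # Nested object: recurse into its fields.
            kinds.update(field_kinds(spec, prefix=path + "."))
        elif isinstance(spec, str) and spec in SCALAR_TYPES:
            kinds[path] = spec
        elif isinstance(spec, list) and len(spec) == 1 and isinstance(spec[0], list):
            kinds[path] = "multi-label"  # double-wrapped array
        elif isinstance(spec, list) and len(spec) == 1 and isinstance(spec[0], dict):
            kinds[path] = "array of objects"
            kinds.update(field_kinds(spec[0], prefix=path + "[]."))
        elif isinstance(spec, list) and len(spec) == 1 and spec[0] in SCALAR_TYPES:
            kinds[path] = f"array of {spec[0]}"
        elif isinstance(spec, list) and spec and all(isinstance(o, str) for o in spec):
            kinds[path] = "enum"  # array of options
        else:
            raise ValueError(f"unsupported template entry at {path!r}: {spec!r}")
    return kinds


template = json.loads(
    '{"names": ["string"], "paid": ["yes", "no", "maybe"], '
    '"tags": [["A", "B", "C"]], "total": "number"}'
)
print(field_kinds(template))
# {'names': 'array of string', 'paid': 'enum', 'tags': 'multi-label', 'total': 'number'}
```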

Model versions

Model size	Model name	Base model	License	Hugging Face link
2B	NuExtract-2.0-2B	Qwen2-VL-2B-Instruct	MIT	NuExtract-2.0-2B
4B	NuExtract-2.0-4B	Qwen2.5-VL-3B-Instruct	Qwen Research License	NuExtract-2.0-4B
8B	NuExtract-2.0-8B	Qwen2.5-VL-7B-Instruct	MIT	NuExtract-2.0-8B

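Notes

NuExtract-2.0-2B is based on Qwen2-VL rather than Qwen2.5-VL because the smallest Qwen2.5-VL model (3B) has a more restrictive, non-commercial license. We therefore offer NuExtract-2.0-2B as the small model option that can be used commercially.
We recommend setting the temperature to 0 or very close to 0 when using NuExtract. Some inference frameworks, such as Ollama, default to a temperature of 0.7, which is poorly suited to many extraction tasks.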
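
As a hedged illustration of the temperature advice, the sketch below builds a request body for Ollama's chat endpoint with the temperature set to 0 through the `options` field. The model tag `nuextract-placeholder` is made up for this example, and the sketch only shows where the temperature is set; it does not cover how a NuExtract template would be passed through Ollama.

```python
import json


def ollama_chat_payload(model: str, prompt: str) -> dict:
    """Build an Ollama /api/chat request body that forces temperature 0."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Overrides the framework default, which is unsuited to extraction.
        "options": {"temperature": 0},
    }


# "nuextract-placeholder" is a hypothetical model tag, not a published one.
payload = ollama_chat_payload(
    "nuextract-placeholder", "Yesterday I went shopping at Bunnings"
)
print(json.dumps(payload, indent=2))
```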

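🔧 Technical details

Fine-tuning

A fine-tuning tutorial notebook is available in the cookbooks folder of the GitHub repository.

vLLM deployment

Run the following command to serve an OpenAI-compatible API: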

vllm serve numind/NuExtract-2.0-8B --trust_remote_code --limit-mm-per-prompt image=6 --chat-template-content-format openai

If you run into memory issues, set --max-model-len accordingly.

Example request to the model:

import json
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user", 
            "content": [{"type": "text", "text": "Yesterday I went shopping at Bunnings"}],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
        },
    }
)
print("Chat response:", chat_response)
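
The request above prints the whole response object. To work with the extracted fields, the content string usually needs to be parsed as JSON. The sketch below shows one way to do that. `parse_extraction` is a hypothetical helper, and the stand-in response with its content string is made up so the example runs without a server; it is not real model output.

```python
import json
from types import SimpleNamespace


def parse_extraction(chat_response):
    """Return the extracted fields as a dict, or None if the content is not valid JSON."""
    content = chat_response.choices[0].message.content
    try:
        return json.loads(content)
    except (TypeError, json.JSONDecodeError):
        return None


# Stand-in object with the same attribute path as the client's response.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content='{"store": "Bunnings"}'))]
)
print(parse_extraction(fake_response))
# {'store': 'Bunnings'}
```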

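For image inputs, structure the request as shown below. Make sure the images in "content" are ordered as they appear in the prompt (that is, any in-context example images come before the main input).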

import base64

def encode_image(image_path):
    """
    Encode the image file to base64 string
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("0.jpg")
base64_image2 = encode_image("1.jpg")  # not used in the request below

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user", 
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}, # input image (any ICL example images would be listed before it)
            ],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
        },
    }
)
print("Chat response:", chat_response)
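
The ordering rule for images can be sketched as a small helper that always places in-context example images before the main input image. `image_part` and `build_content` are hypothetical names for this illustration, and the byte strings stand in for real image files.

```python
import base64


def image_part(image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def build_content(example_images, input_image) -> list:
    """Order the parts as they appear in the prompt: examples first, main input last."""
    return [image_part(img) for img in example_images] + [image_part(input_image)]


# Placeholder bytes instead of real image files, so the sketch runs anywhere.
content = build_content([b"example-1", b"example-2"], b"main-input")
print(len(content))
```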