🚀 NuExtract 2.0 4B
NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual. We provide several versions of different sizes, all based on pre-trained models from the QwenVL family.
🚀 Quick Start
To use the model, provide an input text or image along with a JSON template describing the information you want to extract. The template should be a JSON object specifying field names and their expected types.
Supported types include:
- `verbatim-string` - instructs the model to extract text that appears verbatim in the input.
- `string` - a generic string field that can incorporate paraphrased or abstracted content.
- `integer` - a whole number.
- `number` - a whole or decimal number.
- `date-time` - an ISO-formatted date.
- An array of any of the above types (e.g. `["string"]`).
- `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
- `multi-label` - an enum that can have multiple answers (represented in the template as a doubly wrapped array, e.g. `[["A", "B", "C"]]`).
If the model does not identify relevant information for a field, it returns `null` or `[]` (for arrays and multi-labels).
Here is an example template:
```json
{
    "first_name": "verbatim-string",
    "last_name": "verbatim-string",
    "description": "string",
    "age": "integer",
    "gpa": "number",
    "birth_date": "date-time",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}
```
An example output:
```json
{
    "first_name": "Susan",
    "last_name": "Smith",
    "description": "A student studying computer science.",
    "age": 20,
    "gpa": 3.7,
    "birth_date": "2005-03-01",
    "nationality": "England",
    "languages_spoken": ["English", "French"]
}
```
⚠️ Important Notes
We recommend running NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default temperature of 0.7, which is poorly suited to many extraction tasks.
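For instance, when calling a local Ollama server you can pin the temperature through the request options. A minimal sketch using Ollama's standard REST API; the model tag `nuextract` is a placeholder for whichever tag you have pulled:
```python
import requests

# Pin temperature to 0 in the request options, overriding Ollama's 0.7 default.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nuextract",      # placeholder tag
        "prompt": "...",           # your rendered template + document prompt
        "stream": False,
        "options": {"temperature": 0},
    },
)
print(response.json()["response"])
```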
✨ Key Features
- Multimodal support: accepts both text and image inputs.
- Multilingual: capable of processing multiple languages.
- Multiple model sizes: available in 2B, 4B, and 8B variants.
📦 Installation
Load the model with the `transformers` library:
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "numind/NuExtract-2.0-2B"
# model_name = "numind/NuExtract-2.0-8B"

model = AutoModelForVision2Seq.from_pretrained(model_name,
                                               trust_remote_code=True,
                                               torch_dtype=torch.bfloat16,
                                               attn_implementation="flash_attention_2",
                                               device_map="auto")
processor = AutoProcessor.from_pretrained(model_name,
                                          trust_remote_code=True,
                                          padding_side='left',
                                          use_fast=True)

# You can set min_pixels and max_pixels according to your needs, e.g. a token range
# of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_name, min_pixels=min_pixels, max_pixels=max_pixels)
```
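The usage examples below call a helper, `process_all_vision_info`, that gathers every image referenced by the chat messages and any in-context examples, in the order they appear in the prompt. A minimal single-input sketch, assuming the `qwen_vl_utils` package from the Qwen-VL ecosystem is installed (the full helper in the original model card also handles batched lists of messages and examples):
```python
from qwen_vl_utils import fetch_image

def process_all_vision_info(messages, examples=None):
    """Collect all images from in-context examples and messages, in prompt order."""
    images = []
    # In-context example images come first, since they appear earlier in the prompt.
    for example in (examples or []):
        example_input = example.get("input")
        if isinstance(example_input, dict) and example_input.get("type") == "image":
            images.append(fetch_image(example_input))
    # Then any images contained in the user message content.
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for item in content:
                if isinstance(item, dict) and item.get("type") == "image":
                    images.append(fetch_image(item))
    return images or None
```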
💻 Usage Examples
Basic Usage
For example, to extract names from a text document:
```python
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."

# Prepare the user message content
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,  # specify the template here
    tokenize=False,
    add_generation_prompt=True,
)
print(text)
"""<|im_start|>user
# Template:
{"names": ["string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|>
<|im_start|>assistant"""

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# We choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["John", "Mary", "James"]}']
```
Advanced Usage
In-Context Examples
Sometimes the model may underperform because the task is challenging or somewhat ambiguous, or we may want it to follow a particular format or give it a bit more guidance. In such cases, providing "in-context examples" can help NuExtract understand the task better.
```python
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
    }
]

messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    examples=examples,  # provide the examples here
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages, examples)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# We choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']
```
Image Inputs
If you want to provide image inputs to NuExtract, pass a dictionary specifying the desired image file as the message content instead of a string (e.g. `{"type": "image", "image": "file://image.jpg"}`).
```python
template = """{"store": "verbatim-string"}"""
document = {"type": "image", "image": "file://1.jpg"}

messages = [{"role": "user", "content": [document]}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"store": "Trader Joe\'s"}']
```
Batch Inference
```python
inputs = [
    # Image input without in-context examples
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
    },
    # Image input with 1 in-context example
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": {"type": "image", "image": "file://1.jpg"},
                "output": """{"store_name": "Trader Joe's"}""",
            }
        ],
    },
    # Text input without in-context examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
    },
    # Text input with in-context examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"]}"""
            }
        ],
    },
]

# messages should be a list of lists for batch processing
messages = [
    [
        {
            "role": "user",
            "content": [x['document']],
        }
    ]
    for x in inputs
]

# Apply the chat template to each example individually
texts = [
    processor.tokenizer.apply_chat_template(
        messages[i],  # now this is a list containing a single message
        template=x['template'],
        examples=x.get('examples', None),
        tokenize=False,
        add_generation_prompt=True)
    for i, x in enumerate(inputs)
]

image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Batch inference
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for y in output_texts:
    print(y)
# {"store_name": "WAL-MART"}
# {"store_name": "Walmart"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"]}
```
Template Generation
If you want to convert schema files in other existing formats (e.g. XML, YAML) or start from examples, NuExtract 2.0 models can automatically generate a template for you.
```python
# Convert XML into a NuExtract template
xml_template = """<SportResult>
    <Date></Date>
    <Sport></Sport>
    <Venue></Venue>
    <HomeTeam></HomeTeam>
    <AwayTeam></AwayTeam>
    <HomeScore></HomeScore>
    <AwayScore></AwayScore>
    <TopScorer></TopScorer>
</SportResult>"""

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": xml_template}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
#     "Date": "date-time",
#     "Sport": "verbatim-string",
#     "Venue": "verbatim-string",
#     "HomeTeam": "verbatim-string",
#     "AwayTeam": "verbatim-string",
#     "HomeScore": "integer",
#     "AwayScore": "integer",
#     "TopScorer": "verbatim-string"
# }
```
```python
# Generate a template from a natural language description
description = "I would like to extract important details from the contract."

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": description}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
#     "Contract": {
#         "Title": "verbatim-string",
#         "Description": "verbatim-string",
#         "Terms": [
#             {
#                 "Term": "verbatim-string",
#                 "Description": "verbatim-string"
#             }
#         ],
#         "Date": "date-time",
#         "Signatory": "verbatim-string"
#     }
# }
```
📚 Documentation
Model Versions

| Model Size | Model Name | Base Model | License | Hugging Face Link |
|---|---|---|---|---|
| 2B | NuExtract-2.0-2B | Qwen2-VL-2B-Instruct | MIT | NuExtract-2.0-2B |
| 4B | NuExtract-2.0-4B | Qwen2.5-VL-3B-Instruct | Qwen Research License | NuExtract-2.0-4B |
| 8B | NuExtract-2.0-8B | Qwen2.5-VL-7B-Instruct | MIT | NuExtract-2.0-8B |
Benchmark
Performance on a collection of roughly 1,000 diverse extraction examples containing both text and image inputs.
🔧 Technical Details
Fine-Tuning
You can find a fine-tuning tutorial notebook in the cookbooks folder of the GitHub repository.
vLLM Deployment
Run the following command to serve an OpenAI-compatible API:
```bash
vllm serve numind/NuExtract-2.0-8B --trust_remote_code --limit-mm-per-prompt image=6 --chat-template-content-format openai
```
If you run into memory issues, set `--max-model-len` accordingly.
Here is example code to send a request to the model:
```python
import json
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [{"type": "text", "text": "Yesterday I went shopping at Bunnings"}],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
        },
    }
)
print("Chat response:", chat_response)
```
For image inputs, the request structure is shown below. Make sure to order the images in `"content"` as they appear in the prompt (i.e. any in-context examples before the main input).
```python
import base64

def encode_image(image_path):
    """
    Encode an image file as a base64 string.
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("0.jpg")
base64_image2 = encode_image("1.jpg")

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},  # first in-context example image
                # further images...
            ]
        }
    ]
)
```
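As with the text-only request, the extraction template goes through `extra_body`; a sketch assuming the same `chat_template_kwargs` mechanism applies to image requests:
```python
chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
            ],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": """{"store": "verbatim-string"}""",
        },
    },
)
```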
📄 License
License information for this project:

| Model Name | License |
|---|---|
| NuExtract-2.0-2B | MIT |
| NuExtract-2.0-4B | Qwen Research License |
| NuExtract-2.0-8B | MIT |

Note that `NuExtract-2.0-2B` is based on Qwen2-VL rather than Qwen2.5-VL, because the smallest Qwen2.5-VL model (3B) carries a more restrictive, non-commercial license. We therefore offer `NuExtract-2.0-2B` as the small-model option that can be used commercially.








