🚀 NuExtract 2.0 4B
NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual. We provide several versions of different sizes, all based on pre-trained models from the QwenVL family.
🚀 Quick Start
To use the model, provide an input text or image along with a JSON template describing the information you want to extract. The template should be a JSON object specifying field names and their expected types.
Supported types include:
- `verbatim-string` - instructs the model to extract text that appears verbatim in the input.
- `string` - a generic string field that can incorporate paraphrased or abstracted content.
- `integer` - a whole number.
- `number` - a whole or decimal number.
- `date-time` - an ISO-formatted date.
- An array of any of the above types (e.g. `["string"]`).
- `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
- `multi-label` - an enum that can have multiple answers (represented in the template as a doubly wrapped array, e.g. `[["A", "B", "C"]]`).
If the model does not identify relevant information for a field, it returns `null` or `[]` (for arrays and multi-labels).
Here is an example template:
```json
{
    "first_name": "verbatim-string",
    "last_name": "verbatim-string",
    "description": "string",
    "age": "integer",
    "gpa": "number",
    "birth_date": "date-time",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}
```
An example output:
```json
{
    "first_name": "Susan",
    "last_name": "Smith",
    "description": "A student studying computer science.",
    "age": 20,
    "gpa": 3.7,
    "birth_date": "2005-03-01",
    "nationality": "England",
    "languages_spoken": ["English", "French"]
}
```
⚠️ Important Notes
We recommend running NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default temperature of 0.7, which is poorly suited to many extraction tasks.
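For instance, when calling a local Ollama server you can pin the temperature through the request options. A minimal sketch using Ollama's standard REST API; the model tag `nuextract` is a placeholder for whichever tag you have pulled:
```python
import requests

# Pin temperature to 0 in the request options, overriding Ollama's 0.7 default.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nuextract",      # placeholder tag
        "prompt": "...",           # your rendered template + document prompt
        "stream": False,
        "options": {"temperature": 0},
    },
)
print(response.json()["response"])
```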
✨ Key Features
- Multimodal support: accepts both text and image inputs.
- Multilingual: capable of processing multiple languages.
- Multiple model sizes: available in 2B, 4B, and 8B variants.
📦 Installation
Load the model with the `transformers` library:
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "numind/NuExtract-2.0-2B"
# model_name = "numind/NuExtract-2.0-8B"

model = AutoModelForVision2Seq.from_pretrained(model_name,
                                               trust_remote_code=True,
                                               torch_dtype=torch.bfloat16,
                                               attn_implementation="flash_attention_2",
                                               device_map="auto")
processor = AutoProcessor.from_pretrained(model_name,
                                          trust_remote_code=True,
                                          padding_side='left',
                                          use_fast=True)

# You can set min_pixels and max_pixels according to your needs, e.g. a token range
# of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_name, min_pixels=min_pixels, max_pixels=max_pixels)
```
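The usage examples below call a helper, `process_all_vision_info`, that gathers every image referenced by the chat messages and any in-context examples, in the order they appear in the prompt. A minimal single-input sketch, assuming the `qwen_vl_utils` package from the Qwen-VL ecosystem is installed (the full helper in the original model card also handles batched lists of messages and examples):
```python
from qwen_vl_utils import fetch_image

def process_all_vision_info(messages, examples=None):
    """Collect all images from in-context examples and messages, in prompt order."""
    images = []
    # In-context example images come first, since they appear earlier in the prompt.
    for example in (examples or []):
        example_input = example.get("input")
        if isinstance(example_input, dict) and example_input.get("type") == "image":
            images.append(fetch_image(example_input))
    # Then any images contained in the user message content.
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for item in content:
                if isinstance(item, dict) and item.get("type") == "image":
                    images.append(fetch_image(item))
    return images or None
```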
💻 Usage Examples
Basic Usage
For example, to extract names from a text document:
```python
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."

# Prepare the user message content
messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,  # specify the template here
    tokenize=False,
    add_generation_prompt=True,
)
print(text)
"""<|im_start|>user
# Template:
{"names": ["string"]}
# Context:
John went to the restaurant with Mary. James went to the cinema.<|im_end|>
<|im_start|>assistant"""

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# We choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["John", "Mary", "James"]}']
```
Advanced Usage
In-Context Examples
Sometimes the model may underperform because the task is challenging or somewhat ambiguous, or we may want it to follow a particular format or give it a bit more guidance. In such cases, providing "in-context examples" can help NuExtract understand the task better.
```python
template = """{"names": ["string"]}"""
document = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
    }
]

messages = [{"role": "user", "content": document}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    examples=examples,  # provide the examples here
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages, examples)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

# We choose greedy sampling here, which works well for most information extraction tasks
generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']
```
Image Inputs
If you want to provide image inputs to NuExtract, pass a dictionary specifying the desired image file as the message content instead of a string (e.g. `{"type": "image", "image": "file://image.jpg"}`).
```python
template = """{"store": "verbatim-string"}"""
document = {"type": "image", "image": "file://1.jpg"}

messages = [{"role": "user", "content": [document]}]
text = processor.tokenizer.apply_chat_template(
    messages,
    template=template,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Inference: generate the output
generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
# ['{"store": "Trader Joe\'s"}']
```
Batch Inference
```python
inputs = [
    # Image input without in-context examples
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
    },
    # Image input with 1 in-context example
    {
        "document": {"type": "image", "image": "file://0.jpg"},
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": {"type": "image", "image": "file://1.jpg"},
                "output": """{"store_name": "Trader Joe's"}""",
            }
        ],
    },
    # Text input without in-context examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
    },
    # Text input with in-context examples
    {
        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
        "template": """{"names": ["string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"]}"""
            }
        ],
    },
]

# messages should be a list of lists for batch processing
messages = [
    [
        {
            "role": "user",
            "content": [x['document']],
        }
    ]
    for x in inputs
]

# Apply the chat template to each example individually
texts = [
    processor.tokenizer.apply_chat_template(
        messages[i],  # now this is a list containing a single message
        template=x['template'],
        examples=x.get('examples', None),
        tokenize=False,
        add_generation_prompt=True)
    for i, x in enumerate(inputs)
]

image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
inputs = processor(
    text=texts,
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

# Batch inference
generated_ids = model.generate(**inputs, **generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
for y in output_texts:
    print(y)
# {"store_name": "WAL-MART"}
# {"store_name": "Walmart"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"]}
```
Template Generation
If you want to convert schema files in other existing formats (e.g. XML, YAML) or start from examples, NuExtract 2.0 models can automatically generate a template for you.
```python
# Convert XML into a NuExtract template
xml_template = """<SportResult>
    <Date></Date>
    <Sport></Sport>
    <Venue></Venue>
    <HomeTeam></HomeTeam>
    <AwayTeam></AwayTeam>
    <HomeScore></HomeScore>
    <AwayScore></AwayScore>
    <TopScorer></TopScorer>
</SportResult>"""

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": xml_template}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
#     "Date": "date-time",
#     "Sport": "verbatim-string",
#     "Venue": "verbatim-string",
#     "HomeTeam": "verbatim-string",
#     "AwayTeam": "verbatim-string",
#     "HomeScore": "integer",
#     "AwayScore": "integer",
#     "TopScorer": "verbatim-string"
# }
```
```python
# Generate a template from a natural language description
description = "I would like to extract important details from the contract."

messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": description}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)

image_inputs = process_all_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **inputs,
    **generation_config
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# {
#     "Contract": {
#         "Title": "verbatim-string",
#         "Description": "verbatim-string",
#         "Terms": [
#             {
#                 "Term": "verbatim-string",
#                 "Description": "verbatim-string"
#             }
#         ],
#         "Date": "date-time",
#         "Signatory": "verbatim-string"
#     }
# }
```
📚 Documentation
Model Versions

| Model Size | Model Name | Base Model | License | Hugging Face Link |
|---|---|---|---|---|
| 2B | NuExtract-2.0-2B | Qwen2-VL-2B-Instruct | MIT | NuExtract-2.0-2B |
| 4B | NuExtract-2.0-4B | Qwen2.5-VL-3B-Instruct | Qwen Research License | NuExtract-2.0-4B |
| 8B | NuExtract-2.0-8B | Qwen2.5-VL-7B-Instruct | MIT | NuExtract-2.0-8B |
Benchmark
Performance on a collection of roughly 1,000 diverse extraction examples containing both text and image inputs.
🔧 Technical Details
Fine-Tuning
You can find a fine-tuning tutorial notebook in the cookbooks folder of the GitHub repository.
vLLM Deployment
Run the following command to serve an OpenAI-compatible API:
```bash
vllm serve numind/NuExtract-2.0-8B --trust_remote_code --limit-mm-per-prompt image=6 --chat-template-content-format openai
```
If you run into memory issues, set `--max-model-len` accordingly.
Here is example code to send a request to the model:
```python
import json
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [{"type": "text", "text": "Yesterday I went shopping at Bunnings"}],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
        },
    }
)
print("Chat response:", chat_response)
```
For image inputs, the request structure is shown below. Make sure to order the images in `"content"` as they appear in the prompt (i.e. any in-context examples before the main input).
```python
import base64

def encode_image(image_path):
    """
    Encode an image file as a base64 string.
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

base64_image = encode_image("0.jpg")
base64_image2 = encode_image("1.jpg")

chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},  # first in-context example image
                # further images...
            ]
        }
    ]
)
```
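As with the text-only request, the extraction template goes through `extra_body`; a sketch assuming the same `chat_template_kwargs` mechanism applies to image requests:
```python
chat_response = client.chat.completions.create(
    model="numind/NuExtract-2.0-8B",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
            ],
        },
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": """{"store": "verbatim-string"}""",
        },
    },
)
```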
📄 License
License information for this project:

| Model Name | License |
|---|---|
| NuExtract-2.0-2B | MIT |
| NuExtract-2.0-4B | Qwen Research License |
| NuExtract-2.0-8B | MIT |

Note that `NuExtract-2.0-2B` is based on Qwen2-VL rather than Qwen2.5-VL, because the smallest Qwen2.5-VL model (3B) carries a more restrictive, non-commercial license. We therefore offer `NuExtract-2.0-2B` as the small-model option that can be used commercially.








