Vintern-3B-R-beta: Open-Source Multimodal Model, Free to Deploy, for Complex Image-Based Reasoning

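Vintern 3B R Beta

Developed by 5CD-AI

Vintern-3B-R-beta is a multimodal large language model focused on complex image-based reasoning. It breaks its reasoning into steps while keeping hallucination under control.

Image-to-Text

Transformers

Multilingual; License: MIT; Tags: #multimodal-reasoning #vietnamese-ocr #structured-document-parsing

Downloads: 1,841

Released: 3/19/2025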

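Model Overview

The model combines vision and language processing. It is strongest on structured document images and multi-step question answering, and supports Vietnamese, English, and Chinese.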

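Model Features

Complex reasoning

Performs long-chain reasoning over images, breaking each reasoning step into sub-steps

Multilingual support

Handles Vietnamese, English, and Chinese

Hallucination control

Keeps hallucination in check during reasoning

Multimodal processing

Combines vision and language processing to handle structured document images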

Model Capabilities

Image understanding

Complex reasoning

Multilingual text generation

Structured document processing

OCR text extraction

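Use Cases

Food service

Menu price analysis

Extracts dish information from a restaurant menu image and compares prices

Result: correctly identifies the most expensive dish

Government document processing

Official document text extraction

Extracts the full text from an image of a government document

Result: accurately extracts the content of Vietnamese government documents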

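🚀 Vintern Reasoning Model

The Vintern reasoning model combines a multimodal large language model with a reasoning model. It can carry out long, complex reasoning over images, breaking each reasoning step into sub-steps while keeping hallucination under control. It performs well on a range of benchmarks and supports Vietnamese OCR and complex problem solving.

🚀 Quick Start

The snippet below shows how to load the tokenizer and model and how to generate a response. To run inference with the model, follow the steps outlined in our Colab inference notebook.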

import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    # aspect ratio of the input image
    aspect_ratio = orig_width / orig_height

    # enumerate candidate (columns, rows) tile grids whose tile count lies in [min_num, max_num]
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-3B-R-beta",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    use_flash_attn=False,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-3B-R-beta", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'

think_prompt_format = """<image>\nBạn là người rất cẩn thận và đa nghi, vui lòng trả lời câu hỏi dưới đây bằng tiếng Việt. Khi suy luận bạn thường liệt kê ra các bằng chứng để chỉ ra các đáp án khả thi, suy luận và giải thích tại sao lại lựa chọn và loại bỏ trước khi đưa ra câu trả lời cuối cùng.
Câu hỏi:
{question_input}
Hãy trả lời rất dài theo định dạng sau:
<SUMMARY>...</SUMMARY>
<CAPTION>...</CAPTION>
<INFORMATION_EXTRACT>...</INFORMATION_EXTRACT>
<EXTERNAL_KNOWLEDGE_EXPANSION>...</EXTERNAL_KNOWLEDGE_EXPANSION>
<FIND_CANDIDATES_REASONING>...</FIND_CANDIDATES_REASONING>
<TOP3_CANDIDATES>...</TOP3_CANDIDATES>
<REASONING_PLAN>...</REASONING_PLAN>
<REASONING>...</REASONING>
<COUNTER_ARGUMENTS>...</COUNTER_ARGUMENTS>
<VALIDATION_REASONING>...</VALIDATION_REASONING>
<CONCLUSION>...</CONCLUSION>
"""

pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)

question = '<image>\nTrích xuất thông tin chính trong ảnh và trả về dạng markdown.'

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

#question = "Câu hỏi khác ......"
#response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
#print(f'User: {question}\nAssistant: {response}')
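
The preprocessing above splits each image into a grid of 448×448 tiles and, when more than one tile is produced, appends a thumbnail of the whole image. As a rough illustration of how many tiles a given image yields, the sketch below repeats the grid-selection logic without loading any image. `candidate_grids` and `plan_tiles` are hypothetical helper names introduced here for illustration; `find_closest_aspect_ratio` is copied from the script above.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids with min_num <= columns * rows <= max_num, fewest tiles first."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    return sorted(grids, key=lambda g: g[0] * g[1])


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Same logic as in the quick-start script: the closest aspect ratio wins,
    # and on a tie a larger grid is preferred only if the image is big enough.
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def plan_tiles(width, height, min_num=1, max_num=12, image_size=448, use_thumbnail=True):
    """Return the chosen (columns, rows) grid and the total number of tiles, thumbnail included."""
    grids = candidate_grids(min_num, max_num)
    cols, rows = find_closest_aspect_ratio(width / height, grids, width, height, image_size)
    tiles = cols * rows
    if use_thumbnail and tiles != 1:
        tiles += 1  # the thumbnail is appended only when the image was actually split
    return (cols, rows), tiles


print(plan_tiles(1600, 800, max_num=6))  # ((2, 1), 3)
print(plan_tiles(800, 800, max_num=6))   # ((2, 2), 5)
print(plan_tiles(300, 300, max_num=6))   # ((1, 1), 1)
```

With `max_num=6`, as in the `load_image` call above, the tensor passed to `model.chat` has shape `(tiles, 3, 448, 448)`, so a 1600×800 image produces 3 tiles in total.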

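✨ Key Features

Multimodal reasoning: carries out long, complex reasoning over images, breaking each reasoning step into sub-steps while keeping hallucination under control.
Stronger performance: although balancing multiple tasks with reasoning is difficult, Vintern-3B-R-beta outperforms all previous versions across a range of benchmarks.
Different versions suit different scenarios:
- Vintern-1B-v3_5: fast ⚡, well suited to Vietnamese OCR on simply formatted text, with high reliability ✅.
- Vintern-3B-R-beta: better suited to complex questions and document images with complex structure 🔍📚. Because training focused on reasoning, OCR performance on blurry or unclear text may be slightly reduced 🔍🤖.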

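📚 Detailed Documentation

Example 1

<SUMMARY>The question asks which dish is the most expensive among all dishes listed on a restaurant's menu. To answer it, the price of every dish on the menu must be analyzed and compared to find the highest-priced one.</SUMMARY>

<CAPTION>A photo of part of a restaurant menu, showing a list of dishes with their prices. The menu is laid out in two columns, each listing different dishes and their corresponding prices.</CAPTION>

<FIND_CANDIDATES_REASONING>To find the most expensive dish, we need to carry out the following steps:

1. **Identify the list of dishes**: the menu provides a complete list of dishes, including: hot-and-sour stir-fried frog, fried frog with salted fish, chili-grilled chicken, fried chicken with salted fish, hot-and-sour stir-fried sea snail, hot-and-sour stir-fried clams, garlic stir-fried clams, hot-and-sour stir-fried whelk, garlic stir-fried whelk, crab cake with mustard greens, three-shred stir-fried fish maw, grilled pig tail, hot-and-sour stir-fried wild boar, wild boar stir-fried with coriander, hot-and-sour stir-fried buffalo, buffalo stir-fried with coriander, grilled basa fish with cabbage (two fish), grilled basa fish with green pepper (two fish), chicken hotpot with ginger leaves, frog hotpot with ginger leaves, cuttlefish hotpot, beef hotpot.

2. **Identify the price of each dish**: the price of each dish is clearly marked next to its name. For example, hot-and-sour stir-fried frog costs 55k, fried frog with salted fish costs 55k, chili-grilled chicken costs 50k, and so on.

3. **Compare the prices**: compare the price of every dish to find the most expensive one. This must be done carefully so that no dish is missed.</FIND_CANDIDATES_REASONING>

<TOP3_CANDIDATES>
1. Grilled basa fish with cabbage (two fish) 65k
2. Grilled basa fish with green pepper (two fish) 65k
3. Beef hotpot 69k
</TOP3_CANDIDATES>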

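<REASONING>To determine the most expensive dish, we need to compare the prices of all dishes on the menu. Based on the information in the image, we can list the prices as follows:

| Dish | Price (VND) |
|-------------------------------|--------|
| Hot-and-sour stir-fried frog | 55k |
| Fried frog with salted fish | 55k |
| Chili-grilled chicken | 50k |
| Fried chicken with salted fish | 50k |
| Hot-and-sour stir-fried sea snail | 45k |
| Hot-and-sour stir-fried clams | 49k |
| Garlic stir-fried clams | 49k |
| Hot-and-sour stir-fried whelk | 59k |
| Garlic stir-fried whelk | 59k |
| Crab cake with mustard greens | 55k |
| Three-shred stir-fried fish maw | 55k |
| Grilled pig tail | 50k |
| Hot-and-sour stir-fried wild boar | 55k |
| Wild boar stir-fried with coriander | 55k |
| Hot-and-sour stir-fried buffalo | 55k |
| Buffalo stir-fried with coriander | 55k |
| Grilled basa fish with cabbage (two fish) | 65k |
| Grilled basa fish with green pepper (two fish) | 65k |
| Chicken hotpot with ginger leaves | 59k |
| Frog hotpot with ginger leaves | 65k |
| Cuttlefish hotpot | 65k |
| Beef hotpot | 69k |

After comparison, the most expensive dish is the beef hotpot, at 69k.</REASONING>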

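<COUNTER_ARGUMENTS>An opponent might raise the following arguments:

- **Some dishes are cheaper**: some dishes have lower prices, for example hot-and-sour stir-fried frog, fried frog with salted fish, chili-grilled chicken, fried chicken with salted fish, hot-and-sour stir-fried sea snail, hot-and-sour stir-fried clams, garlic stir-fried clams, hot-and-sour stir-fried whelk, garlic stir-fried whelk, crab cake with mustard greens, three-shred stir-fried fish maw, grilled pig tail, hot-and-sour stir-fried wild boar, wild boar stir-fried with coriander, hot-and-sour stir-fried buffalo, buffalo stir-fried with coriander, grilled basa fish with cabbage (two fish), grilled basa fish with green pepper (two fish), chicken hotpot with ginger leaves, frog hotpot with ginger leaves, cuttlefish hotpot.

- **Prices may change**: the price of a dish may vary with time and place. Comparing prices based on the information in the image may not accurately reflect actual prices.</COUNTER_ARGUMENTS>

<REASONING>The counter-arguments focus mainly on comparing the cheaper dishes. However, the goal of this question is to determine the most expensive dish, not to compare the prices of all dishes. We have already compared the prices of all dishes and found that the most expensive dish is the beef hotpot, at 69k.</REASONING>

<CONCLUSION>Beef hotpot</CONCLUSION>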
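
The sample above follows the tag layout requested by `think_prompt_format` in the quick-start script. As a minimal sketch, assuming the model returns well-formed `<TAG>...</TAG>` pairs (which is not guaranteed), the code below fills a template with a question and splits a response into its sections. `TEMPLATE` is a shortened stand-in for `think_prompt_format`, and `build_prompt`, `split_sections`, and `final_answer` are hypothetical helper names, not part of the model's API.

```python
import re

# Shortened stand-in for `think_prompt_format`; only the {question_input}
# placeholder matters for this illustration.
TEMPLATE = (
    "<image>\nCâu hỏi:\n{question_input}\n"
    "<SUMMARY>...</SUMMARY>\n<CONCLUSION>...</CONCLUSION>\n"
)

# Matches one <TAG>...</TAG> pair; the backreference requires the same closing tag.
SECTION_RE = re.compile(r"<([A-Z0-9_]+)>(.*?)</\1>", re.DOTALL)


def build_prompt(template, question):
    """Insert the user's question into the reasoning template."""
    return template.format(question_input=question)


def split_sections(response):
    """Return (tag, text) pairs in order of appearance; a tag may occur more than once."""
    return [(tag, body.strip()) for tag, body in SECTION_RE.findall(response)]


def final_answer(response):
    """Return the text of the last <CONCLUSION> section, or None if there is none."""
    for tag, body in reversed(split_sections(response)):
        if tag == "CONCLUSION":
            return body
    return None


# Hand-written sample shaped like the output above (not real model output).
sample = (
    "<SUMMARY>Find the most expensive dish.</SUMMARY>\n"
    "<REASONING>Beef hotpot costs 69k.</REASONING>\n"
    "<REASONING>No other dish costs more.</REASONING>\n"
    "<CONCLUSION>Beef hotpot</CONCLUSION>"
)

prompt = build_prompt(TEMPLATE, "Món nào có giá cao nhất?")
print([tag for tag, _ in split_sections(sample)])
print(final_answer(sample))
```

In the quick-start script, the filled prompt would take the place of `question` in the `model.chat` call. Returning a list of pairs rather than a dictionary keeps repeated sections such as the two `<REASONING>` blocks in the sample above.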

[Image: Example 1]

Example 2

[Image: Example 2]

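User: <image>
List all the text.
Assistant: The Government
No.: 626/QĐ-TTg
Socialist Republic of Vietnam
Independence - Freedom - Happiness
Hanoi, July 29, 2002
Decision of the Prime Minister
Approving the feasibility study report of the investment project
for construction of the diammonium phosphate (DAP) fertilizer plant in the Đình Vũ Economic Zone, Hai Phong City
The Prime Minister
Pursuant to the Law on Organization of the Government of December 25, 2001;
Pursuant to the Government's Decree No. 52/1999/NĐ-CP of July 8, 1999 (Regulation on Investment and Construction Management) and the Government's Decree No. 12/2000/ND-CP of May 5, 2000 (amending and supplementing a number of articles of the Regulation on Investment and Construction Management).
Considering the proposal of the Vietnam Chemical Corporation (Official Letter No. 916/CV-HĐQT of November 1, 2001) and the appraisal opinion of the Ministry of Planning and Investment (Official Letter No. 1944/BKH/VPTD of April 1, 2002) on approving the feasibility study report of the investment project for the DAP fertilizer plant in the Đình Vũ Economic Zone, Hai Phong City, and the opinions of the ministries and agencies at the meeting of June 19, 2002,
Decides:
Article 1. To approve the feasibility study report of the investment project for the DAP fertilizer plant in the Đình Vũ Economic Zone, Hai Phong City, with the following main contents:
1. Project name: DAP fertilizer plant in the Đình Vũ Economic Zone, Hai Phong City.
2. Investment objective: to help ensure a stable and proactive supply of DAP fertilizer in support of agricultural development, reduce imports, and make more effective use of domestic apatite resources.
3. Investor: Vietnam Chemical Corporation.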

📄 License

This project is released under the MIT License.

📚 Citation

@misc{doan2024vintern1befficientmultimodallarge,
      title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese}, 
      author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
      year={2024},
      eprint={2408.12480},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.12480}, 
}

📚 References

[1] Z. Chen et al., ‘Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling’, arXiv preprint arXiv:2412.05271, 2024.