Vintern-3B-R-beta: an open-source multimodal model, free to deploy, for complex image-based reasoning


Vintern 3B R Beta

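Developed by 5CD-AI

Vintern-3B-R-beta is a multimodal large language model focused on complex image-based reasoning. It breaks its reasoning into explicit steps and keeps hallucination in check.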

Image-to-Text

Transformers

Multilingual | License: MIT | #MultimodalReasoning #VietnameseOCR #StructuredDocumentParsing

Downloads: 1,841

Released: 3/19/2025

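Model Overview

The model combines vision and language processing. It is suited to structured document images and to reasoning over complex questions, and it supports Vietnamese, English, and Chinese.

Model Features

Complex reasoning

Performs long-chain reasoning over images, breaking each reasoning step into sub-steps

Multilingual support

Handles Vietnamese, English, and Chinese

Hallucination control

Limits hallucination during the reasoning process

Multimodal processing

Combines vision and language processing to handle structured document images

Model Capabilities

Image understanding

Complex reasoning

Multilingual text generation

Structured document processing

OCR text extraction

Use Cases

Restaurant industry

Menu price analysis

Extracts dish information from a restaurant menu image and compares prices

Accurately identifies the most expensive dish

Government document processing

Official document text extraction

Extracts the full text from an image of a government document

Accurately extracts the content of Vietnamese government documents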

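🚀 Vintern Reasoning Model

The Vintern reasoning model combines a multimodal large language model with a reasoning model. It performs long, complex reasoning over images, breaks each reasoning step into sub-steps, and controls hallucination. It performs well on a range of benchmarks and gives strong support for Vietnamese OCR and complex problem solving.

🚀 Quick Start

The snippet below shows how to load the tokenizer and model and how to generate content. To run inference with the model, follow the steps outlined in our Colab inference notebook.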

import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-3B-R-beta",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    use_flash_attn=False,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-3B-R-beta", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'

think_prompt_format = """<image>\nBạn là người rất cẩn thận và đa nghi, vui lòng trả lời câu hỏi dưới đây bằng tiếng Việt. Khi suy luận bạn thường liệt kê ra các bằng chứng để chỉ ra các đáp án khả thi, suy luận và giải thích tại sao lại lựa chọn và loại bỏ trước khi đưa ra câu trả lời cuối cùng.
Câu hỏi:
{question_input}
Hãy trả lời rất dài theo định dạng sau:
<SUMMARY>...</SUMMARY>
<CAPTION>...</CAPTION>
<INFORMATION_EXTRACT>...</INFORMATION_EXTRACT>
<EXTERNAL_KNOWLEDGE_EXPANSION>...</EXTERNAL_KNOWLEDGE_EXPANSION>
<FIND_CANDIDATES_REASONING>...</FIND_CANDIDATES_REASONING>
<TOP3_CANDIDATES>...</TOP3_CANDIDATES>
<REASONING_PLAN>...</REASONING_PLAN>
<REASONING>...</REASONING>
<COUNTER_ARGUMENTS>...</COUNTER_ARGUMENTS>
<VALIDATION_REASONING>...</VALIDATION_REASONING>
<CONCLUSION>...</CONCLUSION>
"""

pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens= 1024, do_sample=False, num_beams = 3, repetition_penalty=2.5)

question = '<image>\nTrích xuất thông tin chính trong ảnh và trả về dạng markdown.'

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

#question = "Câu hỏi khác ......"
#response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
#print(f'User: {question}\nAssistant: {response}')
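
To make the tiling in `dynamic_preprocess` above concrete, here is a minimal pure-Python sketch that reproduces only its grid-selection step, without loading an image or the model. The helper names `choose_grid` and `tile_count` and the sample image sizes are illustrative assumptions, not part of the model card; the selection rule itself mirrors `find_closest_aspect_ratio`.

```python
def choose_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (columns, rows) tile grid chosen for an image of this size."""
    aspect_ratio = width / height
    # Every (columns, rows) pair whose tile count lies in [min_num, max_num],
    # visited from the smallest grid to the largest.
    grids = sorted(
        {(i, j)
         for i in range(1, max_num + 1)
         for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda g: g[0] * g[1],
    )
    best_diff, best = float("inf"), (1, 1)
    area = width * height
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # On a tie, prefer the larger grid when the image has enough pixels.
            best = (cols, rows)
    return best


def tile_count(width, height, max_num=12, use_thumbnail=True):
    """Number of image tiles produced, including the optional thumbnail."""
    cols, rows = choose_grid(width, height, max_num=max_num)
    tiles = cols * rows
    # The whole-image thumbnail is appended only when there is more than one tile.
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


for size in [(1000, 500), (800, 800), (300, 300)]:
    print(size, choose_grid(*size, max_num=6), tile_count(*size, max_num=6))
```

With `max_num=6`, as in the `load_image` call above, the first dimension of the returned `pixel_values` tensor should match this tile count.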

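✨ Key Features

Multimodal reasoning: performs long, complex reasoning over images, breaking each reasoning step into sub-steps while controlling hallucination.
Strong performance: although balancing multiple tasks with reasoning is difficult, Vintern-3B-R-beta outperforms all previous versions across benchmarks.
Different versions suit different scenarios:
- Vintern-1B-v3_5: fast ⚡, suited to Vietnamese OCR on simply formatted text, and highly reliable ✅.
- Vintern-3B-R-beta: better suited to complex questions and to document images with complex structure 🔍📚. Because training focused on reasoning, OCR performance on blurry or unclear text may be slightly reduced 🔍🤖.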
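
The step-by-step reasoning described above is driven by a tagged prompt. The Quick Start snippet defines `think_prompt_format` but never passes it to `model.chat`; one plausible way to use it (an assumption, not confirmed by the model card) is to fill `{question_input}` and send the result as the question. The sketch below uses a shortened stand-in template and a hard-coded sample response so that it runs without the model; `build_think_prompt` and `extract_sections` are hypothetical helper names.

```python
import re

# Shortened stand-in for think_prompt_format; the real template lists more tags.
THINK_TEMPLATE = (
    "<image>\n"
    "Câu hỏi:\n"
    "{question_input}\n"
    "Hãy trả lời rất dài theo định dạng sau:\n"
    "<SUMMARY>...</SUMMARY>\n"
    "<REASONING>...</REASONING>\n"
    "<CONCLUSION>...</CONCLUSION>\n"
)


def build_think_prompt(question: str) -> str:
    """Fill the question into the tagged reasoning template."""
    return THINK_TEMPLATE.format(question_input=question)


def extract_sections(response: str) -> dict:
    """Collect <TAG>...</TAG> blocks; a repeated tag keeps its last occurrence."""
    pattern = re.compile(r"<([A-Z0-9_]+)>(.*?)</\1>", re.DOTALL)
    return {tag: body.strip() for tag, body in pattern.findall(response)}


# Illustrative question: "Which dish on the menu is the most expensive?"
prompt = build_think_prompt("Món nào đắt nhất trong thực đơn?")
# With the model loaded, the call would presumably be:
# response = model.chat(tokenizer, pixel_values, prompt, generation_config)

# Hard-coded sample standing in for a model response.
sample = (
    "<SUMMARY>Find the priciest dish.</SUMMARY>\n"
    "<CONCLUSION>Beef hot pot</CONCLUSION>"
)
sections = extract_sections(sample)
print(sections["CONCLUSION"])
```

Parsing by tag keeps the final answer separate from the intermediate reasoning, which is convenient when only the `<CONCLUSION>` text is needed downstream.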



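📚 Detailed Documentation

Example 1

<SUMMARY>The question asks which dish is the most expensive of all the dishes listed on a restaurant menu. To answer it, the price of every dish on the menu has to be examined and compared to find the highest one.</SUMMARY>

<CAPTION>A photo of part of a restaurant menu, showing a list of dishes and their prices. The menu is laid out in two columns, each listing different dishes with their prices.</CAPTION>

<FIND_CANDIDATES_REASONING>To find the most expensive dish, we need to carry out the following steps:

1. **Identify the list of dishes**: The menu gives the full list of dishes: stir-fried sour-and-spicy frog, salted-fish fried frog, chili grilled chicken, salted-fish fried chicken, stir-fried sour-and-spicy sea snails, stir-fried sour-and-spicy clams, garlic stir-fried clams, stir-fried sour-and-spicy whelks, garlic stir-fried whelks, crab cake with mustard greens, three-shred stir-fried fish parcels, grilled pig's tail, stir-fried sour-and-spicy wild boar, wild boar stir-fried with cilantro, stir-fried sour-and-spicy buffalo, buffalo stir-fried with cilantro, grilled basa fish with cabbage (two fish), grilled basa fish with green pepper (two fish), chicken hot pot with ginger leaves, frog hot pot with ginger leaves, cuttlefish hot pot, beef hot pot.

2. **Identify the price of each dish**: The price of each dish is clearly marked next to its name. For example, stir-fried sour-and-spicy frog costs 55k, salted-fish fried frog costs 55k, chili grilled chicken costs 50k, and so on.

3. **Compare the prices**: Compare the price of every dish to find the most expensive one. This has to be done carefully so that no dish is missed.</FIND_CANDIDATES_REASONING>

<TOP3_CANDIDATES>
1. Grilled basa fish with cabbage (two fish) 65k
2. Grilled basa fish with green pepper (two fish) 65k
3. Beef hot pot 69k
</TOP3_CANDIDATES>

<REASONING>To determine the most expensive dish, we need to compare the prices of all the dishes on the menu. From the information in the image, we can draw up the following price table:

| Dish | Price (VND) |
|------|-------------|
| Stir-fried sour-and-spicy frog | 55k |
| Salted-fish fried frog | 55k |
| Chili grilled chicken | 50k |
| Salted-fish fried chicken | 50k |
| Stir-fried sour-and-spicy sea snails | 45k |
| Stir-fried sour-and-spicy clams | 49k |
| Garlic stir-fried clams | 49k |
| Stir-fried sour-and-spicy whelks | 59k |
| Garlic stir-fried whelks | 59k |
| Crab cake with mustard greens | 55k |
| Three-shred stir-fried fish parcels | 55k |
| Grilled pig's tail | 50k |
| Stir-fried sour-and-spicy wild boar | 55k |
| Wild boar stir-fried with cilantro | 55k |
| Stir-fried sour-and-spicy buffalo | 55k |
| Buffalo stir-fried with cilantro | 55k |
| Grilled basa fish with cabbage (two fish) | 65k |
| Grilled basa fish with green pepper (two fish) | 65k |
| Chicken hot pot with ginger leaves | 59k |
| Frog hot pot with ginger leaves | 65k |
| Cuttlefish hot pot | 65k |
| Beef hot pot | 69k |

After comparison, the most expensive dish is the beef hot pot, at 69k.</REASONING>

<COUNTER_ARGUMENTS>An opponent might raise the following points:

- **Some dishes are cheaper**: Some dishes have lower prices, for example stir-fried sour-and-spicy frog, salted-fish fried frog, chili grilled chicken, salted-fish fried chicken, stir-fried sour-and-spicy sea snails, stir-fried sour-and-spicy clams, garlic stir-fried clams, stir-fried sour-and-spicy whelks, garlic stir-fried whelks, crab cake with mustard greens, three-shred stir-fried fish parcels, grilled pig's tail, stir-fried sour-and-spicy wild boar, wild boar stir-fried with cilantro, stir-fried sour-and-spicy buffalo, buffalo stir-fried with cilantro, grilled basa fish with cabbage (two fish), grilled basa fish with green pepper (two fish), chicken hot pot with ginger leaves, frog hot pot with ginger leaves, cuttlefish hot pot.

- **Prices may change**: The price of a dish may vary with time and place. Comparing prices based on the information in the image may not reflect the actual prices accurately.</COUNTER_ARGUMENTS>

<REASONING>The counter-arguments focus mainly on comparing the cheaper dishes. However, the goal of this question is to identify the most expensive dish, not to compare the prices of all dishes. We have already compared the prices of all dishes and found that the most expensive dish is the beef hot pot, at 69k.</REASONING>

<CONCLUSION>Beef hot pot</CONCLUSION>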
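
The comparison in the price table above can be checked mechanically. The sketch below copies a subset of the table rows (dish names given in English), reads the "k" suffix as thousands of Vietnamese dong (an assumption about the menu's notation), and picks the maximum; `parse_price` is a hypothetical helper, not part of the model's output.

```python
def parse_price(text: str) -> int:
    """Convert a menu price such as '69k' to an integer amount (69000)."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(text[:-1]) * 1000
    return int(text)


# A subset of rows from the table, including every 65k and 69k entry.
menu = {
    "Stir-fried sour-and-spicy frog": "55k",
    "Stir-fried sour-and-spicy whelks": "59k",
    "Grilled basa fish with cabbage (two fish)": "65k",
    "Grilled basa fish with green pepper (two fish)": "65k",
    "Frog hot pot with ginger leaves": "65k",
    "Cuttlefish hot pot": "65k",
    "Beef hot pot": "69k",
}

priciest = max(menu, key=lambda dish: parse_price(menu[dish]))
print(priciest, parse_price(menu[priciest]))  # Beef hot pot 69000
```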

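Example 1 image

Example 2

Example 2 image

User: <image>
List all the text.
Assistant: GOVERNMENT
No.: 626/QĐ-TTg
SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness
Hanoi, July 29, 2002
DECISION OF THE PRIME MINISTER
Approving the feasibility study report of the investment project
Diammonium phosphate (DAP) fertilizer plant construction project in the Dinh Vu Economic Zone, Hai Phong
THE PRIME MINISTER
Pursuant to the Law on Organization of the Government of December 25, 2001;
Pursuant to Government Decree No. 52/1999/NĐ-CP of July 8, 1999 (the Regulation on Investment and Construction Management) and Government Decree No. 12/2000/ND-CP of May 5, 2000 (amending and supplementing a number of articles of the Regulation on Investment and Construction Management).
Considering the request of the Vietnam Chemical Corporation (Official Letter No. 916/CV-HĐQT of November 1, 2001) and the appraisal opinion of the Ministry of Planning and Investment (Official Letter No. 1944/BKH/VPTD of April 1, 2002) on approving the feasibility study report of the diammonium phosphate fertilizer plant investment project in the Dinh Vu Economic Zone, Hai Phong, together with the opinions of the ministries and agencies at the meeting of June 19, 2002,
DECIDES:
Article 1. To approve the feasibility study report of the diammonium phosphate fertilizer plant investment project in the Dinh Vu Economic Zone, Hai Phong, with the following main contents:
1. Project name: Diammonium phosphate fertilizer plant in the Dinh Vu Economic Zone, Hai Phong.
2. Investment objective: to help ensure a stable and proactive supply of diammonium phosphate fertilizer in support of agricultural development, to reduce imports, and to make more effective use of domestic apatite resources.
3. Investor: Vietnam Chemical Corporation.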


📄 License

This project is released under the MIT License.

📚 Citation

@misc{doan2024vintern1befficientmultimodallarge,
      title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese}, 
      author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
      year={2024},
      eprint={2408.12480},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.12480}, 
}

📚 References

[1] Z. Chen et al., "Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling", arXiv preprint arXiv:2412.05271, 2024.