Vintern-3B-R-beta Open-Source Multi-Modal Model - Free Deployment to Assist with Complex Image Reasoning Tasks

Vintern 3B R Beta

Developed by 5CD-AI

Vintern-3B-R-beta is a multimodal large language model focused on complex image-based reasoning. It decomposes its reasoning into explicit steps and keeps hallucinations under control.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Multimodal Reasoning #Vietnamese OCR #Structured Document Parsing

Downloads 1,841

Release Time: 3/19/2025

Model Overview

This model combines visual and language processing capabilities, excelling in handling structured document images and complex problem reasoning, supporting Vietnamese, English, and Chinese.

Model Features

Complex Reasoning Capability

Capable of performing long-chain complex reasoning based on images, breaking down reasoning steps into multiple sub-steps

Multilingual Support

Supports processing in three languages: Vietnamese, English, and Chinese

Hallucination Control

Keeps hallucinations under control during reasoning

Multimodal Processing

Combines visual and language processing capabilities to handle structured document images

Model Capabilities

Image Understanding

Complex Reasoning

Multilingual Text Generation

Structured Document Processing

OCR Text Extraction

Use Cases

Food & Beverage Industry

Menu Price Analysis

Extracts dish information from restaurant menu images and compares prices

Accurately identifies the highest-priced dish

Government Document Processing

Official Document Text Extraction

Extracts complete text content from government document images

Accurately extracts content from Vietnamese government documents

🚀 Vintern Reasoning Model

This is a multimodal large language model combined with a reasoning model. It can perform long, complex reasoning over images, breaking each reasoning step into multiple sub-steps while keeping hallucinations under control.

More than six months after creating the [5CD-AI/LLaVA-CoT-o1-Instruct](https://huggingface.co/datasets/5CD-AI/LLaVA-CoT-o1-Instruct) dataset, one of Hugging Face's most liked datasets of 2024 🎉, we have just completed the "base" version of the Vintern Reasoning Model!

Despite the difficulty of balancing multiple tasks alongside reasoning, Vintern-3B-R-beta has outperformed all previous versions across various benchmarks!

When should you choose [Vintern-1B-v3_5](https://huggingface.co/5CD-AI/Vintern-1B-v3_5) vs Vintern-3B-R-beta? 🤔

Vintern-1B-v3_5: Fast ⚡ and good for Vietnamese OCR with simple text formatting. 📝 Highly reliable. ✅
Vintern-3B-R-beta: Better for complex questions and complex structured document images. 🔍📚 OCR performance on blurred or unclear text may be slightly lower because training focused on reasoning. 🔍🤖

🚀 The next step? Further training to strengthen its reasoning ability with reinforcement learning!

📈 Benchmarks

[Image: benchmark results]

💻 Usage Examples

Basic Usage

Example 1

<SUMMARY>The problem requires determining the most expensive dish on the menu of a restaurant. To solve this problem, we need to analyze the price of each dish from the menu and compare them to find the most expensive one.</SUMMARY>  

<CAPTION>The image shows a part of the restaurant's menu, listing various dishes with their prices. The menu is divided into two columns, each listing different dishes with corresponding prices.</CAPTION>  

<FIND_CANDIDATES_REASONING>To find the most expensive dish, we need to perform the following steps:  

1. **Identify the list of dishes**: The menu provides a complete list of dishes, including: Stir-fried frog with lemongrass and chili, Deep-fried frog with fish sauce, Grilled chicken with chili, Deep-fried chicken with fish sauce, Stir-fried whelk with lemongrass and chili, Stir-fried scallop with lemongrass and chili, Stir-fried scallop with garlic butter, Stir-fried fragrant whelk with lemongrass and chili, Stir-fried fragrant whelk with garlic butter, Crab cake with water spinach, Stir-fried baozi with basa fish, Grilled pork fingers with chili, Stir-fried wild boar with lemongrass and chili, Stir-fried wild boar with coriander, Stir-fried beef with lemongrass and chili, Stir-fried beef with coriander, Grilled snakehead fish wrapped in lettuce (2 pieces), Grilled snakehead fish with green chili (2 pieces), Chicken soup with giang leaves, Frog soup with giang leaves, Squid soup, Beef soup.  

2. **Determine the price of each dish**: The price of each dish is clearly written next to the dish name. For example, stir-fried frog with lemongrass and chili costs 55k, deep-fried frog with fish sauce costs 55k, grilled chicken with chili costs 50k, etc.  

3. **Compare the prices of dishes**: Compare the prices of each dish to find the most expensive one. This process requires carefulness to ensure that no dish is overlooked.</FIND_CANDIDATES_REASONING>  

<TOP3_CANDIDATES>  
1. Grilled snakehead fish wrapped in lettuce (2 pieces) 65k  
2. Grilled snakehead fish with green chili (2 pieces) 65k  
3. Beef soup 69k  
</TOP3_CANDIDATES>  

<REASONING>To determine the most expensive dish, we need to compare the prices of all dishes on the menu. Based on the information in the image, we can create the following price table:  

| Dish | Price (VND) |  
|-------------------------------|--------|  
| Stir-fried frog with lemongrass and chili | 55k |  
| Deep-fried frog with fish sauce | 55k |  
| Grilled chicken with chili | 50k |  
| Deep-fried chicken with fish sauce | 50k |  
| Stir-fried whelk with lemongrass and chili | 45k |  
| Stir-fried scallop with lemongrass and chili | 49k |  
| Stir-fried scallop with garlic butter | 49k |  
| Stir-fried fragrant whelk with lemongrass and chili | 59k |  
| Stir-fried fragrant whelk with garlic butter | 59k |  
| Crab cake with water spinach | 55k |  
| Stir-fried baozi with basa fish | 55k |  
| Grilled pork fingers with chili | 50k |  
| Stir-fried wild boar with lemongrass and chili | 55k |  
| Stir-fried wild boar with coriander | 55k |  
| Stir-fried beef with lemongrass and chili | 55k |  
| Stir-fried beef with coriander | 55k |  
| Grilled snakehead fish wrapped in lettuce (2 pieces) | 65k |  
| Grilled snakehead fish with green chili (2 pieces) | 65k |  
| Chicken soup with giang leaves | 59k |  
| Frog soup with giang leaves | 65k |  
| Squid soup | 65k |  
| Beef soup | 69k |  

After comparison, the most expensive dish is Beef soup with a price of 69k.</REASONING>  

<COUNTER_ARGUMENTS>Those who oppose may present the following arguments:  

- **Some dishes are cheaper**: Some dishes are cheaper, such as stir-fried frog with lemongrass and chili, deep-fried frog with fish sauce, grilled chicken with chili, deep-fried chicken with fish sauce, stir-fried whelk with lemongrass and chili, stir-fried scallop with lemongrass and chili, stir-fried scallop with garlic butter, stir-fried fragrant whelk with lemongrass and chili, stir-fried fragrant whelk with garlic butter, crab cake with water spinach, stir-fried baozi with basa fish, grilled pork fingers with chili, stir-fried wild boar with lemongrass and chili, stir-fried wild boar with coriander, stir-fried beef with lemongrass and chili, stir-fried beef with coriander, grilled snakehead fish wrapped in lettuce (2 pieces), grilled snakehead fish with green chili (2 pieces), chicken soup with giang leaves, frog soup with giang leaves, squid soup.  

- **Prices may change**: The prices of dishes may change depending on the time and location. Comparing prices based on the information in the image may not accurately reflect the actual prices.</COUNTER_ARGUMENTS>  

<REASONING>The opposing arguments focus on comparing the prices of cheaper dishes. However, the purpose of the problem is to determine the most expensive dish, not to compare the prices of all dishes. We have compared the prices of all dishes and concluded that the most expensive dish is Beef soup with a price of 69k.</REASONING>  

<CONCLUSION>Beef soup</CONCLUSION>
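
Because the response wraps each reasoning stage in its own tag, downstream code can split it into sections and read only the final answer. The following is a minimal sketch, assuming the response always uses well-formed `<TAG>...</TAG>` pairs like the sample above. The helper names `parse_sections` and `final_answer` are illustrative and not part of the model's API.

```python
import re

# Matches <TAG>...</TAG> pairs whose names are upper-case letters, digits, or underscores.
SECTION_PATTERN = re.compile(r"<([A-Z0-9_]+)>(.*?)</\1>", re.DOTALL)

def parse_sections(response):
    """Map each tag name to the list of text blocks found under it.

    A list is used because a tag such as REASONING can appear more than once.
    """
    sections = {}
    for match in SECTION_PATTERN.finditer(response):
        tag, body = match.group(1), match.group(2).strip()
        sections.setdefault(tag, []).append(body)
    return sections

def final_answer(response):
    """Return the text of the last CONCLUSION block, or None if there is none."""
    conclusions = parse_sections(response).get("CONCLUSION", [])
    return conclusions[-1] if conclusions else None

# Shortened stand-in for a real model response.
sample = (
    "<SUMMARY>Find the most expensive dish.</SUMMARY>\n"
    "<REASONING>Beef soup costs 69k.</REASONING>\n"
    "<REASONING>No other dish exceeds 69k.</REASONING>\n"
    "<CONCLUSION>Beef soup</CONCLUSION>"
)
print(final_answer(sample))  # Beef soup
```

If generation is cut off by the token limit, a closing tag may be missing; in that case this sketch simply skips the unterminated section.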

Example 2

User: <image>
List all the text.
Assistant: GOVERNMENT
No.: 626/QĐ-TTg
SOCIALIST REPUBLIC OF VIET NAM
Independence - Freedom - Happiness
Hanoi, July 29, 2002
DECISION OF THE PRIME MINISTER
Approving the Feasibility Study Report of the Investment Project
Factory for the production of diammonium phosphate fertilizer (DAP) in the
Dinh Vu Economic Zone, Hai Phong City
PRIME MINISTER (GOVERNMENT)
Based on the Government Organization Law of December 25, 2001;
Based on Decree No. 52/1999/NĐ-CP of July 8, 1999 of the Government
(Regulations on investment management, construction, and the Digital Industry No. 12/2000/ND-CP
of May 5, 2000) of the Government, which stipulates amendments and supplements to some
articles of the Regulations on investment management and construction.
Considering the proposal of the Vietnam Chemical Corporation (Document No. 916/CV-HĐQT
of November 1, 2001), the appraisal opinion of the Ministry of Planning and Investment in
Document No. 1944 BKH/VPTD of April 1, 2002, regarding the approval of
the Feasibility Study Report of the Investment Project for a Factory for the production of diammonium
phosphate fertilizer in the Dinh Vu Economic Zone, Hai Phong City, and the opinions of relevant
ministries and sectors at the meeting on June 19, 2002,
DECIDES:
Article 1. Approve the Feasibility Study Report of the Investment Project for a Factory for the
production of DAP fertilizer in the Dinh Vu Economic Zone, Hai Phong City, with the following main
contents:
1. Project name: Factory for the production of DAP fertilizer in the Dinh Vu Economic Zone,
Hai Phong City.
2. Investment objective: To contribute to ensuring stability and initiative in the supply of DAP
fertilizer for agricultural development, limit imports, and make more effective use of domestic apatite
mineral resources.
3. Investor: Vietnam Chemical Corporation.

Advanced Usage

import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate every (columns, rows) tile grid with min_num..max_num tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the grid whose aspect ratio is closest to the image's aspect ratio
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-3B-R-beta",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    use_flash_attn=False,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-3B-R-beta", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'

think_prompt_format = """<image>\nYou are a very cautious and skeptical person. Please answer the following question in Vietnamese. When reasoning, you usually list evidence to point out possible answers, reason, and explain why you choose and eliminate before giving the final answer.
Question:
{question_input}
Please answer at length in the following format:
<SUMMARY>...</SUMMARY>
<CAPTION>...</CAPTION>
<INFORMATION_EXTRACT>...</INFORMATION_EXTRACT>
<EXTERNAL_KNOWLEDGE_EXPANSION>...</EXTERNAL_KNOWLEDGE_EXPANSION>
<FIND_CANDIDATES_REASONING>...</FIND_CANDIDATES_REASONING>
<TOP3_CANDIDATES>...</TOP3_CANDIDATES>
<REASONING_PLAN>...</REASONING_PLAN>
<REASONING>...</REASONING>
<COUNTER_ARGUMENTS>...</COUNTER_ARGUMENTS>
<VALIDATION_REASONING>...</VALIDATION_REASONING>
<CONCLUSION>...</CONCLUSION>
"""

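pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)

# Plain question that does not use the reasoning template above.
question = '<image>\nExtract the main information in the image and return it in markdown format.'
# To request the tagged step-by-step output instead, the template can be filled in, e.g.:
# question = think_prompt_format.format(question_input='Your question here')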

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

#question = "Another question ......"
#response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
#print(f'User: {question}\nAssistant: {response}')
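
To make the tiling step in `dynamic_preprocess` easier to follow, the sketch below reproduces only the grid selection in plain Python, without torch or PIL. It mirrors the logic of `find_closest_aspect_ratio` above; the helper names `candidate_grids`, `pick_grid`, and `tile_count` are illustrative and do not exist in the model's code.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for n in range(min_num, max_num + 1)
        for cols in range(1, n + 1)
        for rows in range(1, n + 1)
        if min_num <= cols * rows <= max_num
    }
    # Smaller grids come first, as in the listing above.
    return sorted(grids, key=lambda g: g[0] * g[1])

def pick_grid(width, height, max_num=12, image_size=448):
    """Choose the grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(1, max_num):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff:
            # On a tie, move to the larger grid only if the image has enough pixels.
            if width * height > 0.5 * image_size * image_size * cols * rows:
                best = (cols, rows)
    return best

def tile_count(width, height, max_num=12, use_thumbnail=True):
    """Number of 448x448 crops produced, including the optional thumbnail."""
    cols, rows = pick_grid(width, height, max_num=max_num)
    tiles = cols * rows
    if use_thumbnail and tiles != 1:
        tiles += 1
    return tiles

print(pick_grid(800, 600, max_num=6), tile_count(800, 600, max_num=6))  # (3, 2) 7
print(pick_grid(1600, 1200), tile_count(1600, 1200))                    # (4, 3) 13
```

With `max_num=6`, as in the usage example, an 800x600 image cannot use the exact 4:3 grid and falls back to 3x2 tiles plus a thumbnail. Raising `max_num` gives more tiles for large documents at the cost of more memory and compute.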

📄 License

This project is under the MIT license.

📚 Citation

@misc{doan2024vintern1befficientmultimodallarge,
      title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese}, 
      author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
      year={2024},
      eprint={2408.12480},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.12480}, 
}

📚 Reference

[1] Z. Chen et al., ‘Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling’, arXiv preprint arXiv:2412.05271, 2024.
