
🚀 Vintern Reasoning Model
This is a multimodal large language model combined with a reasoning model. It can perform long and complex reasoning based on images, breaking down each reasoning step into multiple sub-steps while keeping hallucinations under control.
More than six months after creating the [5CD-AI/LLaVA-CoT-o1-Instruct](https://huggingface.co/datasets/5CD-AI/LLaVA-CoT-o1-Instruct) dataset, one of Hugging Face's most liked datasets of 2024 🎉, we have just completed the "base" version of the Vintern Reasoning Model!
- This model can perform long and complex reasoning based on images, breaking down each reasoning step into multiple sub-steps while keeping hallucinations under control.
- Despite the difficulty of balancing multiple tasks alongside reasoning, Vintern-3B-R-beta has outperformed all previous versions across various benchmarks!
When should you choose [Vintern-1B-v3_5](https://huggingface.co/5CD-AI/Vintern-1B-v3_5) vs Vintern-3B-R-beta? 🤔
- Vintern-1B-v3_5: Fast ⚡ and good for Vietnamese OCR with simple text formatting. 📝 Highly reliable. ✅
- Vintern-3B-R-beta: Better for complex questions and complex, structured document images. 🔍📚 OCR performance on blurred or unclear text may be slightly lower because training focused on reasoning. 🔍🤖
🚀 The next step? Improving its reasoning ability with reinforcement learning!
📈 Benchmarks
💻 Usage Examples
Basic Usage
Example 1
<SUMMARY>The problem requires determining the most expensive dish on the menu of a restaurant. To solve this problem, we need to analyze the price of each dish from the menu and compare them to find the most expensive one.</SUMMARY>
<CAPTION>The image shows a part of the restaurant's menu, listing various dishes with their prices. The menu is divided into two columns, each listing different dishes with corresponding prices.</CAPTION>
<FIND_CANDIDATES_REASONING>To find the most expensive dish, we need to perform the following steps:
1. **Identify the list of dishes**: The menu provides a complete list of dishes, including: Stir-fried frog with lemongrass and chili, Deep-fried frog with fish sauce, Grilled chicken with chili, Deep-fried chicken with fish sauce, Stir-fried whelk with lemongrass and chili, Stir-fried scallop with lemongrass and chili, Stir-fried scallop with garlic butter, Stir-fried fragrant whelk with lemongrass and chili, Stir-fried fragrant whelk with garlic butter, Crab cake with water spinach, Stir-fried baozi with basa fish, Grilled pork fingers with chili, Stir-fried wild boar with lemongrass and chili, Stir-fried wild boar with coriander, Stir-fried beef with lemongrass and chili, Stir-fried beef with coriander, Grilled snakehead fish wrapped in lettuce (2 pieces), Grilled snakehead fish with green chili (2 pieces), Chicken soup with giang leaves, Frog soup with giang leaves, Squid soup, Beef soup.
2. **Determine the price of each dish**: The price of each dish is clearly written next to the dish name. For example, stir-fried frog with lemongrass and chili costs 55k, deep-fried frog with fish sauce costs 55k, grilled chicken with chili costs 50k, etc.
3. **Compare the prices of dishes**: Compare the prices of each dish to find the most expensive one. This process requires carefulness to ensure that no dish is overlooked.</FIND_CANDIDATES_REASONING>
<TOP3_CANDIDATES>
1. Grilled snakehead fish wrapped in lettuce (2 pieces) 65k
2. Grilled snakehead fish with green chili (2 pieces) 65k
3. Beef soup 69k
</TOP3_CANDIDATES>
<REASONING>To determine the most expensive dish, we need to compare the prices of all dishes on the menu. Based on the information in the image, we can create the following price table:
| Dish | Price (VND) |
|-------------------------------|--------|
| Stir-fried frog with lemongrass and chili | 55k |
| Deep-fried frog with fish sauce | 55k |
| Grilled chicken with chili | 50k |
| Deep-fried chicken with fish sauce | 50k |
| Stir-fried whelk with lemongrass and chili | 45k |
| Stir-fried scallop with lemongrass and chili | 49k |
| Stir-fried scallop with garlic butter | 49k |
| Stir-fried fragrant whelk with lemongrass and chili | 59k |
| Stir-fried fragrant whelk with garlic butter | 59k |
| Crab cake with water spinach | 55k |
| Stir-fried baozi with basa fish | 55k |
| Grilled pork fingers with chili | 50k |
| Stir-fried wild boar with lemongrass and chili | 55k |
| Stir-fried wild boar with coriander | 55k |
| Stir-fried beef with lemongrass and chili | 55k |
| Stir-fried beef with coriander | 55k |
| Grilled snakehead fish wrapped in lettuce (2 pieces) | 65k |
| Grilled snakehead fish with green chili (2 pieces) | 65k |
| Chicken soup with giang leaves | 59k |
| Frog soup with giang leaves | 65k |
| Squid soup | 65k |
| Beef soup | 69k |
After comparison, the most expensive dish is Beef soup with a price of 69k.</REASONING>
<COUNTER_ARGUMENTS>Those who oppose may present the following arguments:
- **Some dishes are cheaper**: Some dishes are cheaper, such as stir-fried frog with lemongrass and chili, deep-fried frog with fish sauce, grilled chicken with chili, deep-fried chicken with fish sauce, stir-fried whelk with lemongrass and chili, stir-fried scallop with lemongrass and chili, stir-fried scallop with garlic butter, stir-fried fragrant whelk with lemongrass and chili, stir-fried fragrant whelk with garlic butter, crab cake with water spinach, stir-fried baozi with basa fish, grilled pork fingers with chili, stir-fried wild boar with lemongrass and chili, stir-fried wild boar with coriander, stir-fried beef with lemongrass and chili, stir-fried beef with coriander, grilled snakehead fish wrapped in lettuce (2 pieces), grilled snakehead fish with green chili (2 pieces), chicken soup with giang leaves, frog soup with giang leaves, squid soup.
- **Prices may change**: The prices of dishes may change depending on the time and location. Comparing prices based on the information in the image may not accurately reflect the actual prices.</COUNTER_ARGUMENTS>
<REASONING>The opposing arguments focus on comparing the prices of cheaper dishes. However, the purpose of the problem is to determine the most expensive dish, not to compare the prices of all dishes. We have compared the prices of all dishes and concluded that the most expensive dish is Beef soup with a price of 69k.</REASONING>
<CONCLUSION>Beef soup</CONCLUSION>
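
The tagged response above can be post-processed with a few lines of standard-library Python. The following is a minimal sketch, not part of the model's API: the helper names `parse_sections` and `final_answer` are hypothetical, and it assumes the model emits well-formed, non-nested `<TAG>...</TAG>` pairs as in Example 1. Because a tag such as `<REASONING>` can appear more than once, sections are returned as an ordered list rather than a dictionary.

```python
import re
from typing import List, Optional, Tuple

# Matches one <TAG>...</TAG> pair; the backreference \1 requires the
# closing tag to carry the same name as the opening tag.
TAG_PATTERN = re.compile(r"<([A-Z0-9_]+)>(.*?)</\1>", re.DOTALL)


def parse_sections(response: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (tag, text) pairs in order of appearance."""
    return [(tag, body.strip()) for tag, body in TAG_PATTERN.findall(response)]


def final_answer(response: str) -> Optional[str]:
    """Hypothetical helper: text of the last <CONCLUSION> block, or None."""
    conclusions = [body for tag, body in parse_sections(response) if tag == "CONCLUSION"]
    return conclusions[-1] if conclusions else None


# Shortened stand-in for a real model response.
sample = (
    "<SUMMARY>Find the most expensive dish.</SUMMARY>\n"
    "<REASONING>Beef soup costs 69k.</REASONING>\n"
    "<REASONING>No other dish costs more.</REASONING>\n"
    "<CONCLUSION>Beef soup</CONCLUSION>"
)
print(final_answer(sample))  # Beef soup
```

If generation stops at the token limit before `</CONCLUSION>` is emitted, `final_answer` returns `None`, so callers should handle that case.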

Example 2

User: <image>
List all the text.
Assistant: GOVERNMENT
No.: 626/QĐ-TTg
SOCIALIST REPUBLIC OF VIET NAM
Independence - Freedom - Happiness
Hanoi, July 29, 2002
DECISION OF THE PRIME MINISTER
Approving the Feasibility Study Report of the Investment Project
Factory for the production of diammonium phosphate fertilizer (DAP) in the
Dinh Vu Economic Zone, Hai Phong City
PRIME MINISTER (GOVERNMENT)
Based on the Government Organization Law of December 25, 2001;
Based on Decree No. 52/1999/NĐ-CP of July 8, 1999 of the Government
(Regulations on investment management, construction, and the Digital Industry No. 12/2000/ND-CP
of May 5, 2000) of the Government, which stipulates amendments and supplements to some
articles of the Regulations on investment management and construction.
Considering the proposal of the Vietnam Chemical Corporation (Document No. 916/CV-HĐQT
of November 1, 2001), the appraisal opinion of the Ministry of Planning and Investment in
Document No. 1944 BKH/VPTD of April 1, 2002, regarding the approval of
the Feasibility Study Report of the Investment Project for a Factory for the production of diammonium
phosphate fertilizer in the Dinh Vu Economic Zone, Hai Phong City, and the opinions of relevant
ministries and sectors at the meeting on June 19, 2002,
DECIDES:
Article 1. Approve the Feasibility Study Report of the Investment Project for a Factory for the
production of DAP fertilizer in the Dinh Vu Economic Zone, Hai Phong City, with the following main
contents:
1. Project name: Factory for the production of DAP fertilizer in the Dinh Vu Economic Zone,
Hai Phong City.
2. Investment objective: To contribute to ensuring stability and initiative in the supply of DAP
fertilizer for agricultural development, limit imports, and make more effective use of domestic apatite
mineral resources.
3. Investor: Vietnam Chemical Corporation.
Advanced Usage
import numpy as np
import torch
import torchvision.transforms as T
# from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Convert to RGB, resize to a square tile, and normalize with ImageNet statistics.
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

# Pick the (columns, rows) tile grid whose aspect ratio is closest to the image's.
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # on a tie, prefer the larger grid if the image is big enough to fill it
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

# Resize the image to the chosen grid and cut it into image_size x image_size tiles.
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    # calculate the existing image aspect ratio
    aspect_ratio = orig_width / orig_height

    # enumerate every (columns, rows) grid with min_num..max_num tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    # append a downscaled copy of the whole image when it was split into several tiles
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

# Load an image file and return a stacked tensor of normalized tiles.
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    "5CD-AI/Vintern-3B-R-beta",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    use_flash_attn=False,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("5CD-AI/Vintern-3B-R-beta", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'
think_prompt_format = """<image>\nYou are a very cautious and skeptical person. Please answer the following question in Vietnamese. When reasoning, you usually list evidence to point out possible answers, reason, and explain why you choose and eliminate before giving the final answer.
Question:
{question_input}
Please answer at length in the following format:
<SUMMARY>...</SUMMARY>
<CAPTION>...</CAPTION>
<INFORMATION_EXTRACT>...</INFORMATION_EXTRACT>
<EXTERNAL_KNOWLEDGE_EXPANSION>...</EXTERNAL_KNOWLEDGE_EXPANSION>
<FIND_CANDIDATES_REASONING>...</FIND_CANDIDATES_REASONING>
<TOP3_CANDIDATES>...</TOP3_CANDIDATES>
<REASONING_PLAN>...</REASONING_PLAN>
<REASONING>...</REASONING>
<COUNTER_ARGUMENTS>...</COUNTER_ARGUMENTS>
<VALIDATION_REASONING>...</VALIDATION_REASONING>
<CONCLUSION>...</CONCLUSION>
"""

pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)

question = '<image>\nExtract the main information in the image and return it in markdown format.'
# For step-by-step reasoning, build the question from the template above instead:
# question = think_prompt_format.format(question_input='...')

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# Follow-up turns reuse the returned history:
# question = "Another question ......"
# response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
# print(f'User: {question}\nAssistant: {response}')
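
To get a feel for how many tiles an image is split into before it reaches the model, the grid selection in `find_closest_aspect_ratio` and `dynamic_preprocess` can be reproduced without PIL or torch. The following is a rough sketch under stated assumptions: `candidate_grids`, `choose_grid`, and `tile_count` are hypothetical names introduced here, and ties between grids with the same tile count are broken by a deterministic secondary sort key, whereas the code above relies on set iteration order.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by tile count; the secondary key only makes the order deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def choose_grid(width, height, max_num=12, image_size=448):
    """Grid whose aspect ratio is closest to width / height."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(1, max_num):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size ** 2 * cols * rows:
            # On a tie, take the larger grid only if the image is big enough.
            best = (cols, rows)
    return best


def tile_count(width, height, max_num=12, use_thumbnail=True):
    """Number of tiles produced, including the optional thumbnail."""
    cols, rows = choose_grid(width, height, max_num)
    tiles = cols * rows
    return tiles + 1 if use_thumbnail and tiles > 1 else tiles


print(choose_grid(800, 600, max_num=6))  # (3, 2)
print(tile_count(800, 600, max_num=6))   # 7
print(tile_count(448, 448, max_num=6))   # 1
```

With `max_num=6`, as in the call to `load_image` above, an 800x600 image is resized to a 3x2 grid of 448-pixel tiles plus one thumbnail, so `pixel_values` would have a leading dimension of 7. Raising `max_num` allows finer grids at the cost of more tiles per image.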
📄 License
This project is under the MIT license.
📚 Citation
@misc{doan2024vintern1befficientmultimodallarge,
title={Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese},
author={Khang T. Doan and Bao G. Huynh and Dung T. Hoang and Thuc D. Pham and Nhat H. Pham and Quan T. M. Nguyen and Bang Q. Vo and Suong N. Hoang},
year={2024},
eprint={2408.12480},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2408.12480},
}
📚 Reference
[1] Z. Chen et al., ‘Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling’, arXiv preprint arXiv:2412.05271, 2024.






