Finetune_VQA_1B: Open-Source Visual Question Answering Model with Vietnamese Support for Image Content Understanding and Q&A

Developed by TienAnh

A visual question answering model fine-tuned based on InternVL3-1B and Vintern-1B-v3_5, supporting Vietnamese, suitable for image content understanding and question-answering tasks.

Visual Question Answering

Safetensors

Open Source License: Apache-2.0

Tags: Vietnamese Visual Question Answering, Dynamic Image Slice Processing, Multimodal Large Model

Downloads 20

Release Date: 5/10/2025

Model Overview

This model is a visual question answering (VQA) model that reads an image and answers questions about its content. It is fine-tuned from InternVL3-1B and Vintern-1B-v3_5 and is optimized for Vietnamese.

Model Features

Multi-slice Image Processing

Supports dynamic image preprocessing: each image is resized onto the grid of 448×448 tiles whose aspect ratio is closest to the original, split into those tiles (up to 12 by default), and optionally accompanied by a thumbnail of the whole image.
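The grid selection behind this preprocessing can be explored without loading any image. The following is a minimal pure-Python sketch that mirrors the selection logic of the Quick Start code; the helper names candidate_grids, pick_grid, and tile_count are illustrative and are not part of the model's API.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids using between min_num and max_num tiles, smallest first."""
    grids = {
        (cols, rows)
        for n in range(min_num, max_num + 1)
        for cols in range(1, n + 1)
        for rows in range(1, n + 1)
        if min_num <= cols * rows <= max_num
    }
    return sorted(grids, key=lambda g: g[0] * g[1])


def pick_grid(width, height, max_num=12, image_size=448):
    """Pick the grid whose aspect ratio is closest to width / height.

    On an exact tie, a larger grid wins only if the image covers more than
    half of that grid's pixel area.
    """
    aspect = width / height
    area = width * height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(1, max_num):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            best = (cols, rows)
    return best


def tile_count(width, height, max_num=12, use_thumbnail=True):
    """Number of tiles the model receives, including the optional thumbnail."""
    cols, rows = pick_grid(width, height, max_num)
    tiles = cols * rows
    # A thumbnail is only appended when the image was split into more than one tile.
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


if __name__ == "__main__":
    for size in [(448, 448), (1600, 1200), (896, 448), (1792, 896)]:
        print(size, "-> grid", pick_grid(*size), "tiles", tile_count(*size))
```

Under these assumptions, a small 2:1 image stays on a 2×1 grid while a larger image with the same aspect ratio is promoted to 4×2, which is the effect of the area-based tie-break.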

Vietnamese Optimization

Specifically optimized and fine-tuned for Vietnamese, performing well in Vietnamese visual question-answering tasks.

Efficient Inference

Loads in bfloat16 precision and offers optional flash attention (the use_flash_attn flag), reducing memory use and improving inference speed while maintaining accuracy.
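As a rough illustration of why bfloat16 helps, the sketch below estimates the memory taken by the weights alone. The parameter count of one billion is an assumption taken from the "1B" in the model name, not a measured figure, and weight_memory_gib is a hypothetical helper.

```python
# Bytes used to store one value in each precision.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}

# Assumption: roughly 1e9 parameters, inferred only from the "1B" in the model name.
ASSUMED_PARAMS = 1_000_000_000


def weight_memory_gib(num_params, dtype):
    """Approximate size of the weights in GiB (activations and caches are not counted)."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: ~{weight_memory_gib(ASSUMED_PARAMS, dtype):.2f} GiB of weights")
```

With this assumed size, the weights take about 3.7 GiB in float32 and about 1.9 GiB in bfloat16. Actual usage during inference is higher because of activations and the image tiles.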

Model Capabilities

Image Content Understanding

Visual Question Answering

Key Information Extraction from Images

Multilingual Support (Primarily Vietnamese)

Use Cases

Education

Vietnamese Learning Assistance

Helps students understand Vietnamese vocabulary and expressions through images.

Enhances language learning efficiency and engagement.

Content Moderation

Image Content Analysis

Automatically analyzes image content and answers related questions.

Improves moderation efficiency and accuracy.

🚀 Visual Question Answering Model

This project is a visual question-answering model that extracts key information from images and responds in the format requested in the prompt, such as Markdown. It is built on pre-trained models and fine-tuned for visual question-answering tasks.

🚀 Quick Start

Prerequisites

Install the required Python libraries: numpy, torch, torchvision, transformers, and Pillow. The example below also moves the model and image tensors to a CUDA GPU.

Code Example

import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

# Per-channel ImageNet statistics used to normalize every tile.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    """Return a transform that turns a PIL image into a normalized square tensor."""
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    """Return the (columns, rows) grid whose aspect ratio best matches the image."""
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # on a tie, prefer the larger grid only if the image fills over half its area
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate every (columns, rows) grid that uses between min_num and max_num tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the grid whose aspect ratio is closest to the image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    """Load an image file and return its tiles as a tensor of shape (num_tiles, 3, input_size, input_size)."""
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    tiles = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(tile) for tile in tiles]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    "TienAnh/Finetune_VQA_1B",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    use_flash_attn=False,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained("TienAnh/Finetune_VQA_1B", trust_remote_code=True, use_fast=False)

test_image = 'test-image.jpg'

pixel_values = load_image(test_image, max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)

question = '<image>\nExtract the main information in the image and return it in markdown format.'

response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# Follow-up turn: pass the returned history back in to continue the conversation.
# question = "Another question ......"
# response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
# print(f'User: {question}\nAssistant: {response}')

📦 Installation

This project depends on the following libraries:

numpy
torch
torchvision
transformers
Pillow

You can install these libraries using pip:

pip install numpy torch torchvision transformers pillow
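After installing, it can help to confirm that each library is importable before loading the model. The sketch below uses only the standard library; the missing_packages helper and the REQUIRED mapping are illustrative names. Note that the pillow package is imported as PIL.

```python
import importlib.util

# pip package name -> module name used in "import ...".
# Only Pillow differs: it is installed as "pillow" but imported as "PIL".
REQUIRED = {
    "numpy": "numpy",
    "torch": "torch",
    "torchvision": "torchvision",
    "transformers": "transformers",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only locates the modules and does not import them, so it runs quickly even when torch is installed.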

📚 Documentation

Model Information

Property	Details
Model Type	Visual Question Answering
Base Model	OpenGVLab/InternVL3-1B, 5CD-AI/Vintern-1B-v3_5
Fine-tuned Model	TienAnh/Finetune_VQA_1B

Usage Steps

Load the model and tokenizer: Use AutoModel.from_pretrained and AutoTokenizer.from_pretrained to load the fine-tuned model and tokenizer.
Preprocess the image: Use the load_image function to preprocess the input image.
Generate a response: Use the model.chat method to generate a response to the question.
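The last step can be wrapped in a small helper that also carries the conversation history between turns. This is a sketch under assumptions: ask is a hypothetical helper, and EchoModel is a stand-in that only imitates the call shape of model.chat from the Quick Start so the flow can be tried without a GPU. With the real model, pass the loaded model, tokenizer, and the tensor returned by load_image instead.

```python
def ask(model, tokenizer, pixel_values, question, generation_config, history=None):
    """Ask one question about a preprocessed image and return (response, history)."""
    # The Quick Start prompt begins with an "<image>" placeholder on the first turn;
    # add it when a new conversation about an image starts without one.
    if history is None and pixel_values is not None and "<image>" not in question:
        question = "<image>\n" + question
    return model.chat(
        tokenizer, pixel_values, question, generation_config,
        history=history, return_history=True,
    )


class EchoModel:
    """Stand-in for dry runs only; it is NOT the real model and returns canned text."""

    def chat(self, tokenizer, pixel_values, question, generation_config,
             history=None, return_history=False):
        history = list(history or [])
        response = f"(stub answer to: {question!r})"
        history.append((question, response))
        return (response, history) if return_history else response


if __name__ == "__main__":
    config = dict(max_new_tokens=1024, do_sample=False, num_beams=3, repetition_penalty=2.5)
    stub, pixels = EchoModel(), "placeholder-for-pixel-values"

    first, history = ask(stub, None, pixels, "Extract the main information in the image.", config)
    second, history = ask(stub, None, pixels, "Which date is shown?", config, history=history)
    print(len(history), "turns recorded")
```

Returning the history from every call keeps follow-up questions tied to the same image without re-sending the placeholder.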

📄 License

This project is licensed under the Apache-2.0 license.
