TBAC-VLR1-3B-preview: Open-Source Multimodal Language Model Optimized for Multimodal Reasoning

Developed by TencentBAC

A multimodal language model fine-tuned by Tencent PCG Basic Algorithm Center from Qwen2.5-VL-3B-Instruct. It reports state-of-the-art results on several multimodal reasoning benchmarks among models of the same scale.

Image-to-Text

Safetensors

Language: English

License: Apache-2.0

Tags: Multimodal Reasoning, Mathematical Visual Question Answering, GRPO Optimization

Release date: April 16, 2025

Model Overview

A vision-language model trained with Group Relative Policy Optimization (GRPO) to improve multimodal reasoning.

Model Features

GRPO Optimization Technology

Uses Group Relative Policy Optimization (GRPO) to strengthen multimodal reasoning.

Leading Performance

Reports the highest average score (35.7) among the compared 2B and 3B models across six multimodal reasoning benchmarks.

Mathematical Reasoning Capability

Scores 64.8 on MathVista, with results also reported on MathVision, MathVerse, DynaMath, and WeMath.

Model Capabilities

Multimodal Understanding

Vision-Language Reasoning

Mathematical Problem Solving

Logical Reasoning

Image-Text Generation

Use Cases

Education

Math Problem Solving

Analyzes questions containing mathematical formulas and diagrams

Achieves a score of 64.8 on the MathVista benchmark

Research

Multimodal Reasoning Research

Used for research on vision-language reasoning tasks

Achieves an average score of 35.7 across the six reasoning benchmarks reported in the performance table.

🚀 TBAC-VLR1-3B-preview

This is a multimodal language model fine-tuned by Tencent PCG Basic Algorithm Center. Based on Qwen2.5-VL-3B-Instruct, it uses Group Relative Policy Optimization (GRPO) to enhance multimodal reasoning ability, achieving state-of-the-art results on several multimodal reasoning benchmarks among models of the same size.

✨ Features

Based on Qwen2.5-VL-3B-Instruct.
Uses Group Relative Policy Optimization (GRPO) to enhance multimodal reasoning ability.
Achieves state-of-the-art results on several multimodal reasoning benchmarks among models of the same size.
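The source does not describe the training setup, so the following is only a minimal sketch of the group-relative advantage that gives GRPO its name, as the method is commonly described: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, not details from this model's training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group (illustrative sketch).

    `rewards` holds one scalar reward per response sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps keeps the division defined when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two judged correct (1.0), two incorrect (0.0)
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to provide a baseline.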

📚 Documentation

Performance

Property	Details
Model Type	Multimodal language model
Training Data	Not provided

Model	Average	MathVista	MathVision	MathVerse	DynaMath	WeMath	LogicVista
Qwen2-VL-2B	20.5	48.0	16.1	17.5	3.8	10.8	26.6
InternVL2.5-2B	21.2	51.1	14.0	22.3	4.4	8.0	27.3
InternVL3-2B	29.1	57.6	20.2	24.5	14.8	22.9	40.3
Qwen2.5-VL-3B	31.8	61.2	21.9	31.2	13.2	22.9	40.3
VLM-R1-3B-Math-0305	33.4	62.7	21.9	32.2	13.0	30.0	40.5
Taichu-VLR-3B	33.6	64.9	23.1	32.1	12.6	30.4	38.7
VLAA-Thinker-Qwen2.5VL-3B	35.4	61.0	24.4	36.4	18.2	33.8	38.5
TBAC-VLR1-3B-preview	35.7	64.8	25.0	33.2	17.7	32.4	40.8

Results for the compared models are taken from https://opencompass.org.cn.

The results of our model are self-reported, obtained by running evaluations offline on each benchmark.
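The Average column in the table above appears to be the unweighted mean of the six benchmark scores, rounded to one decimal place. This is inferred from the numbers rather than stated in the source, and the InternVL3-2B row does not reproduce this way, so treat the rule as approximate. A minimal sketch that checks two rows, using `Decimal` to avoid floating-point rounding surprises (the helper name `average` is illustrative):

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import List

# Scores copied from the table, in column order:
# MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista
SCORES = {
    "TBAC-VLR1-3B-preview": ["64.8", "25.0", "33.2", "17.7", "32.4", "40.8"],
    "Qwen2.5-VL-3B": ["61.2", "21.9", "31.2", "13.2", "22.9", "40.3"],
}


def average(values: List[str]) -> Decimal:
    """Unweighted mean of the scores, rounded half-up to one decimal place."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for model, values in SCORES.items():
    print(model, average(values))
# TBAC-VLR1-3B-preview 35.7
# Qwen2.5-VL-3B 31.8
```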

💻 Usage Examples

Basic Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "TencentBAC/TBAC-VLR1-3B-preview", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("TencentBAC/TBAC-VLR1-3B-preview")

# Placeholder inputs (not defined in the original snippet): replace with your own image and question
image_path = "path/to/image.png"
query = "What is the answer to the problem shown in the image?"

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant. The user asks a question, and you solve it. You need first think about the reasoning process in the mind and then provides the user with the answer. The answer are enclosed within \\boxed{} tags i.e., reasoning process here \\boxed{ answer here }."
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": query},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
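Because the system prompt asks the model to wrap its final answer in `\boxed{}`, downstream code usually needs to pull that answer out of the generated text. The helper below is a minimal sketch of one way to do this. The name `extract_boxed` is hypothetical and not part of the model's tooling, and it assumes the final answer is in the last `\boxed{...}` of the response.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # brace depth, so nested LaTeX such as \frac{1}{2} is kept intact
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # unbalanced braces, e.g. the generation was cut off


sample = "The triangle is right-angled, so x = 5. \\boxed{ 5 }"
print(extract_boxed(sample))  # 5
```

With the usage example above, this would be applied as `extract_boxed(output_text[0])`. A `None` result may indicate that `max_new_tokens` was too small for the model to finish its reasoning.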

📄 License

The model is released under the Apache-2.0 license.

@misc{Xu2025tbacvlr1,
  title={TBAC-VLR1-3B-preview},
  author={Junzhe Xu and Yuyang Yin},
  url={https://huggingface.co/TencentBAC/TBAC-VLR1-3B-preview},
  year={2025},
}
