🚀 TBAC-VLR1-3B-preview
TBAC-VLR1-3B-preview is a multimodal language model fine-tuned by the Tencent PCG Basic Algorithm Center. It is based on Qwen2.5-VL-3B-Instruct and uses Group Relative Policy Optimization (GRPO) to strengthen multimodal reasoning. Among models of the same size, it achieves state-of-the-art results on several multimodal reasoning benchmarks.
✨ Features
- Based on Qwen2.5-VL-3B-Instruct.
- Uses Group Relative Policy Optimization (GRPO) to enhance multimodal reasoning ability.
- Achieves state-of-the-art results on several multimodal reasoning benchmarks among models of the same size.
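For readers unfamiliar with GRPO, its central idea as published is to score each sampled response relative to the other responses sampled for the same prompt, rather than with a learned value model. The snippet below is a minimal sketch of that group-relative advantage step only. The function name, the example rewards, the use of the population standard deviation, and the `eps` term are illustrative assumptions, and this is not the training code used for this model.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population std and a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers (1.0 = correct, 0.0 = wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers in the group receive positive advantages and incorrect ones negative, so the policy update favors responses that beat their own group's average.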
📚 Documentation
Performance
| Property | Details |
|---|---|
| Model Type | Multimodal language model |
| Training Data | Not provided |
| Model | Average | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista |
|---|---|---|---|---|---|---|---|
| Qwen2-VL-2B | 20.5 | 48.0 | 16.1 | 17.5 | 3.8 | 10.8 | 26.6 |
| InternVL2.5-2B | 21.2 | 51.1 | 14.0 | 22.3 | 4.4 | 8.0 | 27.3 |
| InternVL3-2B | 29.1 | 57.6 | 20.2 | 24.5 | 14.8 | 22.9 | 40.3 |
| Qwen2.5-VL-3B | 31.8 | 61.2 | 21.9 | 31.2 | 13.2 | 22.9 | 40.3 |
| VLM-R1-3B-Math-0305 | 33.4 | 62.7 | 21.9 | 32.2 | 13.0 | 30.0 | 40.5 |
| Taichu-VLR-3B | 33.6 | 64.9 | 23.1 | 32.1 | 12.6 | 30.4 | 38.7 |
| VLAA-Thinker-Qwen2.5VL-3B | 35.4 | 61.0 | 24.4 | 36.4 | 18.2 | 33.8 | 38.5 |
| TBAC-VLR1-3B-preview | 35.7 | 64.8 | 25.0 | 33.2 | 17.7 | 32.4 | 40.8 |

Results for the compared models are taken from https://opencompass.org.cn.
Results for our model are self-reported and were obtained by running each benchmark offline.
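The Average column is not defined in this card. Assuming it is the unweighted mean of the six benchmark scores, rounded to one decimal place, the sketch below recomputes it for the TBAC-VLR1-3B-preview row and for the Qwen2.5-VL-3B base model. The function name is hypothetical, and `Decimal` is used only to avoid floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP


def average_score(scores):
    """Unweighted mean of benchmark scores, rounded half-up to one decimal."""
    total = sum(Decimal(s) for s in scores)
    mean = total / Decimal(len(scores))
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Scores in table order: MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista.
tbac_vlr1 = ["64.8", "25.0", "33.2", "17.7", "32.4", "40.8"]
qwen25_vl_3b = ["61.2", "21.9", "31.2", "13.2", "22.9", "40.3"]

print(average_score(tbac_vlr1))     # 35.7
print(average_score(qwen25_vl_3b))  # 31.8
```

Under that assumption, both values match the Average column for these two rows.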
💻 Usage Examples
Basic Usage
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholders: replace with your own image path and question.
image_path = "path/to/image.jpg"
query = "Your question about the image"

# Load the model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "TencentBAC/TBAC-VLR1-3B-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("TencentBAC/TBAC-VLR1-3B-preview")

# The system prompt asks the model to reason first and put the final answer in \boxed{}.
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant. The user asks a question, and you solve it. You need first think about the reasoning process in the mind and then provides the user with the answer. The answer are enclosed within \\boxed{} tags i.e., reasoning process here \\boxed{ answer here }."
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image_path,
            },
            {"type": "text", "text": query},
        ],
    }
]

# Build the prompt text and collect the image/video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was placed on.
inputs = inputs.to(model.device)

# Greedy decoding; raise max_new_tokens if the reasoning is cut off.
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
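Because the system prompt asks for the final answer inside `\boxed{}`, it can be convenient to pull that answer out of the generated text. The helper below is a minimal sketch and not part of the model's released code. The name `extract_boxed` is hypothetical, and it assumes the last `\boxed{...}` in the output holds the final answer.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent.

    Braces are counted so that nested LaTeX such as \\frac{1}{2} is kept whole.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # closing brace never found, e.g. output truncated


# Example with a hand-written string standing in for model output.
print(extract_boxed(r"The area is half the product, so \boxed{ \frac{1}{2} }"))
```

If generation stops before the closing brace (for example, when `max_new_tokens` is too small), the helper returns `None` rather than a partial answer.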
📄 License
The model is released under the Apache-2.0 license.
Citation
@misc{Xu2025tbacvlr1,
  title={TBAC-VLR1-3B-preview},
  author={Junzhe Xu and Yuyang Yin},
  url={https://huggingface.co/TencentBAC/TBAC-VLR1-3B-preview},
  year={2025},
}
About
Created by the Tencent PCG Basic Algorithm Center. All rights reserved.