ViGoRL: Visually Grounded Reinforcement Learning for Visual Reasoning
ViGoRL (Visually Grounded Reinforcement Learning) is a vision-language model for visual reasoning tasks. It is trained with reinforcement learning to anchor its textual reasoning steps to visual coordinates, grounding each step of the reasoning process in the image.
Quick Start
You can load this model with Hugging Face's Transformers library. The following example gets you started:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the model; replace "" with this repository's model ID.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; remove this argument if it is not installed
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("")

# Single-turn conversation: one image plus a text query.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.png",
            },
            {"type": "text", "text": "QUERY HERE"},
        ],
    }
]

# Render the conversation with the chat template and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then drop the prompt tokens so only the new answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
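Because ViGoRL anchors its reasoning to visual coordinates, it can be useful to pull those points out of the generated text. The exact trace format is not documented in this card, so the snippet below is only a sketch that assumes grounded points appear as "(x, y)" integer pairs; adjust the pattern to the model's actual output format.

import re

# Assumption: grounded points appear in the generated text as "(x, y)" integer pairs.
point_pattern = re.compile(r"\((\d+),\s*(\d+)\)")
points = [(int(x), int(y)) for x, y in point_pattern.findall(output_text[0])]
print(points)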
Important Note
This model requires a system prompt for proper usage; see the model's chat template for the exact text.
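As a hedged illustration (the exact system prompt text is defined by the model's chat template and is not reproduced here, so SYSTEM_PROMPT below is a placeholder), a system message can be prepended to the conversation from the Quick Start example like this:

SYSTEM_PROMPT = "..."  # placeholder: copy the system prompt from the model's chat template

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.png"},
            {"type": "text", "text": "QUERY HERE"},
        ],
    },
]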
⨠Features
- Visual Grounding: ViGoRL explicitly anchors textual reasoning steps to visual coordinates, enabling precise visual reasoning.
- Multi - turn Visual Grounding: Inspired by human visual cognition, it can dynamically zoom into image regions for fine - grained reasoning.
- Reinforcement Learning: Trained using RL techniques like Group Relative Policy Optimization (GRPO) for better performance.
Installation
The model is loaded through Hugging Face's Transformers library. Besides transformers, the Quick Start snippet relies on torch, qwen-vl-utils, and accelerate (for device_map="auto"); flash-attn is only needed if you keep attn_implementation="flash_attention_2". Install the packages via pip or another package manager, for example:
pip install transformers torch qwen-vl-utils accelerate
Usage Examples
Basic Usage
See the Quick Start snippet above for a complete single-turn example.
Advanced Usage
Advanced usage can involve customizing input parameters such as min_pixels and max_pixels to balance performance and cost, or modifying the messages structure to suit different types of queries.
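For illustration, and reusing the processor from the Quick Start, the visual token budget per image can be bounded by passing pixel limits to the processor. The values below are example settings seen in typical Qwen2.5-VL usage, not a recommendation from the authors:

# Bound the number of visual tokens per image; 28 * 28 is the processor's patch-grid unit.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "",  # same model ID as in the Quick Start
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)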
Documentation
Model Overview
ViGoRL is a vision-language model fine-tuned using reinforcement learning (RL) to explicitly anchor textual reasoning steps to visual coordinates. Inspired by human visual cognition, ViGoRL employs multi-turn visual grounding, dynamically zooming into image regions to perform fine-grained visual reasoning and grounding.
This model was trained using supervised fine-tuning (SFT) on visually grounded reasoning traces generated via Monte Carlo Tree Search (MCTS), followed by reinforcement learning with Group Relative Policy Optimization (GRPO).
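As a rough, non-authoritative sketch of the group-relative idea behind GRPO (this is the generic advantage normalization GRPO is known for, not the authors' training code), several responses are sampled for the same prompt, each is scored, and every response's advantage is its reward standardized against the group's statistics:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (group_size,), one scalar reward per sampled response to the same prompt.
    # Each response's advantage is its reward standardized against the group mean and std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts for one prompt, rewarded 1.0 when the grounded answer is correct.
advantages = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))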
Model Details
| Property | Details |
|----------|---------|
| Base Architecture | Qwen2.5-Vision-Language (3B or 7B parameters) |
| Training Paradigm | Supervised fine-tuning on MCTS-generated reasoning traces; Group Relative Policy Optimization (GRPO); multi-turn visual grounding with dynamic zoom-in feedback (if "Multiturn" appears in the model name) |
Use Cases
This model excels in visual reasoning tasks that require precise visual grounding and region-level reasoning.
| Use Case | Specific Domains |
|----------|------------------|
| Spatial Reasoning | SAT-2, BLINK, RoboSpatial |
| Visual Search | V*Bench |
| Web Interaction and Grounding | ScreenSpot (Pro and V2), VisualWebArena |
Technical Details
The model was first trained with supervised fine-tuning (SFT) on visually grounded reasoning traces generated via Monte Carlo Tree Search (MCTS), followed by reinforcement learning with Group Relative Policy Optimization (GRPO). The multi-turn visual grounding mechanism, inspired by human visual cognition, lets the model dynamically zoom into image regions for fine-grained reasoning.
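As a hedged sketch of what a zoom-in step could look like at inference time (the crop coordinates, helper function, and message layout below are illustrative assumptions, not the model's actual tool interface), a region the model refers to can be cropped and fed back as an additional turn:

from PIL import Image

def zoom_in(image_path, box, out_path="zoomed.png"):
    # Crop a region of interest given as (left, upper, right, lower) and save it for the next turn.
    Image.open(image_path).crop(box).save(out_path)
    return out_path

# Hypothetical follow-up turn: append the model's previous answer and the cropped region
# (the box here is made up), then re-run the same generation loop as in the Quick Start.
crop_path = zoom_in("path/to/image.png", (100, 200, 400, 500))
messages.append({"role": "assistant", "content": [{"type": "text", "text": output_text[0]}]})
messages.append({"role": "user", "content": [{"type": "image", "image": crop_path}]})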
License
No license information is provided for this model.
Datasets and Training Data
Training datasets and generated reasoning chains are publicly available:
Citation
If you use ViGoRL in your research or applications, please cite our paper:
@article{sarch2025vigorl,
title={Grounded Reinforcement Learning for Visual Reasoning},
author={Sarch, Gabriel and Saha, Snigdha and Khandelwal, Naitik and Jain, Ayush and Tarr, Michael J and Kumar, Aviral and Fragkiadaki, Katerina},
year={2025}
}
Contact
For questions, feedback, or collaborations, please reach out to Gabriel Sarch or open an issue in our GitHub repository.