ViGoRL: Visually Grounded Reinforcement Learning for Visual Reasoning
ViGoRL (Visually Grounded Reinforcement Learning) is a vision-language model for visual reasoning tasks. It is trained with reinforcement learning to anchor its textual reasoning steps to visual coordinates, grounding each step of the reasoning process in the image.
Quick Start
You can load this model with Hugging Face's Transformers library. The following example gets you started:
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the model; replace "" with this repository's model ID.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; remove this argument if it is not installed
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("")

# Single-turn conversation: one image plus a text query.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.png",
            },
            {"type": "text", "text": "QUERY HERE"},
        ],
    }
]

# Render the conversation with the chat template and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then drop the prompt tokens so only the new answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
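Because ViGoRL anchors its reasoning to visual coordinates, it can be useful to pull those points out of the generated text. The exact trace format is not documented in this card, so the snippet below is only a sketch that assumes grounded points appear as "(x, y)" integer pairs; adjust the pattern to the model's actual output format.

import re

# Assumption: grounded points appear in the generated text as "(x, y)" integer pairs.
point_pattern = re.compile(r"\((\d+),\s*(\d+)\)")
points = [(int(x), int(y)) for x, y in point_pattern.findall(output_text[0])]
print(points)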
Important Note
This model requires a system prompt for proper usage; see the model's chat template for the exact text.
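As a hedged illustration (the exact system prompt text is defined by the model's chat template and is not reproduced here, so SYSTEM_PROMPT below is a placeholder), a system message can be prepended to the conversation from the Quick Start example like this:

SYSTEM_PROMPT = "..."  # placeholder: copy the system prompt from the model's chat template

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.png"},
            {"type": "text", "text": "QUERY HERE"},
        ],
    },
]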
⨠Features
- Visual Grounding: ViGoRL explicitly anchors textual reasoning steps to visual coordinates, enabling precise visual reasoning.
- Multi - turn Visual Grounding: Inspired by human visual cognition, it can dynamically zoom into image regions for fine - grained reasoning.
- Reinforcement Learning: Trained using RL techniques like Group Relative Policy Optimization (GRPO) for better performance.
Installation
The model is loaded through Hugging Face's Transformers library. Besides transformers, the Quick Start snippet relies on torch, qwen-vl-utils, and accelerate (for device_map="auto"); flash-attn is only needed if you keep attn_implementation="flash_attention_2". Install the packages via pip or another package manager, for example:
pip install transformers torch qwen-vl-utils accelerate
Usage Examples
Basic Usage
See the Quick Start snippet above for a complete single-turn example.
Advanced Usage
Advanced usage can involve customizing input parameters such as min_pixels and max_pixels to balance performance and cost, or modifying the messages structure to suit different types of queries.
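For illustration, and reusing the processor from the Quick Start, the visual token budget per image can be bounded by passing pixel limits to the processor. The values below are example settings seen in typical Qwen2.5-VL usage, not a recommendation from the authors:

# Bound the number of visual tokens per image; 28 * 28 is the processor's patch-grid unit.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "",  # same model ID as in the Quick Start
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)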
Documentation
Model Overview
ViGoRL is a vision-language model fine-tuned using reinforcement learning (RL) to explicitly anchor textual reasoning steps to visual coordinates. Inspired by human visual cognition, ViGoRL employs multi-turn visual grounding, dynamically zooming into image regions to perform fine-grained visual reasoning and grounding.
This model was trained using supervised fine-tuning (SFT) on visually grounded reasoning traces generated via Monte Carlo Tree Search (MCTS), followed by reinforcement learning with Group Relative Policy Optimization (GRPO).
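As a rough, non-authoritative sketch of the group-relative idea behind GRPO (this is the generic advantage normalization GRPO is known for, not the authors' training code), several responses are sampled for the same prompt, each is scored, and every response's advantage is its reward standardized against the group's statistics:

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (group_size,), one scalar reward per sampled response to the same prompt.
    # Each response's advantage is its reward standardized against the group mean and std.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts for one prompt, rewarded 1.0 when the grounded answer is correct.
advantages = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))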
Model Details
| Property | Details |
|----------|---------|
| Base Architecture | Qwen2.5-Vision-Language (3B or 7B parameters) |
| Training Paradigm | Supervised fine-tuning on MCTS-generated reasoning traces; Group Relative Policy Optimization (GRPO); multi-turn visual grounding with dynamic zoom-in feedback (if "Multiturn" appears in the model name) |
Use Cases
This model excels in visual reasoning tasks that require precise visual grounding and region-level reasoning.
| Use Case | Specific Domains |
|----------|------------------|
| Spatial Reasoning | SAT-2, BLINK, RoboSpatial |
| Visual Search | V*Bench |
| Web Interaction and Grounding | ScreenSpot (Pro and V2), VisualWebArena |
Technical Details
The model was first trained with supervised fine-tuning (SFT) on visually grounded reasoning traces generated via Monte Carlo Tree Search (MCTS), followed by reinforcement learning with Group Relative Policy Optimization (GRPO). The multi-turn visual grounding mechanism, inspired by human visual cognition, lets the model dynamically zoom into image regions for fine-grained reasoning.
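As a hedged sketch of what a zoom-in step could look like at inference time (the crop coordinates, helper function, and message layout below are illustrative assumptions, not the model's actual tool interface), a region the model refers to can be cropped and fed back as an additional turn:

from PIL import Image

def zoom_in(image_path, box, out_path="zoomed.png"):
    # Crop a region of interest given as (left, upper, right, lower) and save it for the next turn.
    Image.open(image_path).crop(box).save(out_path)
    return out_path

# Hypothetical follow-up turn: append the model's previous answer and the cropped region
# (the box here is made up), then re-run the same generation loop as in the Quick Start.
crop_path = zoom_in("path/to/image.png", (100, 200, 400, 500))
messages.append({"role": "assistant", "content": [{"type": "text", "text": output_text[0]}]})
messages.append({"role": "user", "content": [{"type": "image", "image": crop_path}]})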
License
No license information is provided for this model.
Datasets and Training Data
Training datasets and generated reasoning chains are publicly available:
Citation
If you use ViGoRL in your research or applications, please cite our paper:
@article{sarch2025vigorl,
title={Grounded Reinforcement Learning for Visual Reasoning},
author={Sarch, Gabriel and Saha, Snigdha and Khandelwal, Naitik and Jain, Ayush and Tarr, Michael J and Kumar, Aviral and Fragkiadaki, Katerina},
year={2025}
}
Contact
For questions, feedback, or collaborations, please reach out to Gabriel Sarch or open an issue in our GitHub repository.