ViGoRL 7B Spatial

Developed by gsarch
ViGoRL is a vision-language model fine-tuned with reinforcement learning to explicitly anchor textual reasoning steps to visual coordinates, enabling precise visual reasoning and grounding.
Downloads 319
Release Time: 6/19/2025

Model Overview

ViGoRL is a vision-language model fine-tuned with reinforcement learning (RL) to explicitly anchor textual reasoning steps to visual coordinates. Inspired by human visual cognition, ViGoRL uses multi-turn visual grounding, dynamically zooming into image regions to perform fine-grained visual reasoning and grounding.
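
The region-selection step described above can be illustrated with a minimal sketch. This is not the model's actual implementation: it assumes a reasoning step cites points as `(x, y)` pixel pairs, and the helper names `extract_points` and `zoom_box`, the regular expression, and the 200x200 crop size are all hypothetical choices made for illustration.

```python
import re

# Assumption: a reasoning step cites a point as "(x, y)" in pixel coordinates.
POINT_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def extract_points(step: str) -> list[tuple[int, int]]:
    """Return every (x, y) pixel pair mentioned in a reasoning step."""
    return [(int(x), int(y)) for x, y in POINT_RE.findall(step)]


def zoom_box(point, image_size, crop_size=(200, 200)):
    """Return a (left, top, right, bottom) crop box centred on `point`,
    shifted as needed so it stays inside the image."""
    (x, y), (width, height) = point, image_size
    crop_w = min(crop_size[0], width)
    crop_h = min(crop_size[1], height)
    # Clamp the top-left corner so the box never leaves the image.
    left = min(max(x - crop_w // 2, 0), width - crop_w)
    top = min(max(y - crop_h // 2, 0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


# Hypothetical reasoning step over a 640x480 image.
step = "The mug at (620, 40) is left of the laptop at (300, 250)."
points = extract_points(step)
boxes = [zoom_box(p, image_size=(640, 480)) for p in points]
```

Each resulting box is a 4-tuple in the left, top, right, bottom order, so it could be handed to an image library's crop routine (for example Pillow's `Image.crop`) to produce the zoomed-in view for the next turn.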

Model Features

Multi-turn visual grounding
Inspired by human visual cognition, ViGoRL grounds its reasoning over multiple turns, dynamically zooming into image regions to perform fine-grained visual reasoning and grounding.
Precise visual reasoning
The model is designed for visual reasoning tasks that require precise visual grounding and region-level reasoning.
Multiple training paradigms
The model is first trained with supervised fine-tuning (SFT) on visually grounded reasoning trajectories generated by Monte Carlo Tree Search (MCTS), then further trained with reinforcement learning using Group Relative Policy Optimization (GRPO).
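
As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes each sampled response's reward against the other responses drawn for the same prompt. It is a simplified sketch, not the training code used for this model: the function name `group_relative_advantages`, the reward values, and the use of the population standard deviation with a small `eps` are assumptions, and real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and standard deviation
    of its own group (responses sampled for the same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value network is needed to supply a baseline.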

Model Capabilities

Visual reasoning
Visual grounding
Multi-turn interaction
Dynamic zooming into image regions

Use Cases

Spatial reasoning
SAT-2
Used for spatial reasoning tasks
BLINK
Used for spatial reasoning tasks
RoboSpatial
Used for spatial reasoning tasks
Visual search
V*Bench
Used for visual search tasks
Web interaction and grounding
ScreenSpot (Pro and V2)
Used for web interaction and grounding tasks
VisualWebArena
Used for web interaction and grounding tasks