SpaceThinker-Qwen2.5VL-3B
SpaceThinker-Qwen2.5VL-3B is a reasoning ("thinking") multimodal vision-language model (VLM). It strengthens quantitative spatial reasoning through test-time compute and is trained on synthetic reasoning traces, making it useful for applications such as robotics and visual question answering.
Quick Start
Try the SpaceThinker Space
The hosted SpaceThinker Space provides an interactive demo of the model.
Features
- Multimodal Capability: A multimodal vision-language model that accepts both images and text.
- Enhanced Spatial Reasoning: Improves quantitative spatial reasoning, including distance estimation and understanding of object relations.
- Fine-Tuned Architecture: Fine-tuned from UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B using the LoRA strategy.
Model Information
| Property | Details |
|---|---|
| Model Type | Multimodal, Vision-Language Model |
| Architecture | Qwen2.5-VL-3B |
| Model Size | 3.75B parameters (FP16) |
| Finetuned from | UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B |
| Finetune Strategy | LoRA (Low-Rank Adaptation) |
| License | Apache-2.0 |
Check out the SpaceThinker collection
Installation
Run locally with llama.cpp
To run locally with llama.cpp, install and build this [branch](https://github.com/HimariO/llama.cpp.qwen2.5vl/tree/qwen25-vl) and download the .gguf weights. Then run the following command:
./llama-qwen2vl-cli -m spacethinker-qwen2.5VL-3B-F16.gguf \
  --mmproj spacethinker-qwen2.5vl-3b-vision.gguf \
  --image images/example_1.jpg --threads 24 -ngl 9 \
  -p "Does the man in blue shirt working have a greater \
height compared to the wooden pallet with boxes on floor?"
Run using llama.cpp in Colab
A Colab notebook is also available for running the GGUF weights with llama.cpp.
Run locally using Transformers
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import requests
from io import BytesIO
# Configuration
model_id = "remyxai/SpaceThinker-Qwen2.5VL-3B"
image_path = "images/example_1.jpg"  # local path or URL
prompt = "What can you infer from this image about the environment?"
system_message = (
"You are VL-Thinking đ¤, a helpful assistant with excellent reasoning ability. "
"You should first think about the reasoning process and then provide the answer. "
"Use <think>...</think> and <answer>...</answer> tags."
)
# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)
# Load and preprocess image
if image_path.startswith("http"):
image = Image.open(BytesIO(requests.get(image_path).content)).convert("RGB")
else:
image = Image.open(image_path).convert("RGB")
if image.width > 512:
ratio = image.height / image.width
image = image.resize((512, int(512 * ratio)), Image.Resampling.LANCZOS)
# Format input
chat = [
{"role": "system", "content": [{"type": "text", "text": system_message}]},
{"role": "user", "content": [{"type": "image", "image": image},
{"type": "text", "text": prompt}]}
]
text_input = processor.apply_chat_template(chat, tokenize=False,
add_generation_prompt=True)
# Tokenize
inputs = processor(text=[text_input], images=[image],
                   return_tensors="pt").to(model.device)  # keep inputs on the model's device
# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Response:\n", output)
Documentation
Spatial Reasoning Capabilities
Strong quantitative spatial reasoning is crucial for embodied AI applications such as robotics and drones. SpaceThinker enhances these capabilities using test-time compute and is trained on samples that ground the final response in a consistent explanation of scene observations.
Examples of SpaceThinker
| Prompt: How far is the man in the red hat from the pallet of boxes in feet? | Prompt: How far is the Goal from the penalty taker in this image? | Prompt: Approximately how many meters apart are the chair and bookshelf? |
|---|---|---|
| Reasoning: To determine how far the man in the red hat is from the pallet of boxes, I need to consider the spatial arrangement in the warehouse. The man is walking on the floor, and there's a visible gap between him and the pallet. First, I'll estimate the size of the pallet. A standard pallet is typically around 4 feet wide and 8 feet long. Given that the man is standing near the edge of the pallet, he might be about 2–3 feet away from the front edge of the pallet. Next, I'll consider the distance from the pallet to the man. The man appears to be walking along the aisle, which seems to be about 6–7 feet wide. If he is standing at the edge of the pallet, he would likely be about 2–3 feet from the front edge of the pallet. Combining these observations, the man in the red hat is approximately 2–3 feet from the pallet of boxes. | Reasoning: First, I observe the scene where a player in a red uniform stands near the goal. The goal itself is a standard soccer goalpost, which typically extends 2.5 meters beyond the line of the goalposts. The player appears to be standing close to the goal, but not directly at the goal line. To estimate the distance, I consider that the player's position relative to the goal suggests he is likely standing closer than 2.5 meters away. Given the typical setup of a penalty kick scenario, the player would be positioned further back from the goal than the goal itself. By visual estimation, the player seems to be standing approximately 10–15 meters from the goal. This range aligns with the common distance for a penalty kick, considering the player's stance and the goal's structure. Thus, the distance from the goal to the penalty taker is likely around 10–15 meters. | Reasoning: To determine the distance between the chair and the bookshelf, I'll consider the typical dimensions of these objects. A standard office chair is around 60–70 cm tall, and a bookshelf can be anywhere from 1.2 to 1.8 meters tall. Given that the chair is likely placed on a desk or floor level, the height difference between the chair and the top of the bookshelf would be approximately 1 meter. Assuming the bookshelf is at least 1 meter high, and the chair is at ground level, the vertical distance between them is indeed 1 meter. The horizontal distance might vary depending on the layout, but given the room's dimensions and the placement of the chair relative to the bookshelf, the total distance between the chair and the top of the bookshelf is likely around 1 meter. Therefore, the chair and bookshelf are approximately 1 meter apart. |
| Answer: 2–3 feet | Answer: 10–15 meters | Answer: 1 meter |
SpaceThinker Dataset
The SpaceThinker dataset includes over 12K synthetic spatial reasoning traces. It is designed to evaluate and train the model's spatial reasoning capabilities.
Dataset Summary
- ~12K synthetic spatial reasoning traces
- Question types: spatial relations (distances with units, above, left-of, contains, closest to)
- Format: image (RGB) + question + answer with reasoning traces
- Dataset: remyxai/SpaceThinker
- Code: Synthesize Spatial Reasoning Traces with VQASynth
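To inspect the traces locally, the dataset can be pulled with the Hugging Face `datasets` library. The snippet below is a sketch: the `train` split name is an assumption about the dataset layout, so check the printed schema before relying on specific field names.

```python
from datasets import load_dataset

# Download the SpaceThinker reasoning traces (split name assumed; adjust if needed).
ds = load_dataset("remyxai/SpaceThinker", split="train")

print(ds)              # number of rows and column names
sample = ds[0]
print(sample.keys())   # inspect the actual schema before indexing fields
```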
Training SpaceThinker
PEFT Configuration
- Architecture: Qwen2.5-VL-3B
- Base model: UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B
- Method: LoRA fine-tuning (PEFT)
- LoRA Alpha: 256
- LoRA Rank: 128
- Target Modules: q_proj, v_proj
- Optimizer: AdamW (lr = 2e-5), batch size = 1, epochs = 3
- Max input length: 1024 tokens
You can reproduce LoRA SFT training with the included script:
python train.py
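For reference, the hyperparameters listed above translate roughly into the PEFT configuration below. This is a sketch, not the exact contents of train.py; everything beyond the listed rank, alpha, and target modules (dtype, device placement, task type) is an assumption.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# Load the base model named in the PEFT configuration above.
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LoRA settings mirroring the listed rank, alpha, and target modules.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",   # assumption: task type not stated in the card
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```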
Wandb logs are available here.
Model Evaluation
The [Q-Spatial-Bench dataset](https://huggingface.co/datasets/andrewliao11/Q-Spatial-Bench) is used to evaluate the quantitative spatial reasoning of SpaceThinker.
Evaluation Results
- Default System Prompt:
  - Prompts completed: 93 / 101
  - Correct answers: 30
  - Accuracy: 32.26%
- Prompting for step-by-step reasoning using the [spatial prompt](https://github.com/andrewliao11/Q-Spatial-Bench-code/blob/main/prompt_templates/spatial_prompt_steps.txt) from Q-Spatial-Bench:
  - Correct answers: 53
  - Accuracy: 52.48%
QSpatial++ Comparison Table (4/25/25)
Metric Notes
- Success Rate (%): Higher is better ↑
- Samples Completed: Higher is better ↑
- sMAPE (%): Lower is better ↓
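For context, sMAPE (symmetric mean absolute percentage error) compares a predicted distance against the ground-truth measurement. The sketch below uses the conventional definition; it is not code taken from the evaluation harness.

```python
def smape(pred: float, target: float) -> float:
    """Symmetric mean absolute percentage error for a single prediction, in percent."""
    return 100.0 * abs(pred - target) / ((abs(pred) + abs(target)) / 2)

# Example: predicting 2.5 ft when the ground truth is 3 ft gives ~18.2% sMAPE.
print(smape(2.5, 3.0))
```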
Technical Details
The model builds upon the ideas from SpatialVLM (Chen et al., 2024), introducing synthetic reasoning traces grounded in a 3D scene reconstruction pipeline that uses Molmo, VGGT, and SAM2.
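At a high level, grounding a reasoning trace in a reconstructed scene can be pictured as the pipeline below. This is a conceptual sketch only: the function names are hypothetical placeholders standing in for the Molmo/VGGT/SAM2 stages, not the VQASynth API, and the dummy coordinates are illustrative.

```python
import math

# Hypothetical placeholder stages (not the real VQASynth API).
def locate_objects(image_path):
    """Stand-in for open-vocabulary 2D localization (e.g. with Molmo)."""
    return {"man in red hat": (410, 260), "pallet of boxes": (620, 300)}

def lift_to_3d(points_2d):
    """Stand-in for metric 3D reconstruction and segmentation (e.g. VGGT + SAM2)."""
    return {"man in red hat": (1.2, 0.0, 3.4), "pallet of boxes": (1.9, 0.0, 3.1)}

def build_trace(image_path, obj_a, obj_b):
    """Compose a <think>/<answer> trace grounded in the 3D separation of two objects."""
    pts3d = lift_to_3d(locate_objects(image_path))
    d = math.dist(pts3d[obj_a], pts3d[obj_b])
    return (f"<think>{obj_a} and {obj_b} are localized in 3D; "
            f"their separation is about {d:.1f} m.</think>"
            f"<answer>approximately {d:.1f} meters</answer>")

print(build_trace("images/example_1.jpg", "man in red hat", "pallet of boxes"))
```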
License
This project is licensed under the Apache-2.0 license.
Limitations
- Performance may degrade in cluttered environments or under unusual camera perspectives.
- The model was fine-tuned with synthetic reasoning traces over an internet image dataset.
- Multimodal biases inherent to the base model (Qwen2.5-VL) may persist.
- Not intended for use in safety-critical or legal decision-making.
Important Note
Users are encouraged to evaluate outputs critically and to consider fine-tuning for domain-specific safety and performance. Distances estimated by autoregressive transformers can support higher-order reasoning for planning and behavior, but they are not a substitute for measurements from high-precision sensors, calibrated stereo vision systems, or specialist monocular depth estimation models that offer more accurate, pixel-wise predictions and real-time performance.
Citation
@article{chen2024spatialvlm,
title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
journal = {arXiv preprint arXiv:2401.12168},
year = {2024},
url = {https://arxiv.org/abs/2401.12168},
}
@misc{qwen2.5-VL,
title = {Qwen2.5-VL},
url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
author = {Qwen Team},
month = {January},
year = {2025}
}
@misc{vl-thinking2025,
  title = {SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
  author = {Hardy Chen and others},
  year = {2025}
}







