SpaceQwen2.5-VL-3B-Instruct: Open-Source Multimodal Model for Spatial Reasoning

Developed by remyxai

A multimodal vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct, with a focus on spatial reasoning.

Image-Text-to-Text | English | Open Source License: Apache-2.0 | #Spatial Reasoning #Embodied Intelligence #Multimodal VLM

Downloads 7,446

Release Date: 1/29/2025

Model Overview

This model adds spatial reasoning abilities through LoRA fine-tuning. It handles visual question-answering tasks about spatial relationships between objects, which makes it a candidate for scenarios such as robotic navigation and embodied intelligence.

Model Features

Enhanced Spatial Reasoning

Trained on synthetic data and optimized for spatial reasoning tasks such as distance estimation and orientation judgment.

Multimodal Understanding

Processes image and text inputs together to understand object relationships in visual scenes.

Lightweight Fine-tuning

Fine-tuned efficiently with LoRA, which adds the spatial reasoning behavior while preserving the base model's capabilities.

Model Capabilities

Visual Question Answering

Spatial Relationship Reasoning

Distance Estimation

Object Localization

Multimodal Understanding

Use Cases

Robotic Navigation

Warehouse Environment Navigation

Helps robots understand spatial relationships between objects in warehouse environments.

Answers questions about object positions and distances; accuracy should be verified for each deployment.

Embodied Intelligence

Environmental Interaction

Provides spatial awareness for embodied intelligent agents.

Helps agents interact with their environment more effectively.

🚀 SpaceQwen2.5-VL-3B-Instruct

This multimodal vision-language model enhances spatial reasoning capabilities using data synthesis and expert model pipelines.

✨ Features

Model Type: Multimodal, Vision-Language Model
Architecture: Qwen2.5-VL-3B-Instruct
Model Size: 3.75B parameters (FP16)
Finetuned from: Qwen/Qwen2.5-VL-3B-Instruct
Finetune Strategy: LoRA (Low-Rank Adaptation)
License: Apache-2.0

This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM, which enhances the spatial reasoning of multimodal models. A pipeline of expert models infers the spatial relationships between objects in a scene, and those relationships are used to create a VQA dataset for spatial reasoning.
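
As a rough illustration of the final step of such a pipeline, the sketch below turns two 3D object positions into a distance question-answer pair. The function name `make_distance_qa`, the dictionary layout, the template wording, and the example coordinates are all hypothetical and are not VQASynth's actual API; in a real pipeline the centroids would come from the upstream expert models.

```python
import math

def make_distance_qa(obj_a, obj_b):
    """Build one distance question-answer pair from two 3D centroids (in metres).

    Hypothetical helper for illustration only; not part of VQASynth.
    """
    dist = math.dist(obj_a["centroid"], obj_b["centroid"])  # Euclidean distance
    question = f"How far is the {obj_a['name']} from the {obj_b['name']}?"
    answer = f"The {obj_a['name']} is about {dist:.2f} meters from the {obj_b['name']}."
    return {"question": question, "answer": answer, "distance_m": dist}

# Made-up centroids (x, y, z) in metres, standing in for expert-model output.
pallet = {"name": "pallet", "centroid": (0.0, 0.0, 2.0)}
forklift = {"name": "forklift", "centroid": (3.0, 0.0, 6.0)}

qa = make_distance_qa(pallet, forklift)
print(qa["question"])  # How far is the pallet from the forklift?
print(qa["answer"])    # The pallet is about 5.00 meters from the forklift.
```

Templated pairs like this, generated over many images, are one plausible way to arrive at the image + question + answer format the dataset uses.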

📦 Installation

Transformers

Install qwen dependencies:

pip install "qwen-vl-utils[decord]==0.0.8"

💻 Usage Examples

Basic Usage

To run inference on a sample image:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "remyxai/SpaceQwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("remyxai/SpaceQwen2.5-VL-3B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://raw.githubusercontent.com/remyxai/VQASynth/refs/heads/main/assets/warehouse_sample_2.jpeg",
            },
            {"type": "text", "text": "What is the height of the man in the red hat in feet?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
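
The model replies in free text, so downstream code usually needs to pull a number out of the answer. The following is a minimal sketch of one way to do that, assuming the reply contains a length written as a number followed by a unit; the helper `extract_length_m`, the unit table, and the sample sentence are illustrative assumptions rather than a documented output format.

```python
import re

# Conversion factors to metres for units the model might emit (an assumption).
_UNIT_TO_M = {
    "m": 1.0, "meter": 1.0, "meters": 1.0, "metre": 1.0, "metres": 1.0,
    "cm": 0.01, "centimeter": 0.01, "centimeters": 0.01,
    "ft": 0.3048, "foot": 0.3048, "feet": 0.3048,
    "inch": 0.0254, "inches": 0.0254,
}
_NUMBER_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+)")

def extract_length_m(text):
    """Return the first 'number unit' length found in text, in metres, or None."""
    for value, unit in _NUMBER_UNIT.findall(text):
        factor = _UNIT_TO_M.get(unit.lower())
        if factor is not None:
            return float(value) * factor
    return None

# Hypothetical reply, standing in for output_text[0] from the example above.
reply = "The man in the red hat is about 5.5 feet tall."
height_m = extract_length_m(reply)
print(height_m)
```

Returning `None` when no recognised unit is present lets the caller decide whether to retry with a more specific prompt or discard the answer.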

Advanced Usage

Alternatively, run SpaceQwen2.5-VL-3B-Instruct from the GGUF weights using llama.cpp:

./llama-qwen2vl-cli -m /path/to/SpaceQwen2.5-VL-3B-Instruct/SpaceQwen2.5-VL-3B-Instruct-F16.gguf \
                    --mmproj /path/to/SpaceQwen2.5-VL-3B-Instruct/spaceqwen2.5-vl-3b-instruct-vision.gguf \
                    -p "What's the height of the man in the red hat?" \
                    --image /path/to/warehouse_sample_2.jpeg --threads 24 -ngl 99

📚 Documentation

Dataset & Training

SpaceQwen2.5-VL-3B-Instruct uses LoRA to fine-tune Qwen2.5-VL-3B-Instruct on the OpenSpaces dataset.

Dataset Summary:

~10k synthetic spatial reasoning traces
Question types: spatial relations (distances with units, above, left-of, contains, closest to)
Format: image (RGB) + question + answer
Dataset: OpenSpaces
Code: VQASynth
Reference: SpatialVLM
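
To show how an image + question + answer record could be fed to a chat-style fine-tuning script, here is a small sketch that converts one record into the same message structure the inference example uses. The field names `image`, `question`, and `answer`, the function `to_chat_messages`, and the sample values are assumptions for illustration; check the actual OpenSpaces column names before relying on them.

```python
def to_chat_messages(sample):
    """Convert one image + question + answer record into chat-style messages.

    The field names are hypothetical, not the confirmed OpenSpaces schema.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": sample["image"]},
                {"type": "text", "text": sample["question"]},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["answer"]}],
        },
    ]

# Made-up record for illustration.
sample = {
    "image": "warehouse.jpg",
    "question": "Is the pallet to the left of the forklift?",
    "answer": "Yes, the pallet is to the left of the forklift.",
}
messages = to_chat_messages(sample)
print(messages[0]["content"][1]["text"])
print(messages[1]["content"][0]["text"])
```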

Scripts for LoRA supervised fine-tuning (SFT) are available in trl.
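
For intuition about why LoRA is lightweight: instead of updating a full weight matrix of shape d_out x d_in, it trains two low-rank factors of shapes d_out x r and r x d_in. The sketch below compares the parameter counts for a single layer. The layer dimensions and rank are illustrative assumptions, not the configuration used to train this model.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable parameter counts for one linear layer."""
    full = d_in * d_out           # updating the whole weight matrix
    lora = rank * (d_in + d_out)  # factor A (rank x d_in) plus factor B (d_out x rank)
    return full, lora

# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_in=2048, d_out=2048, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 4,194,304  lora: 65,536  ratio: 1.56%
```

The base weights stay frozen, which is why the fine-tuned model can retain the base model's general capabilities.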

Model Evaluation (Coming Soon)

Stay tuned for the VLMEvalKit QSpatial benchmark

Planned comparisons:

🌋 SpaceLLaVA
🧑‍🏫 SpaceQwen2.5-VL-3B-Instruct
🤖 Related VLMs and VLAs for robotics

You can also try it on Discord or the HF space.

🔧 Technical Details

Limitations & Ethical Considerations

⚠️ Important Note

Performance may degrade in cluttered environments or with unfamiliar camera perspectives.

This model was fine-tuned using synthetic reasoning over an internet image dataset.

Multimodal biases inherent to the base model (Qwen2.5-VL) may persist.

Not intended for use in safety-critical or legal decision-making.

💡 Usage Tip

Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance.
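
One simple way to check quantitative answers against your own ground truth is a ratio tolerance: count an estimate as acceptable when it lies within a chosen factor of the measured value. The sketch below assumes a factor of 2 purely for illustration; the function name `within_factor`, the threshold, and the sample numbers are hypothetical, and a stricter tolerance may be appropriate for your application.

```python
def within_factor(predicted_m, actual_m, factor=2.0):
    """True if predicted_m lies between actual_m / factor and actual_m * factor."""
    if predicted_m <= 0 or actual_m <= 0:
        return False  # non-positive lengths are treated as invalid
    ratio = predicted_m / actual_m
    return 1.0 / factor <= ratio <= factor

# Made-up (predicted, measured) pairs in metres.
pairs = [(1.7, 1.8), (0.4, 1.0), (5.0, 2.4)]
hits = sum(within_factor(p, a) for p, a in pairs)
print(f"{hits}/{len(pairs)} estimates within a factor of 2")
# 1/3 estimates within a factor of 2
```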

📄 License

This project is licensed under the Apache-2.0 license.

Citation

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}
