SpaceThinker-Qwen2.5VL-3B open-source multimodal model: enhanced spatial reasoning and object-relationship analysis

Spacethinker Qwen2.5VL 3B

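Developed by remyxai

SpaceThinker is a multimodal vision-language model that uses test-time compute to strengthen spatial reasoning. It is particularly strong at quantitative spatial reasoning and object-relationship analysis.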

Image-text-to-text · English · License: Apache-2.0 · #spatial-distance-estimation #multimodal-reasoning #embodied-AI-navigation

Downloads: 490

Released: 4/17/2025

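Model overview

A vision-language model fine-tuned on the Qwen2.5-VL-3B architecture and focused on spatial reasoning, intended for embodied AI applications that need precise spatial understanding and planning.

Model features

Enhanced spatial reasoning

Uses test-time compute to improve quantitative reasoning about distances, sizes, and object relationships

Multimodal understanding

Processes image and text inputs together for complex vision-language reasoning

Optimized for embodied AI

Well suited to robots, drones, and other applications that require spatial planning and navigation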

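Model capabilities

Quantitative spatial reasoning

Distance estimation

Object-relationship analysis

Visual question answering

3D scene understanding

Multimodal reasoning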

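Use cases

Robot navigation

Spatial analysis of the environment

Helps a robot understand the spatial relationships between objects around it

Improves navigation and obstacle avoidance

Drone applications

Aerial distance estimation

Estimates the distance between a drone and objects on the ground or in the air

Improves flight safety and mission planning

Augmented reality

Virtual object placement

Analyzes the spatial properties of a real scene to place virtual objects plausibly

Makes the AR experience more realistic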

🚀 SpaceThinker-Qwen2.5VL-3B

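SpaceThinker-Qwen2.5VL-3B is a multimodal vision-language model (VLM) focused on thinking and reasoning. Fine-tuning the base model on synthetic reasoning traces strengthens its spatial reasoning, which makes it applicable to embodied AI tasks that require spatial planning and navigation.

🚀 Quick start

Try it online: SpaceThinker has an online demo.
Run it locally:
- **Using llama.cpp**: install and build the llama.cpp branch linked in the original model card, then download the .gguf weights linked there.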

./llama-qwen2vl-cli -m spacethinker-qwen2.5VL-3B-F16.gguf \
  --mmproj spacethinker-qwen2.5vl-3b-vision.gguf \
  --image images/example_1.jpg --threads 24 -ngl 9 \
  -p "Does the man in blue shirt working have a greater height compared to the wooden pallet with boxes on floor?"

- **Using llama.cpp in Colab**: open [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_ShhJAqnac8L4N9o1YNdsxCksSLJCrU7?usp=sharing) to run it in Colab.
- **Using Transformers**:

import torch
import requests
from io import BytesIO
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Configuration
model_id = "remyxai/SpaceThinker-Qwen2.5VL-3B"
image_path = "images/example_1.jpg"  # local path or http(s) URL
prompt = "What can you infer from this image about the environment?"
system_message = (
  "You are VL-Thinking 🤔, a helpful assistant with excellent reasoning ability. "
  "You should first think about the reasoning process and then provide the answer. "
  "Use <think>...</think> and <answer>...</answer> tags."
)

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

# Load the image from a URL or from a local file
if image_path.startswith("http"):
    response = requests.get(image_path, timeout=30)
    response.raise_for_status()
    image = Image.open(BytesIO(response.content)).convert("RGB")
else:
    image = Image.open(image_path).convert("RGB")

# Downscale images wider than 512 px, preserving the aspect ratio
if image.width > 512:
    ratio = image.height / image.width
    image = image.resize((512, int(512 * ratio)), Image.Resampling.LANCZOS)

# Format input as a system turn plus a user turn (image + question)
chat = [
    {"role": "system", "content": [{"type": "text", "text": system_message}]},
    {"role": "user", "content": [{"type": "image", "image": image},
                                 {"type": "text", "text": prompt}]}
]
text_input = processor.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# Tokenize and move the tensors to the device the model was placed on
inputs = processor(
    text=[text_input], images=[image], return_tensors="pt"
).to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is dropped)
generated_ids = model.generate(**inputs, max_new_tokens=1024)
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]

print("Response:\n", output)

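✨ Key features

Multimodal vision-language fusion: processes image and text inputs together.
Strong spatial reasoning: trained for quantitative spatial reasoning, such as estimating distances and judging spatial relationships between objects.
Test-time compute: uses test-time compute to improve performance on spatial reasoning tasks.

📦 Installation

Using llama.cpp

Install and build the llama.cpp branch linked in the original model card, then download the .gguf weights linked there.

Using Transformers

Make sure torch, transformers, Pillow, requests, and any other required libraries are installed.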

💻 Usage examples

Basic usage

The basic example is the Transformers script shown under Quick start: it loads the model and processor, downscales images wider than 512 pixels, and generates a response of up to 1024 new tokens.

Advanced usage

In practice, adjust the input image, the prompt, and the system message to suit the spatial reasoning task at hand.
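One way to make those adjustments repeatable is a small helper that builds the chat structure used by the Transformers example. This is a minimal sketch: `build_chat` and its `unit` argument are hypothetical names introduced here for illustration and are not part of the model card or of any library.

```python
from typing import Any, Dict, List, Optional

# System message copied from the Transformers example.
DEFAULT_SYSTEM_MESSAGE = (
    "You are VL-Thinking 🤔, a helpful assistant with excellent reasoning ability. "
    "You should first think about the reasoning process and then provide the answer. "
    "Use <think>...</think> and <answer>...</answer> tags."
)


def build_chat(
    image: Any,
    question: str,
    system_message: str = DEFAULT_SYSTEM_MESSAGE,
    unit: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Build a system + user chat list in the shape the example passes to
    processor.apply_chat_template. `unit` (hypothetical) appends a request
    for a specific unit to the question."""
    if unit:
        question = f"{question.rstrip()} Answer in {unit}."
    return [
        {"role": "system", "content": [{"type": "text", "text": system_message}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ],
        },
    ]
```

The returned list can replace the hand-written `chat` variable, so a different question or system message only changes the call site.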

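📚 Documentation

Model overview

SpaceThinker-Qwen2.5VL-3B is a multimodal vision-language model obtained by fine-tuning UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B on synthetic reasoning traces generated by the VQASynth pipeline.

| Property | Details |
|---|---|
| Model type | Multimodal vision-language model |
| Architecture | `Qwen2.5-VL-3B` |
| Model size | 3.75B parameters (FP16) |
| Fine-tuned from | `UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B` |
| Fine-tuning strategy | LoRA (Low-Rank Adaptation) |
| License | Apache-2.0 |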

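Spatial reasoning

Strong quantitative spatial reasoning is essential for embodied AI applications that must plan and navigate in 3D space, such as robots and drones. SpaceThinker is trained for test-time compute: in its training samples, the final response is grounded in a consistent explanation of what is observed in the scene.

Enhanced quantitative spatial reasoning: estimates spatial quantities such as distances and sizes.
Grounded object-relationship judgments: for example, whether one object is above or below, left or right of, or contained in another.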

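Examples

Example 1
Prompt: How far is the man in the red hat from the pallet of boxes in feet?
Reasoning: To determine how far the man in the red hat is from the pallet of boxes, I need to consider the spatial layout of the warehouse. The man is walking across the floor, and there is a visible gap between him and the pallet. First, I estimate the size of the pallet. A standard pallet is typically about 4 feet wide and 8 feet long. Given that the man is standing near the edge of the pallet, he might be about 2-3 feet from its front edge. Next, I consider the distance from the pallet to the man. The man appears to be walking along an aisle that looks about 6-7 feet wide. If he is standing at the edge of the pallet, he is probably about 2-3 feet from its front edge. Combining these observations, the man in the red hat is approximately 2-3 feet from the pallet of boxes.
Answer: 2-3 feet

Example 2
Prompt: How far is the goal from the penalty taker in this image?
Reasoning: First, I observe a scene in which a player in a red uniform stands near the goal. The goal itself is a standard soccer goalpost, which typically extends 2.5 meters beyond the goal line. The player appears to be standing close to the goal, but not directly on the goal line. To estimate the distance, I consider that the player's position relative to the goal suggests he may be standing less than 2.5 meters from it. Given the typical setup of a penalty kick, the player would be farther back than the goal itself. By visual estimate, the player appears to be about 10-15 meters from the goal. This range matches the usual distance for a penalty kick, considering the player's stance and the structure of the goal. Therefore, the distance from the goal to the penalty taker is probably about 10-15 meters.
Answer: 10-15 meters

Example 3
Prompt: Approximately how many meters apart are the chair and the bookshelf?
Reasoning: To determine the distance between the chair and the bookshelf, I consider the typical dimensions of these objects. A standard office chair is about 60-70 cm tall, and a bookshelf can be anywhere from 1.2 to 1.8 meters tall. Given that the chair may be placed at a desk or on the floor, the height difference between the chair and the top of the bookshelf is about 1 meter. Assuming the bookshelf is at least 1 meter tall and the chair is on the ground, the vertical distance between them is indeed 1 meter. The horizontal distance may vary with the layout, but considering the dimensions of the room and the chair's position relative to the bookshelf, the total distance between the chair and the top of the bookshelf is probably about 1 meter. Therefore, the chair and the bookshelf are approximately 1 meter apart.
Answer: 1 meter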
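Answers like these mix units and may be a single value or a range, so downstream code usually has to normalize them before comparing or planning. The following is a minimal sketch of one way to do that; `parse_distance_m`, the regular expression, and the list of recognized unit spellings are all assumptions made for illustration, not something the model card specifies.

```python
import re
from typing import Optional, Tuple

# Conversion factors to meters. The set of unit spellings is an assumption.
_TO_METERS = {
    "feet": 0.3048, "foot": 0.3048, "ft": 0.3048,
    "meters": 1.0, "meter": 1.0, "m": 1.0,
    "centimeters": 0.01, "centimeter": 0.01, "cm": 0.01,
    "inches": 0.0254, "inch": 0.0254,
}

# A number, an optional "- number" or "to number" range part, then a unit word.
_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:(?:-|to)\s*(\d+(?:\.\d+)?))?\s*([a-zA-Z]+)"
)


def parse_distance_m(answer: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) in meters for the first quantity with a known
    unit, or None if the text contains no such quantity."""
    for match in _PATTERN.finditer(answer):
        unit = match.group(3).lower()
        if unit not in _TO_METERS:
            continue
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        scale = _TO_METERS[unit]
        return (low * scale, high * scale)
    return None
```

A single value is returned as a degenerate range, so callers can treat every answer as an interval.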

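Dataset

The SpaceThinker dataset contains about 12K synthetic spatial reasoning traces, synthesized by VQASynth on a subset of images from the localized narratives split of the cauldron.

| Property | Details |
|---|---|
| Number of synthetic traces | About 12K synthetic spatial reasoning traces |
| Question types | Spatial relationships (distance (with units), above, left of, contains, closest to, and so on) |
| Data format | Image (RGB) + question + answer with reasoning trace |
| Dataset link | remyxai/SpaceThinker |
| Code link | Synthetize Spatial Reasoning Traces with VQASynth |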

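Training configuration

PEFT configuration

| Property | Details |
|---|---|
| Architecture | Qwen2.5-VL-3B |
| Base model | UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B |
| Fine-tuning method | LoRA fine-tuning (PEFT) |
| LoRA alpha | 256 |
| LoRA rank | 128 |
| Target modules | q_proj, v_proj |
| Optimizer | AdamW (learning rate 2e-5), batch size 1, 3 epochs |
| Max input length | 1024 tokens |

The LoRA SFT training run can be reproduced with the following script:

python train.py

Wandb logs for this run are available via the link in the original model card.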

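Evaluation

SpaceThinker is evaluated on the Q-Spatial-Bench dataset, which contains hundreds of high-precision visual question answering samples for assessing the quantitative spatial reasoning of vision-language models.

Default system prompt: 93 of 101 prompts completed, 30 correct answers, 32.26% accuracy (30 of the 93 completed prompts).
Step-by-step reasoning prompt (the spatial prompt from Q-Spatial-Bench): 53 correct answers, 52.48% accuracy (53 of 101).

Using the spatial prompt increases the number of correct answers and the overall accuracy, and also raises the task completion rate.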
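The two accuracy figures are computed over different denominators, which is easy to miss. The arithmetic below reproduces both and adds the default-prompt accuracy over all 101 prompts for a like-for-like comparison. It assumes the 52.48% figure is 53 correct out of all 101 prompts, which the numbers imply but the text does not state; `accuracy_pct` is a name introduced here.

```python
def accuracy_pct(correct: int, denominator: int) -> float:
    """Percentage of correct answers, rounded to two decimals."""
    return round(100.0 * correct / denominator, 2)


# Default system prompt: 30 correct among the 93 completed prompts.
default_prompt = accuracy_pct(30, 93)

# Spatial prompt: 53 correct, assumed to be out of all 101 prompts.
spatial_prompt = accuracy_pct(53, 101)

# Default prompt re-scored over all 101 prompts, counting the
# 8 uncompleted prompts as incorrect.
default_over_all = accuracy_pct(30, 101)

print(default_prompt, spatial_prompt, default_over_all)
```

On the common denominator of 101 prompts, the default prompt scores 29.70%, so the gap to the spatial prompt is slightly larger than the reported figures suggest.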

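QSpatial++ comparison table (4/25/25)

| Model | SpaceThinker-Qwen2.5VL-3B | gpt-4o | gemini-2.5-pro-preview-03-25 |
|---|---|---|---|
| Success rate (%) ↑ | 55 | 43 | 52 |
| Samples completed ↑ | 99 / 100 | 95 / 100 | 99 / 100 |
| Symmetric mean absolute percentage error, sMAPE (%) ↓ | 66 | 71 | 62 |

Metric notes

Success rate (%): higher is better.
Samples completed: higher is better.
Symmetric mean absolute percentage error (%): lower is better.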
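For readers unfamiliar with the last metric, here is a minimal sketch of one common sMAPE definition, the mean of |p - a| / ((|p| + |a|) / 2) expressed as a percentage. Several sMAPE variants exist and the text does not say which one the comparison uses, so treat this as an illustration of the idea, not the benchmark's scoring code; `smape_pct` is a name introduced here.

```python
from typing import Sequence


def smape_pct(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Symmetric mean absolute percentage error, in percent.

    Uses the variant whose per-sample term is |p - a| / ((|p| + |a|) / 2),
    so the result lies between 0 and 200.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        denom = (abs(p) + abs(a)) / 2.0
        # Both values zero means a perfect prediction; avoid dividing by zero.
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100.0 * total / len(predicted)
```

Because the error is normalized by the average of the prediction and the ground truth, overestimating and underestimating by the same factor are penalized equally.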

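🔧 Technical details

Model architecture

Built on the Qwen2.5-VL-3B architecture, with the base model UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B adapted through LoRA fine-tuning.

Training process

The model is trained on synthetic spatial reasoning traces with LoRA fine-tuning, using LoRA alpha = 256, rank = 128, and target modules q_proj and v_proj. The optimizer is AdamW with a learning rate of 2e-5, a batch size of 1, and 3 training epochs.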
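To show what these LoRA settings imply, the sketch below records them in a small dataclass and derives two quantities from the standard LoRA formulation: the scaling factor alpha / rank applied to the low-rank update, and the number of trainable parameters added per adapted linear layer. `LoraSettings` is a hypothetical container written for this illustration, and the 2048 x 2048 layer shape is an example value, not a figure taken from the text. In the PEFT library the same values would go into `LoraConfig` as `r`, `lora_alpha`, and `target_modules`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoraSettings:
    """LoRA hyperparameters as listed in the training configuration."""
    rank: int = 128
    alpha: int = 256
    target_modules: Tuple[str, ...] = ("q_proj", "v_proj")

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update B @ A by alpha / rank.
        return self.alpha / self.rank

    def adapter_params(self, in_features: int, out_features: int) -> int:
        # A has shape (rank, in_features); B has shape (out_features, rank).
        return self.rank * (in_features + out_features)


settings = LoraSettings()

# Hypothetical layer shape, chosen only to make the arithmetic concrete.
example_params = settings.adapter_params(in_features=2048, out_features=2048)

print(settings.scaling, example_params)
```

With alpha set to twice the rank, the adapter update is scaled by 2.0, and each adapted 2048 x 2048 projection in this example would add 524,288 trainable parameters.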

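Inference mechanism

At inference time the model combines image and text inputs and reasons about space using test-time compute, organizing its reasoning and its final answer with <think>...</think> and <answer>...</answer> tags.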
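A caller typically wants the reasoning and the final answer as separate strings. The sketch below splits a decoded response on those tags; `split_response` is a hypothetical helper, not part of the model or of Transformers. It takes the last occurrence of each tag pair because the system message itself contains the literal text `<think>...</think>`, which would otherwise match first if the decoded output still includes the prompt.

```python
import re
from typing import Optional, Tuple

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_response(text: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from a decoded model response.

    The last match of each tag pair is used. If no <answer> tag is
    present, the whole stripped text is returned as the answer.
    """
    thinks = _THINK.findall(text)
    answers = _ANSWER.findall(text)
    reasoning = thinks[-1].strip() if thinks else None
    answer = answers[-1].strip() if answers else text.strip()
    return reasoning, answer
```

Falling back to the raw text keeps the helper usable when the model omits the tags, which a caller can detect by checking whether the reasoning is `None`.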

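📄 License

This project is licensed under Apache-2.0.

⚠️ Limitations

Environment sensitivity: performance may degrade in cluttered environments or under certain camera viewpoints.
Data limitations: the model was fine-tuned with synthetic reasoning on a dataset of internet images, which may introduce bias.
Base model bias: multimodal biases inherent in the base model (Qwen2.5-VL) may persist.
Scope: not suitable for safety-critical or legal decision-making.

Users are advised to evaluate model outputs critically and to consider domain-specific fine-tuning to improve safety and performance. Distances estimated by an autoregressive transformer can support higher-order reasoning for planning and behavior, but they do not replace measurements from high-precision sensors, calibrated stereo vision systems, or specialized monocular depth estimation models capable of more accurate pixel-level predictions and real-time performance.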

📜 Citation

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@misc{qwen2.5-VL,
  title = {Qwen2.5-VL},
  url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month = {January},
  year = {2025}
}

@misc{vl-thinking2025,
  title={SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
  author={Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/UCSC-VLAA/VLAA-Thinking}},
}

@inproceedings{liaos2024reasoning,
  title={Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models},
  author={Yuan-Hong Liao and Rafid Mahmood and Sanja Fidler and David Acuna},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024},
  url={https://arxiv.org/abs/2409.09788},
}