SpaceQwen2.5-VL-3B-Instruct: a free, open-source multimodal model for improved spatial reasoning

SpaceQwen2.5-VL-3B-Instruct

Developed by remyxai

A multimodal vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct, focused on spatial reasoning

Tags: Image-text-to-text · English · License: Apache-2.0 · #SpatialReasoning #EmbodiedAI #MultimodalVLM

Downloads: 7,446

Released: 1/29/2025

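Model Overview

The model's spatial reasoning is strengthened through LoRA fine-tuning. It handles visual question answering about spatial relationships between objects and is suited to scenarios such as robot navigation and embodied AI.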

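Model Features

Enhanced spatial reasoning: Trained on synthetic data to improve spatial reasoning skills such as distance estimation and direction judgment.
Multimodal understanding: Accepts image and text inputs together and interprets the relationships between objects in a visual scene.
Lightweight fine-tuning: Uses LoRA for efficient fine-tuning, adding task-specific capability while preserving the base model's abilities.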

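Model Capabilities

Visual question answering
Spatial relationship reasoning
Distance estimation
Object localization
Multimodal understanding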

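Use Cases

Robot navigation
Warehouse navigation: Helps a robot understand the spatial relationships between objects in a warehouse. Aims to answer questions about object position and distance.

Embodied AI
Environment interaction: Gives embodied agents spatial awareness so that they can interact with their environment more effectively.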

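🚀 SpaceQwen2.5-VL-3B-Instruct

SpaceQwen2.5-VL-3B-Instruct is a multimodal vision-language model fine-tuned from Qwen2.5-VL-3B-Instruct. It uses data synthesis and publicly available models to strengthen the spatial reasoning of multimodal models, so that it can infer spatial relationships between objects in a scene.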

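🚀 Quick Start

Install Dependencies

Transformers

Install the Qwen dependencies (the quotes keep shells such as zsh from expanding the square brackets):

pip install "qwen-vl-utils[decord]==0.0.8"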
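To confirm that the pinned package was actually installed, a small check along these lines may help. It is only a sketch that uses the standard library; the helper name `installed_version` is made up for this example and is not part of any Qwen tooling.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages that the inference example depends on.
for name in ("qwen-vl-utils", "transformers"):
    print(name, installed_version(name) or "not installed")
```

If `qwen-vl-utils` is reported with a version other than 0.0.8, the install command above did not take effect in the active environment.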

Run Inference

Transformers

Run inference on a sample image:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "remyxai/SpaceQwen2.5-VL-3B-Instruct"

# Load the fine-tuned model; device_map="auto" places the weights on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single user turn containing an image and a spatial question about it.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://raw.githubusercontent.com/remyxai/VQASynth/refs/heads/main/assets/warehouse_sample_2.jpeg",
            },
            {"type": "text", "text": "What is the height of the man in the red hat in feet?"},
        ],
    }
]

# Preparation for inference: build the chat prompt and extract the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the device the model was loaded on instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Inference: generate, then drop the prompt tokens so only the new answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
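The model answers in free text, so downstream code (a robot planner, for example) usually needs a number rather than a sentence. The sketch below pulls the first measurement out of an answer and converts it to meters. The wording of the sample answer is an assumption made for illustration, not real model output, and `first_measurement_in_meters` is a hypothetical helper.

```python
import re
from typing import Optional

# Conversion factors to meters for the units this sketch recognizes.
_TO_METERS = {
    "feet": 0.3048, "foot": 0.3048, "ft": 0.3048,
    "inches": 0.0254, "inch": 0.0254,
    "meters": 1.0, "meter": 1.0, "m": 1.0,
    "centimeters": 0.01, "centimeter": 0.01, "cm": 0.01,
}

# Longest unit names first, so "meters" is tried before "m".
_UNITS = "|".join(sorted(_TO_METERS, key=len, reverse=True))
_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(" + _UNITS + r")\b", re.IGNORECASE)


def first_measurement_in_meters(answer: str) -> Optional[float]:
    """Return the first '<number> <unit>' found in `answer`, in meters, else None."""
    match = _PATTERN.search(answer)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    return value * _TO_METERS[unit]


# Hypothetical answer text; real model output may be phrased differently.
sample_answer = "The man in the red hat is approximately 6 feet tall."
print(first_measurement_in_meters(sample_answer))
```

Returning `None` when nothing parses lets the caller tell "no measurement found" apart from a measurement of zero.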

GGUF

Run SpaceQwen2.5-VL-3B-Instruct with llama.cpp:

./llama-qwen2vl-cli -m /path/to/SpaceQwen2.5-VL-3B-Instruct/SpaceQwen2.5-VL-3B-Instruct-F16.gguf \
                    --mmproj /path/to/SpaceQwen2.5-VL-3B-Instruct/spaceqwen2.5-vl-3b-instruct-vision.gguf \
                    -p "What's the height of the man in the red hat?" \
                    --image /path/to/warehouse_sample_2.jpeg --threads 24 -ngl 99
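When the CLI is called from a script, building the argument list in Python avoids shell-quoting problems with prompts that contain apostrophes. The sketch below assembles the same command as above, using only the flags that appear in it. The function name `build_llama_cli_command` is made up for this example, and the `/path/to/...` placeholders are left as they are.

```python
import shlex
from typing import List


def build_llama_cli_command(
    model: str,
    mmproj: str,
    image: str,
    prompt: str,
    threads: int = 24,
    gpu_layers: int = 99,
    binary: str = "./llama-qwen2vl-cli",
) -> List[str]:
    """Assemble the llama.cpp command as an argument list (no shell involved)."""
    return [
        binary,
        "-m", model,
        "--mmproj", mmproj,
        "-p", prompt,
        "--image", image,
        "--threads", str(threads),
        "-ngl", str(gpu_layers),
    ]


cmd = build_llama_cli_command(
    model="/path/to/SpaceQwen2.5-VL-3B-Instruct/SpaceQwen2.5-VL-3B-Instruct-F16.gguf",
    mmproj="/path/to/SpaceQwen2.5-VL-3B-Instruct/spaceqwen2.5-vl-3b-instruct-vision.gguf",
    image="/path/to/warehouse_sample_2.jpeg",
    prompt="What's the height of the man in the red hat?",
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the list to `subprocess.run` without `shell=True` hands the prompt to the binary unchanged.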

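✨ Key Features

Multimodal processing: Combines visual and language information for richer scene understanding.
Spatial reasoning: Infers spatial relationships between objects in a scene, such as distance and position.
Data synthesis: Uses data synthesis to create a VQA dataset for spatial reasoning.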

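📚 Detailed Documentation

Model Overview

This model uses data synthesis and publicly available models to reproduce the work described in SpatialVLM and strengthen the spatial reasoning of multimodal models. A pipeline of expert models infers the spatial relationships between objects in a scene, and those relationships are used to create a VQA dataset for spatial reasoning.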
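As a rough illustration of the last step, the sketch below turns two objects with known 3D positions into templated question-answer pairs. It is not the VQASynth implementation: it assumes the expert models have already produced a caption and a metric centroid for each object, and the class, function names, coordinate convention, and templates are all invented for this example.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SceneObject:
    caption: str                             # e.g. produced by a captioning model
    centroid_m: Tuple[float, float, float]   # (x, y, z) in meters; +x is assumed to point right


def distance_qa(a: SceneObject, b: SceneObject) -> Tuple[str, str]:
    """Create a distance question and its answer from two object centroids."""
    distance = math.dist(a.centroid_m, b.centroid_m)
    question = f"How far is {a.caption} from {b.caption}?"
    answer = f"{a.caption.capitalize()} is about {distance:.2f} meters from {b.caption}."
    return question, answer


def left_of_qa(a: SceneObject, b: SceneObject) -> Tuple[str, str]:
    """Create a left/right question by comparing x coordinates."""
    question = f"Is {a.caption} to the left of {b.caption}?"
    if a.centroid_m[0] < b.centroid_m[0]:
        answer = f"Yes, {a.caption} is to the left of {b.caption}."
    else:
        answer = f"No, {a.caption} is not to the left of {b.caption}."
    return question, answer


pallet = SceneObject("the pallet", (0.0, 0.0, 0.0))
forklift = SceneObject("the forklift", (3.0, 4.0, 0.0))
print(distance_qa(pallet, forklift))
print(left_of_qa(pallet, forklift))
```

The point of the sketch is that once object positions are available in metric units, question-answer pairs can be generated from geometry alone, without manual labeling.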

Dataset and Training

SpaceQwen2.5-VL-3B-Instruct is Qwen2.5-VL-3B-Instruct fine-tuned with LoRA on the OpenSpaces dataset.

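Dataset summary:

~10k synthetic spatial reasoning traces
Question types: spatial relations (distance (with units), above, left-of, contains, closest-to)
Format: image (RGB) + question + answer
Dataset: OpenSpaces
Code: VQASynth
Reference: SpatialVLM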
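The "image + question + answer" format maps naturally onto the chat-message structure used for Qwen2.5-VL inference. The sketch below shows one way to represent a sample and convert it into a user turn plus an assistant turn for supervised fine-tuning. The field names are assumptions (the actual OpenSpaces column names are not given here), and the sample values are placeholders.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class SpatialVQASample:
    image: str      # path or URL of an RGB image
    question: str
    answer: str


def to_chat_messages(sample: SpatialVQASample) -> List[Dict[str, Any]]:
    """Convert one sample into a two-turn conversation (user asks, assistant answers)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": sample.image},
                {"type": "text", "text": sample.question},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample.answer}],
        },
    ]


# Placeholder values for illustration only; not taken from the dataset.
sample = SpatialVQASample(
    image="/path/to/warehouse_sample_2.jpeg",
    question="How far is the pallet from the forklift?",
    answer="Roughly 2 meters.",
)
messages = to_chat_messages(sample)
print(messages[0]["content"][1]["text"], "->", messages[1]["content"][0]["text"])
```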

The LoRA SFT script is available in trl.
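To see why LoRA counts as lightweight, it helps to count parameters. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in, so the trainable count is r * (d_in + d_out). The dimensions and rank below are illustrative assumptions, not the settings used to train this model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one matrix: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical projection size and rank, chosen only to make the arithmetic concrete.
d_out, d_in, rank = 2048, 2048, 16
dense = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"dense: {dense:,}  lora: {lora:,}  fraction: {lora / dense:.4%}")
```

For these assumed sizes the adapter trains about 1.6% as many parameters as the dense matrix it modifies, and the original weights stay untouched.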

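Model Evaluation (coming soon)

Watch for the QSpatial benchmark in VLMEvalKit.

Planned comparisons:

🌋 SpaceLLaVA
🧑‍🏫 SpaceQwen2.5-VL-3B-Instruct
🤖 Related VLMs and VLAs for robotics

You can also try the model on Discord or in the HF Space.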

🔧 Technical Details

Property	Details
Model type	Multimodal vision-language model
Architecture	Qwen2.5-VL-3B-Instruct
Model size	3.75B parameters (FP16)
Fine-tuned from	Qwen/Qwen2.5-VL-3B-Instruct
Fine-tuning strategy	LoRA (Low-Rank Adaptation)
License	Apache-2.0
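The parameter count and precision give a quick lower bound on the memory needed to load the model. The sketch below does that arithmetic for 3.75B parameters at 2 bytes each (FP16). It covers the weights only; activations, the KV cache, and framework overhead are extra and are not estimated here.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Memory needed to hold the weights alone, in bytes."""
    return num_params * bytes_per_param


NUM_PARAMS = 3_750_000_000   # 3.75B parameters, from the table above
FP16_BYTES = 2               # FP16 stores each parameter in 2 bytes

total = weight_bytes(NUM_PARAMS, FP16_BYTES)
print(f"{total / 1e9:.2f} GB ({total / 2**30:.2f} GiB) for the weights alone")
```

At FP16 the weights take about 7.5 GB, which is a floor on the accelerator memory needed for unquantized inference.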

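⚠️ Limitations and Ethical Considerations

⚠️ Important

Performance may degrade in cluttered environments or from certain camera viewpoints.

The model was fine-tuned on synthetic reasoning over a dataset of internet images.

Multimodal biases inherent in the base model (Qwen2.5-VL) may persist.

The model should not be used for safety-critical or legal decisions.

💡 Usage Recommendations

Users should critically evaluate model outputs and consider fine-tuning for domain-specific safety and performance.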

📄 License

This project is licensed under Apache-2.0.

Citation

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}