🚀 SpaceLLaVA
SpaceLLaVA is a multimodal vision-language model adapted from LLaVA-1.5 (13B) and fine-tuned on a synthetic VQA dataset to enhance spatial reasoning. It demonstrates strong qualitative and quantitative spatial reasoning abilities.
🚀 Quick Start
GGUF
You can use this notebook to query spatial relationships between objects in a scene with llama-cpp-python.
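The sketch below shows one way to do this with llama-cpp-python's LLaVA-1.5 chat handler. The GGUF and mmproj file names are placeholders for the quantized weights and CLIP projector you download, not files shipped with this card.

```python
# Minimal sketch: querying SpaceLLaVA GGUF weights with llama-cpp-python.
# The model/projector file names are placeholders -- substitute the GGUF
# files you downloaded for this repo.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="spacellava-13b-q4_0.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,       # leave room for the image embeddings in context
    logits_all=True,  # needed by the LLaVA chat handler
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"}},
                {"type": "text",
                 "text": "What is the distance between the man in the red hat "
                         "and the pallet of boxes?"},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```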

Docker
docker build -f Dockerfile -t spacellava-server:latest .
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 12G spacellava-server:latest
python3 client.py --image_path "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" --prompt "What is the distance between the man in the red hat and the pallet of boxes?"
✨ Features
- Multimodal Capability: As a vision-language model, it can handle both visual and textual information.
- Spatial Reasoning: Fine-tuned to improve spatial reasoning, performing well on both qualitative and quantitative spatial reasoning tasks.
- Finetuning Strategy: Uses LoRA (Low-Rank Adaptation) to finetune from the base model liuhaotian/llava-v1.5-13b.
📦 Installation
Installation uses Docker. Build the image and start the server:
docker build -f Dockerfile -t spacellava-server:latest .
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 12G spacellava-server:latest
📚 Documentation
Model Overview
- Model Type: Multimodal, Vision-Language Model
- Architecture: llava-v1.5-13b
- Model Size: 13.4B parameters (FP16)
- Finetuned from: liuhaotian/llava-v1.5-13b
- Finetune Strategy: LoRA (Low-Rank Adaptation)
- License: Apache-2.0
Dataset & Training
- Dataset: SpaceLLaVA
- Code: VQASynth
- Reference: [SpatialVLM](https://spatial-vlm.github.io/)
The dataset contains roughly 28,000 synthetic samples created from templated VQA pairs using a 3D scene reconstruction pipeline. Each sample consists of an RGB image, a question (text), and an answer (text). Spatial relation types include “distances”, “size”, “left of”, “above”, “closer to”, and “inside”.
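For illustration only, a sample can be inspected with the datasets library; the Hub dataset ID and field names below are assumptions, so check the SpaceLLaVA dataset card for the exact repository name and schema.

```python
# Illustrative sketch of inspecting a SpaceLLaVA-style VQA sample.
# The dataset ID and field names are assumptions -- consult the
# SpaceLLaVA dataset card for the exact schema.
from datasets import load_dataset

ds = load_dataset("remyxai/SpaceLLaVA", split="train")  # assumed Hub ID
sample = ds[0]

print(sample.keys())  # expect image / question / answer style fields
# A templated spatial QA pair looks roughly like:
#   Q: "How far is the pallet of boxes from the man in the red hat?"
#   A: "They are roughly 1.5 meters apart."
```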
Scripts for LoRA SFT are available in trl. You can also check out the [SpaceVLMs collection](https://huggingface.co/collections/remyxai/spacevlms-66a3dbb924756d98e7aec678).
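As a rough sketch of what that setup looks like (the base checkpoint, rank, and target modules below are illustrative assumptions, not SpaceLLaVA's actual training recipe), LoRA adapters can be attached with peft before handing the model to trl's SFTTrainer:

```python
# Illustrative LoRA setup for a LLaVA-1.5 base with peft; hyperparameters
# and target modules are assumptions, not SpaceLLaVA's training recipe.
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-13b-hf"  # HF-format port of liuhaotian/llava-v1.5-13b
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```

From here, trl's SFTTrainer (or the scripts linked above) drives the supervised finetuning loop over the VQA pairs.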
Model Evaluation (Coming Soon)
TODO: VLMEvalKit evaluation on the QSpatial benchmark, VSR, etc.
You can try it on Discord: http://discord.gg/b2yGuCNpuC
⚠️ Limitations & Ethical Considerations
⚠️ Important Note
- Performance may degrade in cluttered environments or with unusual camera perspectives.
- This model was fine-tuned using synthetic reasoning over an internet image dataset.
- Multimodal biases inherent to the base model (LLaVA) may persist.
- Not intended for use in safety-critical or legal decision-making.
💡 Usage Tip
Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance.
License and Citation
Licensed under Apache-2.0.
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}
@misc{liu2023llava,
  title     = {Visual Instruction Tuning},
  author    = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  publisher = {NeurIPS},
  year      = {2023},
}

