🚀 SpaceLLaVA
SpaceLLaVA is a multimodal vision-language model adapted from LLaVA-1.5 (13B) and fine-tuned on a synthetic VQA dataset to enhance spatial reasoning. It demonstrates strong qualitative and quantitative spatial reasoning abilities.
🚀 Quick Start
GGUF
You can use this notebook to query spatial relationships between objects in a scene with llama-cpp-python.
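The sketch below shows one way to do this with llama-cpp-python's LLaVA-1.5 chat handler. The GGUF and mmproj file names are placeholders for the quantized weights and CLIP projector you download, not files shipped with this card.

```python
# Minimal sketch: querying SpaceLLaVA GGUF weights with llama-cpp-python.
# The model/projector file names are placeholders -- substitute the GGUF
# files you downloaded for this repo.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="spacellava-13b-q4_0.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,       # leave room for the image embeddings in context
    logits_all=True,  # needed by the LLaVA chat handler
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"}},
                {"type": "text",
                 "text": "What is the distance between the man in the red hat "
                         "and the pallet of boxes?"},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])
```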

Docker
docker build -f Dockerfile -t spacellava-server:latest .
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 12G spacellava-server:latest
python3 client.py --image_path "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" --prompt "What is the distance between the man in the red hat and the pallet of boxes?"
✨ Features
- Multimodal Capability: As a vision-language model, it can handle both visual and textual information.
- Spatial Reasoning: Fine-tuned to improve spatial reasoning, performing well on both qualitative and quantitative spatial reasoning tasks.
- Finetuning Strategy: Uses LoRA (Low-Rank Adaptation) to finetune from the base model liuhaotian/llava-v1.5-13b.
📦 Installation
Installation uses Docker. Build the image and start the server:
docker build -f Dockerfile -t spacellava-server:latest .
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 12G spacellava-server:latest
📚 Documentation
Model Overview
- Model Type: Multimodal, Vision-Language Model
- Architecture: llava-v1.5-13b
- Model Size: 13.4B parameters (FP16)
- Finetuned from: liuhaotian/llava-v1.5-13b
- Finetune Strategy: LoRA (Low-Rank Adaptation)
- License: Apache-2.0
Dataset & Training
- Dataset: SpaceLLaVA
- Code: VQASynth
- Reference: [SpatialVLM](https://spatial-vlm.github.io/)
The dataset contains roughly 28,000 synthetic samples created from templated VQA pairs using a 3D scene reconstruction pipeline. Each sample consists of an RGB image, a question (text), and an answer (text). Spatial relation types include “distances”, “size”, “left of”, “above”, “closer to”, and “inside”.
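For illustration only, a sample can be inspected with the datasets library; the Hub dataset ID and field names below are assumptions, so check the SpaceLLaVA dataset card for the exact repository name and schema.

```python
# Illustrative sketch of inspecting a SpaceLLaVA-style VQA sample.
# The dataset ID and field names are assumptions -- consult the
# SpaceLLaVA dataset card for the exact schema.
from datasets import load_dataset

ds = load_dataset("remyxai/SpaceLLaVA", split="train")  # assumed Hub ID
sample = ds[0]

print(sample.keys())  # expect image / question / answer style fields
# A templated spatial QA pair looks roughly like:
#   Q: "How far is the pallet of boxes from the man in the red hat?"
#   A: "They are roughly 1.5 meters apart."
```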
Scripts for LoRA SFT are available in trl. You can also check out the [SpaceVLMs collection](https://huggingface.co/collections/remyxai/spacevlms-66a3dbb924756d98e7aec678).
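As a rough sketch of what that setup looks like (the base checkpoint, rank, and target modules below are illustrative assumptions, not SpaceLLaVA's actual training recipe), LoRA adapters can be attached with peft before handing the model to trl's SFTTrainer:

```python
# Illustrative LoRA setup for a LLaVA-1.5 base with peft; hyperparameters
# and target modules are assumptions, not SpaceLLaVA's training recipe.
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-13b-hf"  # HF-format port of liuhaotian/llava-v1.5-13b
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```

From here, trl's SFTTrainer (or the scripts linked above) drives the supervised finetuning loop over the VQA pairs.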
Model Evaluation (Coming Soon)
TODO: VLMEvalKit evaluation on the QSpatial benchmark, VSR, etc.
You can try it on Discord: http://discord.gg/b2yGuCNpuC
⚠️ Limitations & Ethical Considerations
⚠️ Important Note
- Performance may degrade in cluttered environments or with unusual camera perspectives.
- This model was fine-tuned using synthetic reasoning over an internet image dataset.
- Multimodal biases inherent to the base model (LLaVA) may persist.
- Not intended for use in safety-critical or legal decision-making.
💡 Usage Tip
Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance.
License and Citation
Licensed under Apache-2.0.
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}
@misc{liu2023llava,
  title     = {Visual Instruction Tuning},
  author    = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  publisher = {NeurIPS},
  year      = {2023},
}

