SpatialVLA Fine-Tuned on fractal & bridge
This model was generated by fine-tuning the SpatialVLA model on the bridge dataset for the Simpler-env benchmark, offering high-performance vision-language-action capabilities.
Quick Start
SpatialVLA relies solely on HuggingFace Transformers, which makes deployment straightforward. If your environment supports transformers >= 4.47.0, you can use the following code to load the model and run inference (this requires about 8.5 GB of GPU memory).
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")

# Predict the action tokens and un-normalize them with the bridge dataset statistics.
generation_outputs = model.predict_action(inputs)
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
```
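As a quick sanity check before loading the checkpoint, you can verify the transformers version and the available GPU memory. This is a minimal sketch based on the stated requirements, not part of the official instructions.

```python
import torch
import transformers
from packaging import version

# Illustrative pre-flight checks (not from the official docs).
assert version.parse(transformers.__version__) >= version.parse("4.47.0"), \
    "transformers >= 4.47.0 is required"
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"

# Inference is reported to need roughly 8.5 GB of GPU memory.
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU memory available: {total_gb:.1f} GB")
```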
Features
- Multidisciplinary Team Development: Developed by the SpatialVLA team, composed of researchers from Shanghai AI Laboratory, ShanghaiTech, and TeleAI.
- Advanced Model Type: A vision-language-action model that converts language and image inputs into robot actions.
- Rich Data Support: Pretrained on large-scale datasets such as [Open X-Embodiment](https://robotics-transformer-x.github.io/) and RH20T.
- Easy Deployment: Built on HuggingFace Transformers, it can be quickly deployed in environments that meet the requirements.
Installation
If you want to use the model for fine-tuning or pre-training, follow these steps:

Clone the repository:
```bash
git clone https://github.com/SpatialVLA/SpatialVLA.git
```

Create a Python environment:
```bash
conda create -n spatialvla python=3.10
conda activate spatialvla
```

Install the required packages:
```bash
pip install -r requirements.txt
```

Note that we use a customised dlimp to support seed setting for reproducibility. If you encounter any problems, please manually install dlimp from dlimp_custom.
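If the manual installation is needed, something along these lines should work, assuming dlimp_custom is a directory inside the cloned repository (adjust the path to wherever dlimp_custom actually lives):

```bash
# Assumes dlimp_custom is shipped inside the SpatialVLA repository; adjust the path if needed.
cd SpatialVLA
pip install -e ./dlimp_custom
```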
Usage Examples
Basic Usage

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")

generation_outputs = model.predict_action(inputs)
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
```
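For closed-loop control, the same call can be wrapped in a simple loop. The sketch below is illustrative only: `get_frame` and `send_action` are placeholders for your own camera and robot interfaces, not part of SpatialVLA.

```python
# Illustrative closed-loop sketch; `get_frame` and `send_action` are placeholders
# for your own camera and robot controller interfaces (not part of SpatialVLA).
def run_episode(model, processor, prompt, get_frame, send_action, max_steps=100):
    for _ in range(max_steps):
        image = get_frame()  # should return a PIL.Image from your camera
        inputs = processor(images=[image], text=prompt, return_tensors="pt")
        generation_outputs = model.predict_action(inputs)
        actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
        send_action(actions)  # forward the decoded action to your controller
```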
Advanced Usage
Train from Scratch
SpatialVLA is pre-trained with 1.1 million real-robot demonstrations from the OXE and RH20T datasets on a cluster of 64 A100 GPUs for about 10 days, using a batch size of 2048. You can pre-train the model from scratch with the following commands:

```bash
# torchrun
bash scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh

# or on a SLURM cluster
bash scripts/spatialvla_4b_pretrain/slurm_pretrain.sh
```
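If you adapt the launch to your own cluster, note that a global batch size of 2048 over 64 GPUs corresponds to 32 samples per GPU. A generic multi-node torchrun invocation looks roughly like the sketch below; the training script name and its arguments are placeholders, not the repository's actual flags (see torchrun_pretrain.sh for those).

```bash
# Generic torchrun launch pattern (illustrative only); the script name and its
# arguments are placeholders, see scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh.
torchrun --nnodes=8 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py --per_device_batch_size 32  # 8 nodes x 8 GPUs x 32 = 2048
```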
Fine-tuning
Most of our fine-tuning experiments are conducted with LoRA on 4 or 8 A100 GPUs. You can use the following scripts for full-parameter or LoRA fine-tuning. For real-world experiments with small datasets, we prefer LoRA fine-tuning.

```bash
# full fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_full.sh

# LoRA fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_lora.sh
```
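For reference, LoRA fine-tuning of a Transformers model is typically set up with the peft library along the lines of the sketch below. The rank and target modules here are illustrative assumptions, not the settings used in finetune_lora.sh.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Illustrative LoRA setup with peft; the rank and target modules are assumptions,
# not the values used in scripts/spatialvla_4b_finetune/finetune_lora.sh.
model = AutoModel.from_pretrained(
    "IPEC-COMMUNITY/spatialvla-4b-224-pt",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
lora_config = LoraConfig(r=32, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```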
Documentation
Model Details

| Property | Details |
|---|---|
| Developed by | The SpatialVLA team, consisting of researchers from Shanghai AI Laboratory, ShanghaiTech, and TeleAI. |
| Model Type | Vision-language-action (language, image => robot actions) |
| Language(s) (NLP) | en |
| License | MIT |
| Finetuned from model | [paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) |
| Pretraining Dataset | [Open X-Embodiment](https://robotics-transformer-x.github.io/) and RH20T |
| Repository | https://github.com/SpatialVLA/SpatialVLA |
| Paper | SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model |
| Project Page & Videos | https://spatialvla.github.io/ |
Evaluation
SimplerEnv evaluation on Google Robot tasks

| Model | Visual Matching - Pick Coke Can | Visual Matching - Move Near | Visual Matching - Open/Close Drawer | Visual Matching - #Average | Variant Aggregation - Pick Coke Can | Variant Aggregation - Move Near | Variant Aggregation - Open/Close Drawer | Variant Aggregation - #Average |
|---|---|---|---|---|---|---|---|---|
| RT-1 (Begin) | 2.7% | 5.0% | 13.9% | 6.8% | 2.2% | 4.0% | 6.9% | 4.2% |
| RT-1 (15%) | 71.0% | 35.4% | 56.5% | 60.2% | 81.3% | 44.6% | 26.7% | 56.2% |
| RT-1 (Converged) | 85.7% | 44.2% | 73.0% | 74.6% | 89.8% | 50.0% | 32.3% | 63.3% |
| HPT | 56.0% | 60.0% | 24.0% | 46.0% | -- | -- | 31.0% | 45.0% |
| TraceVLA | 28.0% | 53.7% | 57.0% | 42.0% | 60.0% | 56.4% | 29.4% | 39.6% |
| RT-1-X | 56.7% | 31.7% | 59.7% | 53.4% | 49.0% | 32.3% | 35.3% | 64.3% |
| RT-2-X | 78.7% | 77.9% | 25.0% | 60.7% | 82.3% | 79.2% | -- | -- |
| Octo-Base | 17.0% | 4.2% | 22.7% | 16.8% | 0.6% | 3.1% | 1.1% | 1.1% |
| OpenVLA | 16.3% | 46.2% | 35.6% | 27.7% | 54.5% | 47.7% | 17.7% | 39.8% |
| RoboVLM (zero-shot) | 72.7% | 66.3% | 26.8% | 56.3% | 68.3% | 56.0% | 8.5% | 46.3% |
| RoboVLM (fine-tuning) | 77.3% | 61.7% | 43.5% | 63.4% | 75.6% | 60.0% | 10.6% | 51.3% |
| SpatialVLA (zero-shot) | 81.0% | 69.6% | 59.3% | 71.9% | 89.5% | 71.7% | 36.2% | 68.8% |
| SpatialVLA (fine-tuning) | 86.0% | 77.9% | 57.4% | 75.1% | 88.0% | 72.7% | 41.8% | 70.7% |
SimplerEnv evaluation on WidowX Robot tasks

| Model | Put Spoon on Towel - Grasp Spoon | Put Spoon on Towel - Success | Put Carrot on Plate - Grasp Carrot | Put Carrot on Plate - Success | Stack Green Block on Yellow Block - Grasp Green Block | Stack Green Block on Yellow Block - Success | Put Eggplant in Yellow Basket - Grasp Eggplant | Put Eggplant in Yellow Basket - Success | #Overall Average |
|---|---|---|---|---|---|---|---|---|---|
| RT-1-X | 16.7% | 0.0% | 20.8% | 4.2% | 8.3% | 0.0% | 0.0% | 0.0% | 1.1% |
| Octo-Base | 34.7% | 12.5% | 52.8% | 8.3% | 31.9% | 0.0% | 66.7% | 43.1% | 16.0% |
| Octo-Small | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 30.0% |
| OpenVLA | 4.1% | 0.0% | 33.3% | 0.0% | 12.5% | 0.0% | 8.3% | 4.1% | 1.0% |
| RoboVLM (zero-shot) | 37.5% | 20.8% | 33.3% | 25.0% | 8.3% | 8.3% | 0.0% | 0.0% | 13.5% |
| RoboVLM (fine-tuning) | 54.2% | 29.2% | 25.0% | 25.0% | 45.8% | 12.5% | 58.3% | 58.3% | 31.3% |
| SpatialVLA (zero-shot) | 25.0% | 20.8% | 41.7% | 20.8% | 58.3% | 25.0% | 79.2% | 70.8% | 34.4% |
| SpatialVLA (fine-tuning) | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100.0% | 100.0% | 42.7% |
Zero-shot Robot Control Evaluation on WidowX Robot

Spatial Understanding Capability Evaluation

License
This model is licensed under the MIT license.
Technical Details
The model is fine-tuned from [paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) on the bridge dataset for the Simpler-env benchmark. It is a vision-language-action model that converts language and image inputs into robot actions. The model is pre-trained on large-scale datasets such as [Open X-Embodiment](https://robotics-transformer-x.github.io/) and RH20T, and can be easily deployed with HuggingFace Transformers.
Citation
BibTeX:

```bibtex
@misc{qu2025spatialvlaexploringspatialrepresentations,
      title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model},
      author={Delin Qu and Haoming Song and Qizhi Chen and Yuanqi Yao and Xinyi Ye and Yan Ding and Zhigang Wang and JiaYuan Gu and Bin Zhao and Dong Wang and Xuelong Li},
      year={2025},
      eprint={2501.15830},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2501.15830},
}
```