
SpatialVLA 4B 224 PT

Developed by IPEC-COMMUNITY
SpatialVLA is a spatially enhanced vision-language-action model trained on 1.1 million real-world robot manipulation episodes, focused on robot control tasks.
Downloads: 13.06k
Release date: 1/26/2025

Model Overview

A vision-language-action model built on the PaliGemma 2 architecture that generates robot control actions from visual input and language instructions.

Model Features

Spatially enhanced representation
Explicitly optimized for spatial understanding, so the model handles spatial relationships in robot manipulation tasks more effectively.
Large-scale real-world data training
Trained on 1.1 million real robot manipulation episodes, giving it strong practical manipulation ability.
Concise and efficient implementation
Implemented entirely with HuggingFace Transformers, making it straightforward to deploy.
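Since the model ships as a standard Transformers checkpoint, inference can be sketched as follows. This is a minimal, non-authoritative example: the repository id `IPEC-COMMUNITY/spatialvla-4b-224-pt`, the `trust_remote_code` requirement, and the `predict_action` / `decode_actions` helpers follow the published model card and should be verified against the current card before use; `example.png` and the prompt text are placeholder inputs.

```python
# Minimal inference sketch for SpatialVLA (assumed API, see note above).
MODEL_ID = "IPEC-COMMUNITY/spatialvla-4b-224-pt"  # assumed HuggingFace repo id


def run_inference(image_path: str, prompt: str):
    # Heavy dependencies are imported lazily so the sketch can be read
    # (and the module imported) without torch/transformers installed.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    # trust_remote_code=True is required because the action head and the
    # custom processor live in the model repository, not in Transformers.
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
    ).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=[image], text=prompt, return_tensors="pt")

    # Helper names are taken from the model card; they may change
    # between repository revisions.
    generation = model.predict_action(inputs)
    return processor.decode_actions(generation)


if __name__ == "__main__":
    actions = run_inference("example.png", "Pick up the red cup on the table.")
    print(actions)
```

The lazy imports keep the sketch inspectable on machines without GPU dependencies; in a deployment script the imports would normally sit at module level.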

Model Capabilities

Visual instruction understanding
Robot action generation
Spatial relationship reasoning
Multimodal task processing

Use Cases

Robot control
Object grasping
Generates an action sequence to grasp an object from visual input and language instructions.
Achieves zero-shot control on the WidowX robot.
New configuration adaptation
Adapts to a new robot configuration with a small amount of fine-tuning.
Successfully applied to the Franka robot.
Spatial understanding
Spatial relationship reasoning
Understands the spatial relationships between objects and generates the corresponding actions.
Performs strongly on the LIBERO benchmark.