SpatialVLA Fine-Tuned on fractal & bridge
This model was created by fine-tuning the SpatialVLA base model on the fractal and bridge datasets. We made several adjustments to the training data to enhance the final performance (see the SpatialVLA paper for details). This model is used only for the fine-tuning ablations on in-domain datasets reported in Table V of the paper.
Features
- Easy Deployment: SpatialVLA relies solely on HuggingFace Transformers 🤗, making deployment extremely easy.
- High Performance: Fine-tuned on the fractal and bridge datasets to improve in-domain performance.
Installation
If you want to use the model for fine-tuning or pre-training, follow these steps:
- Clone the official repository:
git clone https://github.com/SpatialVLA/SpatialVLA.git
- Create a Python environment with Python >= 3.10:
conda create -n spatialvla python=3.10
conda activate spatialvla
- Install the packages from the requirements.txt file. Note that a customised dlimp is used to support seed setting for reproducibility; if you encounter any issues, manually install dlimp from dlimp_custom.
pip install -r requirements.txt
Usage Examples
Basic Usage
If your environment supports transformers >= 4.47.0, you can directly use the following code to load the model and perform inference (requires about 8.5 GB of GPU memory):
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"

# Load the processor and the model (the model code is fetched from the Hub).
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

# Build the multimodal prompt and predict an action.
image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")
generation_outputs = model.predict_action(inputs)

# Un-normalize the predicted actions with the statistics of the target dataset.
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
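The unnorm_key argument selects which dataset's action statistics are used to un-normalize the model's outputs, so it should match the robot setup you are running on. The snippet below is a small illustrative sketch: the fractal key name "fractal20220817_data/0.1.0" is assumed from the usual Open X-Embodiment dataset naming and may differ from the keys actually stored in this checkpoint.
# Illustrative only: map a robot setup to the matching un-normalization key.
# "bridge_orig/1.0.0" comes from the example above; "fractal20220817_data/0.1.0"
# is an assumed key name for the fractal (Google Robot) data.
UNNORM_KEYS = {
    "widowx_bridge": "bridge_orig/1.0.0",
    "google_robot": "fractal20220817_data/0.1.0",
}
actions = processor.decode_actions(generation_outputs, unnorm_key=UNNORM_KEYS["google_robot"])
print(actions)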
Advanced Usage
Train from Scratch
SpatialVLA is pre-trained on 1.1 million real-robot demonstrations from the OXE and RH20T datasets, on a cluster of 64 A100 GPUs for about 10 days with a batch size of 2048. You can pre-train the model from scratch with one of the following commands (torchrun or SLURM):
bash scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh
bash scripts/spatialvla_4b_pretrain/slurm_pretrain.sh
Fine-tuning
Most of our fine-tuning experiments are conducted with LoRA on 4 or 8 A100 GPUs. You can use the following scripts for full-parameter or LoRA fine-tuning; for real-world experiments with small datasets, we recommend LoRA.
bash scripts/spatialvla_4b_finetune/finetune_full.sh
bash scripts/spatialvla_4b_finetune/finetune_lora.sh
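The scripts above are the supported entry points. Purely as an illustration of the LoRA idea, the sketch below shows how low-rank adapters could be attached to the loaded model with the Hugging Face PEFT library; the rank, alpha, and target_modules values are assumptions for the sketch, not the settings used by finetune_lora.sh, and a real run would still go through the repository's data pipeline and trainer.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model  # requires the `peft` package

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16)

# Hypothetical LoRA configuration: the values below are illustrative guesses,
# not the hyperparameters used by the official fine-tuning script.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable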
Documentation
Uses
- Direct Use: As shown in the basic usage example above.
- Out-of-Scope Use: SpatialVLA models do not zero-shot generalize to new (unseen) robot embodiments or to setups that are not represented in the pre-training mix. In these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning SpatialVLA instead.
Evaluation
- SimplerEnv evaluation on Google Robot tasks:
| Model | Visual Matching - Pick Coke Can | Visual Matching - Move Near | Visual Matching - Open/Close Drawer | Visual Matching - #Average | Variant Aggregation - Pick Coke Can | Variant Aggregation - Move Near | Variant Aggregation - Open/Close Drawer | Variant Aggregation - #Average |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| RT-1 (Begin) | 2.7% | 5.0% | 13.9% | 6.8% | 2.2% | 4.0% | 6.9% | 4.2% |
| RT-1 (15%) | 71.0% | 35.4% | 56.5% | 60.2% | 81.3% | 44.6% | 26.7% | 56.2% |
| RT-1 (Converged) | 85.7% | 44.2% | 73.0% | 74.6% | 89.8% | 50.0% | 32.3% | 63.3% |
| HPT | 56.0% | 60.0% | 24.0% | 46.0% | -- | -- | 31.0% | 45.0% |
| TraceVLA | 28.0% | 53.7% | 57.0% | 42.0% | 60.0% | 56.4% | 29.4% | 39.6% |
| RT-1-X | 56.7% | 31.7% | 59.7% | 53.4% | 49.0% | 32.3% | 35.3% | 64.3% |
| RT-2-X | 78.7% | 77.9% | 25.0% | 60.7% | 82.3% | 79.2% | -- | -- |
| Octo-Base | 17.0% | 4.2% | 22.7% | 16.8% | 0.6% | 3.1% | 1.1% | 1.1% |
| OpenVLA | 16.3% | 46.2% | 35.6% | 27.7% | 54.5% | 47.7% | 17.7% | 39.8% |
| RoboVLM (zero-shot) | 72.7% | 66.3% | 26.8% | 56.3% | 68.3% | 56.0% | 8.5% | 46.3% |
| RoboVLM (fine-tuning) | 77.3% | 61.7% | 43.5% | 63.4% | 75.6% | 60.0% | 10.6% | 51.3% |
| SpatialVLA (zero-shot) | 81.0% | 69.6% | 59.3% | 71.9% | 89.5% | 71.7% | 36.2% | 68.8% |
| SpatialVLA (fine-tuning) | 86.0% | 77.9% | 57.4% | 75.1% | 88.0% | 72.7% | 41.8% | 70.7% |
- SimplerEnv evaluation on WidowX Robot tasks:
| Model | Put Spoon on Towel - Grasp Spoon | Put Spoon on Towel - Success | Put Carrot on Plate - Grasp Carrot | Put Carrot on Plate - Success | Stack Green Block on Yellow Block - Grasp Green Block | Stack Green Block on Yellow Block - Success | Put Eggplant in Yellow Basket - Grasp Eggplant | Put Eggplant in Yellow Basket - Success | #Overall Average |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| RT-1-X | 16.7% | 0.0% | 20.8% | 4.2% | 8.3% | 0.0% | 0.0% | 0.0% | 1.1% |
| Octo-Base | 34.7% | 12.5% | 52.8% | 8.3% | 31.9% | 0.0% | 66.7% | 43.1% | 16.0% |
| Octo-Small | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 30.0% |
| OpenVLA | 4.1% | 0.0% | 33.3% | 0.0% | 12.5% | 0.0% | 8.3% | 4.1% | 1.0% |
| RoboVLM (zero-shot) | 37.5% | 20.8% | 33.3% | 25.0% | 8.3% | 8.3% | 0.0% | 0.0% | 13.5% |
| RoboVLM (fine-tuning) | 54.2% | 29.2% | 25.0% | 25.0% | 45.8% | 12.5% | 58.3% | 58.3% | 31.3% |
| SpatialVLA (zero-shot) | 25.0% | 20.8% | 41.7% | 20.8% | 58.3% | 25.0% | 79.2% | 70.8% | 34.4% |
| SpatialVLA (fine-tuning) | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100.0% | 100.0% | 42.7% |
- Zero-shot Robot Control Evaluation on WidowX Robot:

- Spatial Understanding Capability Evaluation:

License
This model is licensed under the MIT license.
Technical Details
The VLM backbone of SpatialVLA is PaliGemma 2, which requires transformers >= 4.47.0.
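A quick sanity check that the installed transformers version meets this requirement (a small convenience snippet, not part of the official repository):
# Verify that the installed transformers version supports PaliGemma 2 (>= 4.47.0).
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.47.0"), (
    f"transformers {transformers.__version__} is too old; please upgrade to >= 4.47.0"
)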
Citation
BibTeX:
@misc{qu2025spatialvlaexploringspatialrepresentations,
title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model},
author={Delin Qu and Haoming Song and Qizhi Chen and Yuanqi Yao and Xinyi Ye and Yan Ding and Zhigang Wang and JiaYuan Gu and Bin Zhao and Dong Wang and Xuelong Li},
year={2025},
eprint={2501.15830},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2501.15830},
}