SpatialVLA Fine-Tuned on fractal & bridge
This model was generated by fine-tuning the SpatialVLA model on the bridge dataset for the Simpler-env benchmark, offering high-performance vision-language-action capabilities.
Quick Start
SpatialVLA relies solely on HuggingFace Transformers, which makes deployment straightforward. If your environment supports transformers >= 4.47.0, you can use the following code to load the model and run inference (this requires about 8.5 GB of GPU memory).
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")

# Predict the action tokens and un-normalize them with the bridge dataset statistics.
generation_outputs = model.predict_action(inputs)
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
```
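As a quick sanity check before loading the checkpoint, you can verify the transformers version and the available GPU memory. This is a minimal sketch based on the stated requirements, not part of the official instructions.

```python
import torch
import transformers
from packaging import version

# Illustrative pre-flight checks (not from the official docs).
assert version.parse(transformers.__version__) >= version.parse("4.47.0"), \
    "transformers >= 4.47.0 is required"
assert torch.cuda.is_available(), "a CUDA-capable GPU is required"

# Inference is reported to need roughly 8.5 GB of GPU memory.
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU memory available: {total_gb:.1f} GB")
```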
Features
- Multidisciplinary Team Development: Developed by the SpatialVLA team, composed of researchers from Shanghai AI Laboratory, ShanghaiTech, and TeleAI.
- Advanced Model Type: A vision-language-action model that converts language and image inputs into robot actions.
- Rich Data Support: Pretrained on large-scale datasets such as [Open X-Embodiment](https://robotics-transformer-x.github.io/) and RH20T.
- Easy Deployment: Built on HuggingFace Transformers, it can be quickly deployed in environments that meet the requirements.
Installation
If you want to use the model for fine-tuning or pre-training, follow these steps:

Clone the repository:
```bash
git clone https://github.com/SpatialVLA/SpatialVLA.git
```

Create a Python environment:
```bash
conda create -n spatialvla python=3.10
conda activate spatialvla
```

Install the required packages:
```bash
pip install -r requirements.txt
```

Note that we use a customised dlimp to support seed setting for reproducibility. If you encounter any problems, please manually install dlimp from dlimp_custom.
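If the manual installation is needed, something along these lines should work, assuming dlimp_custom is a directory inside the cloned repository (adjust the path to wherever dlimp_custom actually lives):

```bash
# Assumes dlimp_custom is shipped inside the SpatialVLA repository; adjust the path if needed.
cd SpatialVLA
pip install -e ./dlimp_custom
```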
Usage Examples
Basic Usage

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")

generation_outputs = model.predict_action(inputs)
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
```
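For closed-loop control, the same call can be wrapped in a simple loop. The sketch below is illustrative only: `get_frame` and `send_action` are placeholders for your own camera and robot interfaces, not part of SpatialVLA.

```python
# Illustrative closed-loop sketch; `get_frame` and `send_action` are placeholders
# for your own camera and robot controller interfaces (not part of SpatialVLA).
def run_episode(model, processor, prompt, get_frame, send_action, max_steps=100):
    for _ in range(max_steps):
        image = get_frame()  # should return a PIL.Image from your camera
        inputs = processor(images=[image], text=prompt, return_tensors="pt")
        generation_outputs = model.predict_action(inputs)
        actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
        send_action(actions)  # forward the decoded action to your controller
```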
Advanced Usage
Train from Scratch
SpatialVLA is pre-trained with 1.1 million real-robot demonstrations from the OXE and RH20T datasets on a cluster of 64 A100 GPUs for about 10 days, using a batch size of 2048. You can pre-train the model from scratch with the following commands:

```bash
# torchrun
bash scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh

# or on a SLURM cluster
bash scripts/spatialvla_4b_pretrain/slurm_pretrain.sh
```
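If you adapt the launch to your own cluster, note that a global batch size of 2048 over 64 GPUs corresponds to 32 samples per GPU. A generic multi-node torchrun invocation looks roughly like the sketch below; the training script name and its arguments are placeholders, not the repository's actual flags (see torchrun_pretrain.sh for those).

```bash
# Generic torchrun launch pattern (illustrative only); the script name and its
# arguments are placeholders, see scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh.
torchrun --nnodes=8 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py --per_device_batch_size 32  # 8 nodes x 8 GPUs x 32 = 2048
```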
Fine-tuning
Most of our fine-tuning experiments are conducted with LoRA on 4 or 8 A100 GPUs. You can use the following scripts for full-parameter or LoRA fine-tuning. For real-world experiments with small datasets, we prefer LoRA fine-tuning.

```bash
# full fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_full.sh

# LoRA fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_lora.sh
```
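For reference, LoRA fine-tuning of a Transformers model is typically set up with the peft library along the lines of the sketch below. The rank and target modules here are illustrative assumptions, not the settings used in finetune_lora.sh.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Illustrative LoRA setup with peft; the rank and target modules are assumptions,
# not the values used in scripts/spatialvla_4b_finetune/finetune_lora.sh.
model = AutoModel.from_pretrained(
    "IPEC-COMMUNITY/spatialvla-4b-224-pt",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
lora_config = LoraConfig(r=32, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```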
Documentation
Model Details

| Property | Details |
|---|---|
| Developed by | The SpatialVLA team, consisting of researchers from Shanghai AI Laboratory, ShanghaiTech, and TeleAI. |
| Model Type | Vision-language-action (language, image => robot actions) |
| Language(s) (NLP) | en |
| License | MIT |
| Finetuned from model | [paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) |
| Pretraining Dataset | [Open X-Embodiment](https://robotics-transformer-x.github.io/) and RH20T |
| Repository | https://github.com/SpatialVLA/SpatialVLA |
| Paper | SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model |
| Project Page & Videos | https://spatialvla.github.io/ |
Evaluation
SimplerEnv evaluation on Google Robot tasks

| Model | Visual Matching - Pick Coke Can | Visual Matching - Move Near | Visual Matching - Open/Close Drawer | Visual Matching - #Average | Variant Aggregation - Pick Coke Can | Variant Aggregation - Move Near | Variant Aggregation - Open/Close Drawer | Variant Aggregation - #Average |
|---|---|---|---|---|---|---|---|---|
| RT-1 (Begin) | 2.7% | 5.0% | 13.9% | 6.8% | 2.2% | 4.0% | 6.9% | 4.2% |
| RT-1 (15%) | 71.0% | 35.4% | 56.5% | 60.2% | 81.3% | 44.6% | 26.7% | 56.2% |
| RT-1 (Converged) | 85.7% | 44.2% | 73.0% | 74.6% | 89.8% | 50.0% | 32.3% | 63.3% |
| HPT | 56.0% | 60.0% | 24.0% | 46.0% | -- | -- | 31.0% | 45.0% |
| TraceVLA | 28.0% | 53.7% | 57.0% | 42.0% | 60.0% | 56.4% | 29.4% | 39.6% |
| RT-1-X | 56.7% | 31.7% | 59.7% | 53.4% | 49.0% | 32.3% | 35.3% | 64.3% |
| RT-2-X | 78.7% | 77.9% | 25.0% | 60.7% | 82.3% | 79.2% | -- | -- |
| Octo-Base | 17.0% | 4.2% | 22.7% | 16.8% | 0.6% | 3.1% | 1.1% | 1.1% |
| OpenVLA | 16.3% | 46.2% | 35.6% | 27.7% | 54.5% | 47.7% | 17.7% | 39.8% |
| RoboVLM (zero-shot) | 72.7% | 66.3% | 26.8% | 56.3% | 68.3% | 56.0% | 8.5% | 46.3% |
| RoboVLM (fine-tuning) | 77.3% | 61.7% | 43.5% | 63.4% | 75.6% | 60.0% | 10.6% | 51.3% |
| SpatialVLA (zero-shot) | 81.0% | 69.6% | 59.3% | 71.9% | 89.5% | 71.7% | 36.2% | 68.8% |
| SpatialVLA (fine-tuning) | 86.0% | 77.9% | 57.4% | 75.1% | 88.0% | 72.7% | 41.8% | 70.7% |
SimplerEnv evaluation on WidowX Robot tasks

| Model | Put Spoon on Towel - Grasp Spoon | Put Spoon on Towel - Success | Put Carrot on Plate - Grasp Carrot | Put Carrot on Plate - Success | Stack Green Block on Yellow Block - Grasp Green Block | Stack Green Block on Yellow Block - Success | Put Eggplant in Yellow Basket - Grasp Eggplant | Put Eggplant in Yellow Basket - Success | #Overall Average |
|---|---|---|---|---|---|---|---|---|---|
| RT-1-X | 16.7% | 0.0% | 20.8% | 4.2% | 8.3% | 0.0% | 0.0% | 0.0% | 1.1% |
| Octo-Base | 34.7% | 12.5% | 52.8% | 8.3% | 31.9% | 0.0% | 66.7% | 43.1% | 16.0% |
| Octo-Small | 77.8% | 47.2% | 27.8% | 9.7% | 40.3% | 4.2% | 87.5% | 56.9% | 30.0% |
| OpenVLA | 4.1% | 0.0% | 33.3% | 0.0% | 12.5% | 0.0% | 8.3% | 4.1% | 1.0% |
| RoboVLM (zero-shot) | 37.5% | 20.8% | 33.3% | 25.0% | 8.3% | 8.3% | 0.0% | 0.0% | 13.5% |
| RoboVLM (fine-tuning) | 54.2% | 29.2% | 25.0% | 25.0% | 45.8% | 12.5% | 58.3% | 58.3% | 31.3% |
| SpatialVLA (zero-shot) | 25.0% | 20.8% | 41.7% | 20.8% | 58.3% | 25.0% | 79.2% | 70.8% | 34.4% |
| SpatialVLA (fine-tuning) | 20.8% | 16.7% | 29.2% | 25.0% | 62.5% | 29.2% | 100.0% | 100.0% | 42.7% |
Zero-shot Robot Control Evaluation on WidowX Robot

Spatial Understanding Capability Evaluation

License
This model is licensed under the MIT license.
Technical Details
The model is fine-tuned from [paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224) on the bridge dataset for the Simpler-env benchmark. It is a vision-language-action model that converts language and image inputs into robot actions. The model is pre-trained on large-scale datasets such as [Open X-Embodiment](https://robotics-transformer-x.github.io/) and RH20T, and can be easily deployed with HuggingFace Transformers.
Citation
BibTeX:

```bibtex
@misc{qu2025spatialvlaexploringspatialrepresentations,
      title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model},
      author={Delin Qu and Haoming Song and Qizhi Chen and Yuanqi Yao and Xinyi Ye and Yan Ding and Zhigang Wang and JiaYuan Gu and Bin Zhao and Dong Wang and Xuelong Li},
      year={2025},
      eprint={2501.15830},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2501.15830},
}
```