SpatialVLA 4B 224 pt

Developed by IPEC-COMMUNITY

SpatialVLA is a spatial-enhanced vision-language-action model trained on 1.1 million real robot manipulation episodes and specialized for robot control tasks.

Multimodal Fusion

Transformers

English  Open Source License: MIT  #Robot action prediction  #Spatial-enhanced VLA  #Real-scene training

Downloads 13.06k

Release Time : 1/26/2025

Model Overview

A vision-language-action model based on the PaliGemma2 architecture that can generate robot control actions from visual input and language instructions.

Model Features

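Spatial-Enhanced Representation

Specifically optimized for spatial understanding, improving how spatial relationships are handled in robot manipulation tasks.

Large-Scale Real-World Data Training

Trained on 1.1 million real robot manipulation episodes, giving it strong real-world manipulation capability.

Concise and Efficient Implementation

Implemented entirely on HuggingFace Transformers, making it easy to deploy.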

Model Capabilities

Visual instruction understanding

Robot action generation

Spatial relationship reasoning

Multimodal task processing

Use Cases

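Robot Control

Object Grasping

Generates object-grasping action sequences from visual input and language instructions.

Achieves zero-shot control on a WidowX robot.

Adaptation to New Setups

Adapts to new robot setups with a small amount of fine-tuning.

Successfully applied to a Franka robot.

Spatial Understanding

Spatial Relationship Reasoning

Understands spatial relationships between objects and generates the corresponding actions.

Strong performance on the LIBERO benchmark.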

license: mit
base_model:
- google/paligemma2-3b-pt-224
tags:
- VLA
- Foundation Vision-language-action Model
- Generalist Robot Policy
- robotics
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers

SpatialVLA

SpatialVLA is a spatial-enhanced vision-language-action model trained on 1.1 million real robot episodes. The code is built purely on HuggingFace Transformers and is concise, with efficient performance.

All SpatialVLA checkpoints, as well as our training codebase, are released under an MIT License.

For full details, please read our paper and see our project page.

Model Details

Model Description

Developed by: The SpatialVLA team consisting of researchers from Shanghai AI Laboratory, ShanghaiTech and TeleAI.
Model type: Vision-language-action (language, image => robot actions)
Language(s) (NLP): en
License: MIT
Finetuned from model: paligemma2-3b-pt-224
Pretraining Dataset: Open X-Embodiment and RH20T
Repository: https://github.com/SpatialVLA/SpatialVLA
Paper: SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Project Page & Videos: https://spatialvla.github.io/

Uses

SpatialVLA relies solely on HuggingFace Transformers 🤗, making deployment extremely easy. If your environment supports transformers >= 4.47.0, you can directly use the following code to load the model and perform inference (this requires 8.5 GB of GPU memory).

Direct Use

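import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"

# trust_remote_code=True lets Transformers load the custom processor and model code from the model repository.
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)

# Load the weights in bfloat16, switch to evaluation mode, and move the model to the GPU.
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

# Prepare one RGB observation and a language instruction.
image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cup?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")
generation_outputs = model.predict_action(inputs)

# Decode the generated outputs into robot actions; unnorm_key selects the dataset statistics used for un-normalization.
actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)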
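
In a control loop, the same three calls (preprocess, predict, decode) would presumably be repeated for each new camera frame. The sketch below wraps them in a small helper. The name `predict_actions` and its parameters are hypothetical and not part of the SpatialVLA code; the helper assumes `model` and `processor` were loaded as in the snippet above and uses only the calls shown there.

```python
def predict_actions(model, processor, image, instruction,
                    unnorm_key="bridge_orig/1.0.0"):
    """Run one inference step: preprocess, predict, decode.

    Hypothetical helper (not part of the SpatialVLA repository); it only
    chains the calls from the Direct Use snippet.
    """
    # Build model inputs from a single RGB image and a text instruction.
    inputs = processor(images=[image], text=instruction, return_tensors="pt")
    # Generate the action outputs.
    generation_outputs = model.predict_action(inputs)
    # Map the outputs back to robot actions using the chosen statistics key.
    return processor.decode_actions(generation_outputs, unnorm_key=unnorm_key)
```

Passing the model and processor in as arguments keeps the helper free of global state, so the same function can be reused with a different checkpoint or a different `unnorm_key`.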

Out-of-Scope Use

SpatialVLA models do not zero-shot generalize to new (unseen) robot embodiments, or setups that are not represented in the pretraining mix; in these cases, we suggest collecting a dataset of demonstrations on the desired setup, and fine-tuning SpatialVLA models instead.

How to Get Hands Dirty with the Model

If you want to use the model for fine-tuning or pre-training, you need to clone the official repository first.

git clone https://github.com/SpatialVLA/SpatialVLA.git

Then install the required packages and download the model from the Hugging Face model hub. The VLM backbone of SpatialVLA is PaliGemma2, which requires transformers >= 4.47.0. Hence, create a Python environment with Python >= 3.10.

conda create -n spatialvla python=3.10
conda activate spatialvla

Install the packages from the requirements.txt file. Note that we use a customised dlimp to support seed setting for reproducibility. If you run into any problems, please manually install dlimp from dlimp_custom.

pip install -r requirements.txt
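
After installation, the version requirements stated above (Python >= 3.10 and transformers >= 4.47.0) can be checked from Python. The following is a minimal sketch using only the standard library; the function names `parse_version`, `meets_minimum`, and `check_environment` are illustrative assumptions, not part of the repository.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '4.47.0' or '4.48.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '4.47' compares equal to '4.47.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(min_python=(3, 10), min_transformers="4.47.0"):
    """Report whether Python and transformers satisfy the stated minimums."""
    python_ok = sys.version_info[:2] >= min_python
    try:
        transformers_ok = meets_minimum(version("transformers"), min_transformers)
    except PackageNotFoundError:
        transformers_ok = False  # transformers is not installed
    return {"python": python_ok, "transformers": transformers_ok}


if __name__ == "__main__":
    print(check_environment())
```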

Train from Scratch

SpatialVLA is pre-trained with 1.1 million real-robot demonstrations from the OXE and RH20T datasets on a cluster of 64 A100 GPUs for about 10 days, using a batch size of 2048. You can pre-train the model from scratch using the following command.

# torchrun
bash scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh

# or in a slurm cluster
bash scripts/spatialvla_4b_pretrain/slurm_pretrain.sh
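
To put the stated pre-training setup in perspective, the figures given above (64 GPUs, batch size 2048, about 10 days) imply a per-GPU share of the batch and a rough compute budget. The short calculation below is only arithmetic on those numbers. It assumes the batch is split evenly across GPUs (how that share is divided between micro-batches and gradient accumulation is not stated) and treats "about 10 days" as exactly 10.

```python
def per_gpu_batch(global_batch, num_gpus):
    """Samples each GPU contributes per optimizer step, assuming an even split."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch is not divisible by the number of GPUs")
    return global_batch // num_gpus


def gpu_hours(num_gpus, days):
    """Total GPU-hours for a run of the given length."""
    return num_gpus * days * 24


if __name__ == "__main__":
    print(per_gpu_batch(2048, 64))  # 32 samples per GPU per step
    print(gpu_hours(64, 10))        # 15360 GPU-hours (approximate)
```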

Fine-tuning

Most of our fine-tuning experiments are conducted using LoRA on 4 or 8 A100 GPUs. You can use the following scripts for full-parameter or LoRA fine-tuning. For real-world experiments with small datasets, we prefer using LoRA for fine-tuning.

# full fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_full.sh

# LoRA fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_lora.sh
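
As background on why LoRA suits small datasets: LoRA keeps a weight matrix frozen and learns a low-rank update, so each adapted matrix of shape d_out x d_in adds only rank * (d_in + d_out) trainable parameters. The sketch below compares that with full fine-tuning of the same matrix. The layer size (2048 x 2048) and rank (32) are illustrative assumptions, not the values used by the fine-tuning scripts.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters of a LoRA update B @ A with A: rank x d_in, B: d_out x rank."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 2048, 2048, 32  # hypothetical layer size and rank
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(full, lora, lora / full)
```

With these assumed numbers, the LoRA update trains about 3% of the parameters of the full matrix.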

Evaluation

SimplerEnv evaluation on Google Robot tasks.

Model	Visual Matching				Variant Aggregation
Model	Pick Coke Can	Move Near	Open/Close Drawer	#Average	Pick Coke Can	Move Near	Open/Close Drawer	#Average
RT-1 (Begin)	2.7%	5.0%	13.9%	6.8%	2.2%	4.0%	6.9%	4.2%
RT-1 (15%)	71.0%	35.4%	56.5%	60.2%	81.3%	44.6%	26.7%	56.2%
RT-1 (Converged)	85.7%	44.2%	73.0%	74.6%	89.8%	50.0%	32.3%	63.3%
HPT	56.0%	60.0%	24.0%	46.0%	--	--	31.0%	45.0%
TraceVLA	28.0%	53.7%	57.0%	42.0%	60.0%	56.4%	29.4%	39.6%
RT-1-X	56.7%	31.7%	59.7%	53.4%	49.0%	32.3%	35.3%	64.3%
RT-2-X	78.7%	77.9%	25.0%	60.7%	82.3%	79.2%	--	--
Octo-Base	17.0%	4.2%	22.7%	16.8%	0.6%	3.1%	1.1%	1.1%
OpenVLA	16.3%	46.2%	35.6%	27.7%	54.5%	47.7%	17.7%	39.8%
RoboVLM (zero-shot)	72.7%	66.3%	26.8%	56.3%	68.3%	56.0%	8.5%	46.3%
RoboVLM (fine-tuning)	77.3%	61.7%	43.5%	63.4%	75.6%	60.0%	10.6%	51.3%
SpatialVLA (zero-shot)	81.0%	69.6%	59.3%	71.9%	89.5%	71.7%	36.2%	68.8%
SpatialVLA (fine-tuning)	86.0%	77.9%	57.4%	75.1%	88.0%	72.7%	41.8%	70.7%

SimplerEnv evaluation on WidowX Robot tasks.

Model	Put Spoon on Towel		Put Carrot on Plate		Stack Green Block on Yellow Block		Put Eggplant in Yellow Basket		#Overall Average
Model	Grasp Spoon	Success	Grasp Carrot	Success	Grasp Green Block	Success	Grasp Eggplant	Success	#Overall Average
RT-1-X	16.7%	0.0%	20.8%	4.2%	8.3%	0.0%	0.0%	0.0%	1.1%
Octo-Base	34.7%	12.5%	52.8%	8.3%	31.9%	0.0%	66.7%	43.1%	16.0%
Octo-Small	77.8%	47.2%	27.8%	9.7%	40.3%	4.2%	87.5%	56.9%	30.0%
OpenVLA	4.1%	0.0%	33.3%	0.0%	12.5%	0.0%	8.3%	4.1%	1.0%
RoboVLM (zero-shot)	37.5%	20.8%	33.3%	25.0%	8.3%	8.3%	0.0%	0.0%	13.5%
RoboVLM (fine-tuning)	54.2%	29.2%	25.0%	25.0%	45.8%	12.5%	58.3%	58.3%	31.3%
SpatialVLA (zero-shot)	25.0%	20.8%	41.7%	20.8%	58.3%	25.0%	79.2%	70.8%	34.4%
SpatialVLA (fine-tuning)	20.8%	16.7%	29.2%	25.0%	62.5%	29.2%	100.0%	100.0%	42.7%
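
The "#Overall Average" column appears to be the mean of the four per-task Success rates. This is inferred from the numbers in the table rather than stated in the source. The sketch below reproduces the reported values for the two SpatialVLA rows to within rounding.

```python
# Success rates (%) copied from the WidowX table, in task order:
# spoon on towel, carrot on plate, stack blocks, eggplant in basket.
success = {
    "SpatialVLA (zero-shot)": [20.8, 20.8, 25.0, 70.8],
    "SpatialVLA (fine-tuning)": [16.7, 25.0, 29.2, 100.0],
}
reported = {
    "SpatialVLA (zero-shot)": 34.4,
    "SpatialVLA (fine-tuning)": 42.7,
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


if __name__ == "__main__":
    for name, rates in success.items():
        print(name, mean(rates), "reported:", reported[name])
```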

LIBERO Simulation Benchmark Results.

Model	LIBERO-Spatial		LIBERO-Object		LIBERO-Goal		LIBERO-Long		Average
Model	SR (↑)	Rank (↓)	SR (↑)	Rank (↓)	SR (↑)	Rank (↓)	SR (↑)	Rank (↓)	SR (↑)	Rank (↓)
Diffusion Policy from scratch	78.3 ± 1.1%	5	92.5 ± 0.7%	1	68.3 ± 1.2%	5	50.5 ± 1.3%	5	72.4 ± 0.7%	5
Octo fine-tuned	78.9 ± 1.0%	4	85.7 ± 0.9%	4	84.6 ± 0.9%	1	51.1 ± 1.3%	4	75.1 ± 0.6%	3
OpenVLA fine-tuned	84.7 ± 0.9%	2	88.4 ± 0.8%	3	79.2 ± 1.0%	2	53.7 ± 1.3%	3	76.5 ± 0.6%	2
TraceVLA fine-tuned	84.6 ± 0.2%	3	85.2 ± 0.4%	5	75.1 ± 0.3%	4	54.1 ± 1.0%	2	74.8 ± 0.5%	4
SpatialVLA fine-tuned	88.2 ± 0.5%	1	89.9 ± 0.7%	2	78.6 ± 0.6%	3	55.5 ± 1.0%	1	78.1 ± 0.7%	1
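
The Rank columns can be recomputed from the success rates by sorting the models within each suite. The sketch below does this for the mean values copied from the table (the ± terms are ignored). It assumes rank 1 goes to the highest success rate and does not handle ties; the function names are illustrative.

```python
# Mean success rates (%) from the LIBERO table, in suite order:
# LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long.
libero_sr = {
    "Diffusion Policy from scratch": [78.3, 92.5, 68.3, 50.5],
    "Octo fine-tuned": [78.9, 85.7, 84.6, 51.1],
    "OpenVLA fine-tuned": [84.7, 88.4, 79.2, 53.7],
    "TraceVLA fine-tuned": [84.6, 85.2, 75.1, 54.1],
    "SpatialVLA fine-tuned": [88.2, 89.9, 78.6, 55.5],
}


def rank_models(scores):
    """Map each model to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def suite_ranks(table, suite_index):
    """Rank models on a single suite (0 = LIBERO-Spatial, ..., 3 = LIBERO-Long)."""
    return rank_models({name: sr[suite_index] for name, sr in table.items()})


def average_ranks(table):
    """Rank models by their mean success rate over all suites."""
    return rank_models({name: sum(sr) / len(sr) for name, sr in table.items()})


if __name__ == "__main__":
    print(suite_ranks(libero_sr, 0))
    print(average_ranks(libero_sr))
```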

Zero-shot Robot Control Evaluation on WidowX Robot.

Spatial Understanding Capability Evaluation.

Adapting to New Robot Setups on Franka Robot.

Citation

BibTeX:

@misc{qu2025spatialvlaexploringspatialrepresentations,
      title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model}, 
      author={Delin Qu and Haoming Song and Qizhi Chen and Yuanqi Yao and Xinyi Ye and Yan Ding and Zhigang Wang and JiaYuan Gu and Bin Zhao and Dong Wang and Xuelong Li},
      year={2025},
      eprint={2501.15830},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2501.15830}, 
}