🚀 OpenVLA v0.1 7B
OpenVLA v0.1 7B is an early model developed for research. For our best model, visit openvla/openvla-7b.
OpenVLA v0.1 7B (openvla-v01-7b) is an open-source vision-language-action model trained on 800K robot manipulation episodes from the Open X-Embodiment dataset, the same data mixture used by Octo. The model takes language instructions and camera images as input and generates robot actions. It supports controlling multiple robots out of the box and can be efficiently fine-tuned for new robot domains.
All OpenVLA checkpoints and the training codebase are released under an MIT License.
For detailed information, refer to our paper and our project page.
✨ Features
- Multimodal Input: Accepts both language instructions and camera images to generate robot actions.
- Multi-Robot Compatibility: Supports controlling multiple robots out of the box.
- Fine-Tuning Capability: Can be fine-tuned for new robot domains with minimal data.
📦 Installation
Install the minimal dependencies with:
pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
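Note that the usage example below loads the model with attn_implementation="flash_attention_2", which depends on the flash-attn package; flash-attn is not part of the minimal requirements and must be installed separately if you want Flash Attention 2, for example:

pip install flash-attn --no-build-isolation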
💻 Usage Examples
Basic Usage
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load processor and model (trust_remote_code is required for the OpenVLA model class)
processor = AutoProcessor.from_pretrained("openvla/openvla-v01-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires flash-attn
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Grab image input and format prompt; OpenVLA v0.1 expects the system prompt below,
# and <INSTRUCTION> is a placeholder for your language instruction
image: Image.Image = get_from_camera(...)
system_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
prompt = f"{system_prompt} USER: What action should the robot take to {<INSTRUCTION>}? ASSISTANT:"

# Predict action (7-DoF; un-normalized for BridgeV2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute on the robot
robot.act(action, ...)
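If flash-attn is unavailable or your GPU lacks bfloat16 support, a minimal fallback (an assumption about your environment, not an official configuration) is to load the model with the default attention implementation and a different dtype, trading speed for portability:

from transformers import AutoModelForVision2Seq, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("openvla/openvla-v01-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    torch_dtype=torch.float16,   # or torch.float32 on CPU
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")                   # or "cpu"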
Advanced Usage
For more examples, including scripts for fine-tuning OpenVLA models on your own robot demonstration datasets, see our training repository. A rough sketch of how adapter-based fine-tuning can be set up is shown below.
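The full fine-tuning pipeline (dataset loading, action tokenization, and the training loop) lives in the training repository. As a sketch only, the snippet below shows how one might attach LoRA adapters to the loaded model with the peft library so that only a small fraction of parameters is trained; the rank, target modules, and other hyperparameters here are illustrative assumptions, not the repository's exact recipe.

from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration (assumed values, not the official recipe)
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

# `vla` is the model loaded as in the Basic Usage example above
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()  # only the adapter weights are trainable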
📚 Documentation
Model Summary
Uses
OpenVLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions need to be un-normalized subject to statistics computed on a per-robot, per-dataset basis. See our repository for more information.
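As a rough illustration of that un-normalization step, the sketch below maps a normalized action back into a robot's native action range using per-dimension dataset statistics. The statistic names (q01, q99), the normalization-to-[-1, 1] convention, and the example values are assumptions for illustration; in practice, predict_action performs this step internally when you pass a valid unnorm_key.

import numpy as np

def unnormalize_action(normalized_action, q01, q99):
    # Map a normalized action in [-1, 1] back to the robot's native range,
    # assuming per-dimension 1st/99th percentile statistics (q01, q99)
    # computed over the target robot's dataset.
    normalized_action = np.asarray(normalized_action, dtype=np.float64)
    q01, q99 = np.asarray(q01, dtype=np.float64), np.asarray(q99, dtype=np.float64)
    return 0.5 * (normalized_action + 1.0) * (q99 - q01) + q01

# Example with made-up statistics for a 7-DoF action
q01 = np.array([-0.05, -0.05, -0.05, -0.3, -0.3, -0.3, 0.0])
q99 = np.array([ 0.05,  0.05,  0.05,  0.3,  0.3,  0.3, 1.0])
action = unnormalize_action([0.1, -0.2, 0.0, 0.5, 0.0, -0.5, 1.0], q01, q99)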
OpenVLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open-X pretraining mixture (e.g., for BridgeV2 environments with a Widow-X robot). They can also be efficiently fine-tuned for new tasks and robot setups given minimal demonstration data; see here.
⚠️ Important Note
OpenVLA models do not zero-shot generalize to new (unseen) robot embodiments, or setups that are not represented in the pretraining mix; in these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning OpenVLA models instead.
📄 License
All OpenVLA checkpoints and the training codebase are released under an MIT License.
📚 Citation
BibTeX:
@article{kim24openvla,
title={OpenVLA: An Open-Source Vision-Language-Action Model},
author={{Moo Jin} Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
journal = {arXiv preprint arXiv:2406.09246},
year={2024}
}