🚀 OpenVLA v0.1 7B
OpenVLA v0.1 7B is an early model developed for research. For our best model, visit openvla/openvla-7b.
OpenVLA v0.1 7B (openvla-v01-7b) is an open-source vision-language-action model trained on 800K robot manipulation episodes from the Open X-Embodiment dataset, the same data mixture used by Octo. The model takes language instructions and camera images as input and generates robot actions. It supports controlling multiple robots out of the box and can be efficiently fine-tuned for new robot domains.
All OpenVLA checkpoints and the training codebase are released under an MIT License.
For detailed information, refer to our paper and our project page.
✨ Features
- Multimodal Input: Accepts both language instructions and camera images to generate robot actions.
- Multi-Robot Compatibility: Supports controlling multiple robots out of the box.
- Fine-Tuning Capability: Can be fine-tuned for new robot domains with minimal data.
📦 Installation
Install the minimal dependencies with:
pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
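Note that the usage example below loads the model with attn_implementation="flash_attention_2", which depends on the flash-attn package; flash-attn is not part of the minimal requirements and must be installed separately if you want Flash Attention 2, for example:

pip install flash-attn --no-build-isolation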
💻 Usage Examples
Basic Usage
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load processor and model (trust_remote_code is required for the OpenVLA model class)
processor = AutoProcessor.from_pretrained("openvla/openvla-v01-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires flash-attn
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Grab image input and format prompt; OpenVLA v0.1 expects the system prompt below,
# and <INSTRUCTION> is a placeholder for your language instruction
image: Image.Image = get_from_camera(...)
system_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
prompt = f"{system_prompt} USER: What action should the robot take to {<INSTRUCTION>}? ASSISTANT:"

# Predict action (7-DoF; un-normalized for BridgeV2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute on the robot
robot.act(action, ...)
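If flash-attn is unavailable or your GPU lacks bfloat16 support, a minimal fallback (an assumption about your environment, not an official configuration) is to load the model with the default attention implementation and a different dtype, trading speed for portability:

from transformers import AutoModelForVision2Seq, AutoProcessor
import torch

processor = AutoProcessor.from_pretrained("openvla/openvla-v01-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    torch_dtype=torch.float16,   # or torch.float32 on CPU
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")                   # or "cpu"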
Advanced Usage
For more examples, including scripts for fine-tuning OpenVLA models on your own robot demonstration datasets, see our training repository. A rough sketch of how adapter-based fine-tuning can be set up is shown below.
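The full fine-tuning pipeline (dataset loading, action tokenization, and the training loop) lives in the training repository. As a sketch only, the snippet below shows how one might attach LoRA adapters to the loaded model with the peft library so that only a small fraction of parameters is trained; the rank, target modules, and other hyperparameters here are illustrative assumptions, not the repository's exact recipe.

from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration (assumed values, not the official recipe)
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

# `vla` is the model loaded as in the Basic Usage example above
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()  # only the adapter weights are trainable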
📚 Documentation
Model Summary
Uses
OpenVLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions need to be un-normalized subject to statistics computed on a per-robot, per-dataset basis. See our repository for more information.
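As a rough illustration of that un-normalization step, the sketch below maps a normalized action back into a robot's native action range using per-dimension dataset statistics. The statistic names (q01, q99), the normalization-to-[-1, 1] convention, and the example values are assumptions for illustration; in practice, predict_action performs this step internally when you pass a valid unnorm_key.

import numpy as np

def unnormalize_action(normalized_action, q01, q99):
    # Map a normalized action in [-1, 1] back to the robot's native range,
    # assuming per-dimension 1st/99th percentile statistics (q01, q99)
    # computed over the target robot's dataset.
    normalized_action = np.asarray(normalized_action, dtype=np.float64)
    q01, q99 = np.asarray(q01, dtype=np.float64), np.asarray(q99, dtype=np.float64)
    return 0.5 * (normalized_action + 1.0) * (q99 - q01) + q01

# Example with made-up statistics for a 7-DoF action
q01 = np.array([-0.05, -0.05, -0.05, -0.3, -0.3, -0.3, 0.0])
q99 = np.array([ 0.05,  0.05,  0.05,  0.3,  0.3,  0.3, 1.0])
action = unnormalize_action([0.1, -0.2, 0.0, 0.5, 0.0, -0.5, 1.0], q01, q99)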
OpenVLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open-X pretraining mixture (e.g., for BridgeV2 environments with a Widow-X robot). They can also be efficiently fine-tuned for new tasks and robot setups given minimal demonstration data; see here.
⚠️ Important Note
OpenVLA models do not zero-shot generalize to new (unseen) robot embodiments, or setups that are not represented in the pretraining mix; in these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning OpenVLA models instead.
📄 License
All OpenVLA checkpoints and the training codebase are released under an MIT License.
📚 Citation
BibTeX:
@article{kim24openvla,
title={OpenVLA: An Open-Source Vision-Language-Action Model},
author={{Moo Jin} Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
journal = {arXiv preprint arXiv:2406.09246},
year={2024}
}