OpenVLA 7B
OpenVLA 7B (openvla-7b) is an open vision-language-action model trained on 970K robot manipulation episodes from the Open X-Embodiment dataset. The model takes a language instruction and camera images as input and generates robot actions. It can directly control multiple robots and can be quickly adapted to new robot domains through (parameter-efficient) fine-tuning.
All OpenVLA checkpoints and our training codebase are released under an MIT License. For full details, please read our paper and visit our project page.
Quick Start
Out of the box, OpenVLA 7B can control multiple robots for domains represented in the pretraining mixture. For example, the snippet below loads openvla-7b for zero-shot instruction following in the BridgeV2 environments with a Widow-X robot:
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load the processor and the VLA model (trust_remote_code is required for the custom model class)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    attn_implementation="flash_attention_2",  # optional; requires the flash_attn package
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Grab an image from the robot's camera and format the prompt
# (replace {<INSTRUCTION>} with the natural-language task description)
image: Image.Image = get_from_camera(...)
prompt = "In: What action should the robot take to {<INSTRUCTION>}?\nOut:"

# Predict a 7-DoF action, un-normalized with the BridgeV2 ("bridge_orig") statistics
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute the action on the robot
robot.act(action, ...)
For more examples, including scripts for fine-tuning OpenVLA models on your own robot demonstration datasets, see our training repository.
Features
- Multi-robot control: Directly controls multiple robots without additional complex configuration.
- Adaptability: Can be quickly adapted to new robot domains through (parameter-efficient) fine-tuning.
- Zero-shot application: Can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open-X pretraining mixture.
Installation
To use OpenVLA 7B, you need to install the minimal dependencies:
pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
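Before loading the model, a quick sanity check like the following (an illustrative snippet, not part of the official instructions) can confirm that the core dependencies and a CUDA device are available:

import torch
import transformers

# Report versions of the core dependencies installed from requirements-min.txt
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)

# The Quick Start snippet runs inference on a CUDA GPU in bfloat16
print("CUDA available:", torch.cuda.is_available())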
Usage Examples
Basic Usage
OpenVLA 7B can be used for zero-shot instruction following in the BridgeV2 environments with a Widow-X robot; the code is shown in the Quick Start section above.
Advanced Usage
For fine-tuning OpenVLA models on your own robot demonstration datasets, you can refer to the scripts in our training repository.
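As a rough illustration of what parameter-efficient fine-tuning can look like, the sketch below wraps the model's linear layers with LoRA adapters via Hugging Face peft. The rank, alpha, and target-module choices are illustrative assumptions, not the settings of the official scripts; the training repository remains the reference for actual fine-tuning.

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Load the base checkpoint (same options as the Quick Start snippet, minus flash attention)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# Illustrative LoRA configuration: only the low-rank adapters are trained,
# which keeps the number of trainable parameters small
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",
    init_lora_weights="gaussian",
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()

# From here, train on (image, instruction, action) demonstrations with the usual
# next-token prediction loss over the tokenized actions, then save or merge the adapters.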
Documentation
Model Summary
| Property | Details |
|----------|---------|
| Developed by | The OpenVLA team, consisting of researchers from Stanford, UC Berkeley, Google DeepMind, and the Toyota Research Institute. |
| Model Type | Vision-language-action (language, image => robot actions) |
| Language(s) (NLP) | en |
| License | MIT |
| Finetuned from | [prism-dinosiglip-224px](https://github.com/TRI-ML/prismatic-vlms), a VLM trained from: Vision Backbone: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14; Language Model: Llama-2 |
| Pretraining Dataset | [Open X-Embodiment](https://robotics-transformer-x.github.io/) -- specific component datasets can be found here. |
| Repository | https://github.com/openvla/openvla |
| Paper | OpenVLA: An Open-Source Vision-Language-Action Model |
| Project Page & Videos | https://openvla.github.io/ |
Uses
OpenVLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions need to be un-normalized subject to statistics computed on a per-robot, per-dataset basis. See our repository for more information.
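As a rough illustration of what this un-normalization step does, the sketch below maps a predicted action back to the robot's native range using per-dimension dataset statistics. It assumes actions were normalized to [-1, 1] with low/high quantile bounds; the function and parameter names are illustrative, not the library's API, and predict_action already applies this step when unnorm_key is set.

import numpy as np

def unnormalize_action(action_norm, q_low, q_high):
    """Map a normalized action in [-1, 1] back to the robot's native action range
    using per-dimension low/high statistics (illustrative only)."""
    action_norm = np.asarray(action_norm, dtype=np.float64)
    q_low = np.asarray(q_low, dtype=np.float64)
    q_high = np.asarray(q_high, dtype=np.float64)
    # -1 maps to q_low, +1 maps to q_high, linearly in between (per action dimension)
    return 0.5 * (action_norm + 1.0) * (q_high - q_low) + q_low

Because these statistics differ across robots and datasets, using the wrong unnorm_key would scale the predicted deltas incorrectly.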
OpenVLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open-X pretraining mixture (e.g., for [BridgeV2 environments with a Widow-X robot](https://rail-berkeley.github.io/bridgedata/)). They can also be efficiently fine-tuned for new tasks and robot setups given minimal demonstration data; see here.
Out-of-Scope: OpenVLA models do not zero-shot generalize to new (unseen) robot embodiments, or to setups that are not represented in the pretraining mix; in these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning OpenVLA models instead.
License
All OpenVLA checkpoints and the training codebase are released under an MIT License.
Citation
BibTeX:
@article{kim24openvla,
    title={OpenVLA: An Open-Source Vision-Language-Action Model},
    author={{Moo Jin} Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
    journal={arXiv preprint arXiv:2406.09246},
    year={2024}
}