OpenVLA v0.1 7B Open-Source Model: A Vision-Language-Action Tool for Controlling Multiple Robots

Openvla V01 7b

Developed by openvla

OpenVLA v0.1 7B is an open-source vision-language-action model trained on the Open X-Embodiment dataset that supports controlling multiple robots.

Transformers

Language: English | License: MIT | #RobotActionControl #MultimodalVisionLanguage #ZeroShotGeneralization

Downloads: 30

Released: 6/10/2024

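Model Overview

OpenVLA v0.1 7B is a vision-language-action model that takes a language instruction and a camera image as input and generates robot actions. It can control multiple robots out of the box and can be quickly adapted to new robot domains through fine-tuning.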

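Model Features

Multi-robot support

Controls, out of the box, the robots already included in the pretraining data

Efficient fine-tuning

Can be fine-tuned efficiently on a small amount of demonstration data to adapt to new tasks and robot setups

Open source

All checkpoints and the training codebase are released under the MIT License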

Model Capabilities

Robot action prediction

Vision-language understanding

Multimodal input processing

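Use Cases

Robot control

Zero-shot robot control

Executes instructions zero-shot on robot setups covered by the pretraining data

Can control robots that appear in the pretraining data, such as the Widow-X

Adapting to new domains

Adapts quickly to new robot domains through fine-tuning

Requires collecting a demonstration dataset on the target setup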

🚀 OpenVLA v0.1 7B

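OpenVLA v0.1 7B is an open-source vision-language-action model trained on the Open X-Embodiment dataset. It takes a language instruction and a camera image as input and generates robot actions. It can control multiple robots out of the box and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.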

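Note

OpenVLA v0.1 is an early model that we trained for development purposes; for our best model, see [openvla/openvla-7b](https://huggingface.co/openvla/openvla-7b).

All OpenVLA checkpoints, as well as our training codebase, are released under the MIT License. For full details, please read our paper and see our project page.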

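🚀 Quick Start

OpenVLA v0.1 7B can be used out of the box to control multiple robots for domains represented in the pretraining mixture. Here is an example that loads `openvla-v01-7b` for zero-shot instruction following in the [BridgeV2 environments](https://rail-berkeley.github.io/bridgedata/) with a Widow-X robot: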

```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)
# > pip install -r https://raw.githubusercontent.com/openvla/openvla/main/requirements-min.txt
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

import torch

# Load Processor & VLA (both from the same v0.1 checkpoint)
processor = AutoProcessor.from_pretrained("openvla/openvla-v01-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-v01-7b",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

# Grab image input & format prompt (note inclusion of system prompt due to Vicuña base model)
# `get_from_camera` is a placeholder for your own camera interface.
image: Image.Image = get_from_camera(...)
instruction = "<INSTRUCTION>"  # Placeholder: replace with the task description
system_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
prompt = f"{system_prompt} USER: What action should the robot take to {instruction}? ASSISTANT:"

# Predict Action (7-DoF; un-normalize for BridgeV2)
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute... (`robot` is a placeholder for your own robot interface)
robot.act(action, ...)
```
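
The prompt layout used in the example above can be factored into a small helper. This is only a minimal sketch: `build_prompt` is a hypothetical name, not part of the OpenVLA API, and it simply reproduces the string format shown in the example (stripping trailing punctuation is an added assumption, so the instruction slots cleanly before the question mark).

```python
# Hypothetical helper (not part of the OpenVLA API): reproduces the prompt
# layout from the quick-start example.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(instruction: str) -> str:
    """Wrap a task instruction in the Vicuna-style chat format."""
    # Assumption: drop surrounding whitespace and trailing punctuation so the
    # template's own "?" is the only sentence terminator.
    instruction = instruction.strip().rstrip(".?!")
    if not instruction:
        raise ValueError("instruction must be a non-empty string")
    return (
        f"{SYSTEM_PROMPT} USER: What action should the robot take to "
        f"{instruction}? ASSISTANT:"
    )


print(build_prompt("pick up the red block"))
```

The returned string can be passed to the processor in place of the inline f-string.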

For more examples, including scripts for fine-tuning OpenVLA models on your own robot demonstration datasets, see our training repository.

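✨ Key Features

Multi-robot control: controls multiple robots out of the box.
Fast adaptation: can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.
Zero-shot use: supports zero-shot robot control for the specific combinations of embodiments and domains seen in the Open-X pretraining mixture.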

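📚 Documentation

Model Summary

Developed by: the OpenVLA team, consisting of researchers from Stanford, UC Berkeley, Google DeepMind, and the Toyota Research Institute.
Model type: vision-language-action (language, image => robot actions)
Language(s) (NLP): English
License: MIT
Finetuned from: [`siglip-224px`](https://github.com/TRI-ML/prismatic-vlms), a vision-language model trained from:
- Vision backbone: SigLIP ViT-So400M/14
- Language model: Vicuna v1.5
Pretraining dataset: [Open X-Embodiment](https://robotics-transformer-x.github.io/); see the OpenVLA repository for the specific component datasets.
Repository: https://github.com/openvla/openvla
Paper: OpenVLA: An Open-Source Vision-Language-Action Model (arXiv:2406.09246)
Project page and videos: https://openvla.github.io/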

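Usage

OpenVLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To be executed on an actual robot platform, actions need to be un-normalized using statistics computed on a per-robot, per-dataset basis. See our repository for more information.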
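
To make the un-normalization step concrete, here is a minimal sketch. It assumes each action dimension was scaled linearly to [-1, 1] using per-dataset lower and upper bounds. The text above only says that per-robot, per-dataset statistics are used, so the exact formula, the `unnormalize_action` and `EndEffectorDelta` names, and the numeric bounds below are illustrative assumptions, not OpenVLA's actual implementation or real BridgeV2 statistics.

```python
from typing import NamedTuple, Sequence


class EndEffectorDelta(NamedTuple):
    """7-DoF action in the order (x, y, z, roll, pitch, yaw, gripper)."""
    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float
    gripper: float


def unnormalize_action(
    normalized: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
) -> EndEffectorDelta:
    """Map each value from [-1, 1] back to [low, high] (assumed linear scaling)."""
    if not (len(normalized) == len(low) == len(high) == 7):
        raise ValueError("expected 7 values for the action and for each bound")
    values = [
        0.5 * (n + 1.0) * (hi - lo) + lo
        for n, lo, hi in zip(normalized, low, high)
    ]
    return EndEffectorDelta(*values)


# Made-up bounds for illustration only (not real dataset statistics).
low = [-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0]
high = [0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 1.0]
action = unnormalize_action([0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 1.0], low, high)
print(action)
```

In the quick-start example this step is handled by passing `unnorm_key` to `predict_action`, so a helper like this is only needed when working with raw normalized outputs.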

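OpenVLA models can be used zero-shot to control robots for the specific combinations of embodiments and domains seen in the Open-X pretraining mixture (e.g., the [BridgeV2 environments with a Widow-X robot](https://rail-berkeley.github.io/bridgedata/)). They can also be efficiently fine-tuned for new tasks and robot setups given minimal demonstration data; see the OpenVLA repository for details.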

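Out-of-Scope Use

OpenVLA models do not zero-shot generalize to new (unseen) robot embodiments or to setups that are not represented in the pretraining mixture; in these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning OpenVLA models instead.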
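
One way to respect this limitation in code is to validate the un-normalization key before running inference. The sketch below is hypothetical: `check_unnorm_key` is not an OpenVLA function, and it assumes you can obtain the set of dataset keys for which the checkpoint stores action statistics. Of the example keys, only `bridge_orig` comes from the quick-start example; the other is made up.

```python
from typing import Iterable


def check_unnorm_key(unnorm_key: str, available_keys: Iterable[str]) -> str:
    """Return the key if statistics exist for it, otherwise fail loudly."""
    known = sorted(set(available_keys))
    if unnorm_key not in known:
        raise KeyError(
            f"No action statistics for {unnorm_key!r}; known keys: {known}. "
            "For an unseen robot setup, collect demonstrations and fine-tune first."
        )
    return unnorm_key


# `bridge_orig` appears in the quick-start example; the second key is invented.
keys = ["bridge_orig", "hypothetical_dataset"]
print(check_unnorm_key("bridge_orig", keys))
```

Failing early like this avoids silently applying another robot's statistics to a setup the model was never trained on.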

📄 License

This project is released under the MIT License.

📖 Citation

@article{kim24openvla,
    title={OpenVLA: An Open-Source Vision-Language-Action Model},
    author={{Moo Jin} Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
    journal = {arXiv preprint arXiv:2406.09246},
    year={2024}
}