CogACT-Base: a free, open-source vision-language-action model for robotic manipulation tasks


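Developed by CogACT

CogACT is a new vision-language-action (VLA) architecture that combines a vision-language model with a dedicated action module for robotic manipulation tasks.

Tags: multimodal fusion, Transformers, vision-language-action model, robot manipulation control, multimodal pre-training

Language: English
License: MIT
Downloads: 6,589
Released: 11/29/2024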

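Model Overview

CogACT is an advanced vision-language-action (VLA) architecture derived from vision-language models (VLMs). Its componentized design converts language instructions and visual input into robot actions.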

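Model Features

Componentized architecture

Uses separate vision, language, and action modules rather than adapting a VLM through simple action quantization.

Multimodal fusion

Integrates the vision, language, and action modalities to carry out complex robotic manipulation tasks.

Zero-shot transfer

Can be applied zero-shot to robot setups that appear in the Open-X pre-training mixture.

Fast adaptation to new tasks

Can be fine-tuned for new tasks and robot setups with a small number of demonstrations.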

Model Capabilities

Vision-language understanding

Robot action prediction

Multimodal fusion

Zero-shot transfer learning

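Use Cases

Robotic manipulation

Object pick-and-place

Predicts action sequences for picking up and placing objects from a language instruction and visual input.

Generates 16 steps of normalized 7-DoF robot actions.

Task-oriented manipulation

Carries out instructions for complex tasks, such as "move sponge near apple".

Generates the action sequence with a conditional diffusion model.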
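The `cfg_scale` argument in the usage example further down suggests that the diffusion sampler uses classifier-free guidance. The following is a minimal sketch of that general technique, not the repository's code: it assumes the action module yields one noise prediction with the condition and one without, and blends them. The function name `guided_noise` and the toy values are hypothetical.

```python
from typing import List, Sequence


def guided_noise(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    cfg_scale: float,
) -> List[float]:
    """Blend conditional and unconditional noise predictions.

    cfg_scale = 1.0 reproduces the conditional prediction; larger values
    push the result further in the direction the condition suggests.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + cfg_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy 7-dimensional noise predictions for a single action step (made-up values).
eps_c = [0.2, -0.1, 0.0, 0.3, 0.0, -0.2, 1.0]
eps_u = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
guided = guided_noise(eps_c, eps_u, cfg_scale=1.5)
```

With a scale of 1.5, each guided component lies half a step beyond the conditional prediction, measured from the unconditional one.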

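🚀 CogACT-Base

CogACT is a new advanced vision-language-action (VLA) architecture derived from vision-language models (VLMs). Unlike previous works that directly repurpose a VLM for action prediction through simple action quantization, we propose a componentized VLA architecture with a specialized action module conditioned on the VLM output. CogACT-Base uses a DiT-Base model as the action module.

All our code and pre-trained model weights are released under the MIT license.

Please refer to our project page and paper for more details.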

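🚀 Quick Start

CogACT takes a language instruction and a single-view RGB image as input and predicts the next 16 normalized robot actions (7-DoF end-effector deltas of the form x, y, z, roll, pitch, yaw, gripper). These actions should be un-normalized and, optionally, integrated with our Adaptive Action Ensemble. Both un-normalization and ensembling depend on dataset statistics.

CogACT models can be used zero-shot to control robots in setups seen in the Open-X pre-training mixture. They can also be fine-tuned for new tasks and robot setups with a very small number of demonstrations. See our code repository for more information.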
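The un-normalization step can be sketched as follows. This is a minimal illustration, assuming each action dimension was normalized to [-1, 1] using per-dimension lower and upper bounds taken from dataset statistics. The function `unnormalize_actions` and the bounds `LOW` and `HIGH` are hypothetical; in the usage example, `predict_action` selects the statistics through `unnorm_key`.

```python
from typing import List, Sequence


def unnormalize_actions(
    normalized: Sequence[Sequence[float]],
    low: Sequence[float],
    high: Sequence[float],
) -> List[List[float]]:
    """Map each action dimension from [-1, 1] back to [low, high]."""
    result = []
    for action in normalized:
        if not (len(action) == len(low) == len(high)):
            raise ValueError("action and bounds must have the same length")
        # Clip to the normalized range before rescaling.
        clipped = [max(-1.0, min(1.0, a)) for a in action]
        result.append([
            0.5 * (a + 1.0) * (hi - lo) + lo
            for a, lo, hi in zip(clipped, low, high)
        ])
    return result


# Hypothetical bounds for (x, y, z, roll, pitch, yaw, gripper); not real dataset statistics.
LOW = [-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0]
HIGH = [0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 1.0]

chunk = [[0.0] * 7 for _ in range(16)]  # dummy 16-step normalized prediction
real = unnormalize_actions(chunk, LOW, HIGH)
```

A normalized value of 0 lands at the midpoint of each range, so the dummy chunk maps to zero deltas and a gripper value of 0.5 under these made-up bounds.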

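✨ Key Features

A componentized VLA architecture with a specialized action module.
Zero-shot robot control, plus fine-tuning for new tasks and setups.

📦 Installation

This document does not give specific installation steps; follow the instructions in the code repository.

💻 Usage Example

Basic usage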

# Please clone and install dependencies in our repo
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, ...)

from PIL import Image
from vla import load_vla
import torch

model = load_vla(
    'CogACT/CogACT-Base',
    load_for_training=False,
    action_model_type='DiT-B',
    future_action_window_size=15,
)
# about 30G memory in fp32

# (Optional) load the VLM in bf16:
# model.vlm = model.vlm.to(torch.bfloat16)

model.to('cuda:0').eval()

# Placeholder path: replace with your own single-view RGB image
image: Image.Image = Image.open('path/to/your_image.png').convert('RGB')
prompt = "move sponge near apple"       # input your prompt

# Predict actions (7-DoF; un-normalized for RT-1 Google robot data, i.e. fractal20220817_data)
actions, _ = model.predict_action(
    image,
    prompt,
    unnorm_key='fractal20220817_data',  # input your unnorm_key of dataset
    cfg_scale=1.5,                      # cfg from 1.5 to 7 also performs well
    use_ddim=True,                      # use DDIM sampling
    num_ddim_steps=10,                  # number of steps for DDIM sampling
)

# results in 7-DoF actions of 16 steps with shape [16, 7]
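
The source names an Adaptive Action Ensemble but does not describe how it works. As a rough illustration only, the sketch below assumes the model is queried at every control step, so that several earlier 16-step predictions also cover the current timestep. It combines them with weights that grow with each older prediction's cosine similarity to the newest one. The function names, the weighting rule as written here, and the `alpha` value are assumptions for illustration, not the repository's implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def ensemble_action(
    candidates: Sequence[Sequence[float]],
    alpha: float = 0.1,  # hypothetical default
) -> List[float]:
    """Similarity-weighted average of predictions for one timestep.

    candidates[0] is the newest prediction; the rest are older predictions
    of the same timestep taken from earlier action chunks.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    newest = candidates[0]
    weights = [math.exp(alpha * cosine_similarity(newest, c)) for c in candidates]
    total = sum(weights)
    return [
        sum(w * c[i] for w, c in zip(weights, candidates)) / total
        for i in range(len(newest))
    ]


# Two made-up 7-DoF predictions for the same timestep.
new_pred = [0.01, 0.00, -0.02, 0.0, 0.0, 0.1, 1.0]
old_pred = [0.02, 0.00, -0.01, 0.0, 0.0, 0.1, 1.0]
smoothed = ensemble_action([new_pred, old_pred])
```

Setting `alpha` to 0 reduces this to a plain average, while larger values down-weight older predictions that disagree in direction with the newest one.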

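📚 Documentation

Model Overview

Property	Details
Developed by	The CogACT team, made up of researchers from Microsoft Research Asia.
Model type	Vision-language-action (language, image => robot actions)
Language(s) (NLP)	English
License	MIT
Model components	Vision backbone: DINOv2 ViT-L/14 and SigLIP ViT-So400M/14; language model: Llama-2; action model: DiT-Base
Pre-training dataset	A subset of Open X-Embodiment
Repository	https://github.com/microsoft/CogACT
Paper	CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Project page	https://cogact.github.io/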

📄 License

This project is released under the MIT license.

📖 Citation

If you use CogACT in your research, please cite it with the following BibTeX entry:

@article{li2024cogact,
  title={CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation},
  author={Li, Qixiu and Liang, Yaobo and Wang, Zeyu and Luo, Lin and Chen, Xi and Liao, Mozheng and Wei, Fangyun and Deng, Yu and Xu, Sicheng and Zhang, Yizhong and others},
  journal={arXiv preprint arXiv:2411.19650},
  year={2024}
}