CogACT Large
CogACT is an advanced Vision-Language-Action (VLA) architecture derived from Vision-Language Models (VLMs) and designed specifically for robot manipulation.
Downloads: 122
Release date: 11/30/2024
Model Overview
CogACT is a modular Vision-Language-Action model that predicts robot actions by conditioning a dedicated action module on the output of a vision-language model. It can be applied zero-shot to robot setups covered by its pre-training datasets and adapted to new tasks and robots with minimal fine-tuning.
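A minimal inference sketch is shown below. It assumes the `load_vla` loader and `predict_action` method exposed by the official CogACT codebase; the exact argument names (for example `unnorm_key`) are illustrative and should be verified against the repository.

```python
# Minimal inference sketch. The `load_vla` / `predict_action` interface and the
# argument names below are assumptions based on the official CogACT codebase;
# verify them against the repository before use.
from PIL import Image
from vla import load_vla  # assumed loader from the CogACT repo

# Load the pretrained CogACT-Large checkpoint for inference.
model = load_vla("CogACT/CogACT-Large", load_for_training=False)
model.to("cuda:0").eval()

image = Image.open("observation.png")           # third-person RGB observation
instruction = "move the sponge near the apple"  # language instruction

# Predict a chunk of future actions; `unnorm_key` selects the per-dataset
# statistics used to map normalized outputs back to physical action ranges.
actions, _ = model.predict_action(
    image,
    instruction,
    unnorm_key="fractal20220817_data",  # example key; pick your robot's dataset
)
print(actions.shape)  # expected (16, 7): 16 future steps, 7 DoF each
```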
Model Features
Modular Architecture
Employs separate vision, language, and action modules rather than simply adapting a VLM to predict actions directly.
Adaptive Action Integration
Supports action de-normalization and integration so that predicted actions match the statistics of different datasets (see the sketch after this feature list).
Zero-shot Transfer Capability
Can be applied directly to robot setups included in the Open-X pre-training mixture.
Few-shot Fine-tuning
Adapts to new tasks and robot configurations with very few demonstration samples.
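The de-normalization mentioned under Adaptive Action Integration follows the usual Open-X-style scheme: model outputs in [-1, 1] are mapped back to each dataset's physical action range using per-dataset statistics. The sketch below is illustrative; the dataset key and quantile values are hypothetical placeholders, not CogACT's actual statistics.

```python
import numpy as np

# Hypothetical per-dataset action statistics (quantile bounds); the actual
# values come from the pre-training data and differ per dataset.
ACTION_STATS = {
    "example_dataset": {
        "q01": np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0]),
        "q99": np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0]),
    },
}

def unnormalize(actions: np.ndarray, unnorm_key: str) -> np.ndarray:
    """Map normalized actions in [-1, 1] back to a dataset's physical range."""
    stats = ACTION_STATS[unnorm_key]
    low, high = stats["q01"], stats["q99"]
    return 0.5 * (actions + 1.0) * (high - low) + low

# A predicted 16-step, 7-DoF chunk of normalized actions.
normalized_chunk = np.random.uniform(-1.0, 1.0, size=(16, 7))
physical_chunk = unnormalize(normalized_chunk, "example_dataset")
print(physical_chunk.shape)  # (16, 7)
```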
Model Capabilities
Vision-Language Understanding
Robot Action Prediction
Multimodal Task Processing
Zero-shot Transfer Learning
Use Cases
Robot Manipulation
Object Grasping and Placing
Predicts action sequences for grasping and placing objects based on language instructions and visual input.
Generates normalized action chunks of 16 steps, each a 7-degree-of-freedom robot action.
Task-Oriented Manipulation
Executes specific task instructions such as 'move the sponge near the apple'.
Generates precise action sequences with a diffusion-based action module.
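As a rough illustration of the diffusion-based generation step, the sketch below denoises a 16-step, 7-DoF action chunk conditioned on a feature from the vision-language backbone. The denoising function is a stand-in for the learned action module and the feature dimension is a hypothetical value; this is a conceptual outline, not CogACT's implementation.

```python
import torch

# Conceptual sketch of diffusion-style action generation: start from Gaussian
# noise and iteratively denoise a (16, 7) action chunk conditioned on a
# cognition feature from the VLM. The real action module is a learned network;
# `denoise_step` below is only a placeholder so the sketch runs.
NUM_STEPS, CHUNK_LEN, ACTION_DIM = 10, 16, 7

def denoise_step(noisy_actions: torch.Tensor, cond: torch.Tensor, t: int) -> torch.Tensor:
    """Placeholder for one learned reverse-diffusion step of the action module."""
    return 0.9 * noisy_actions  # stand-in: simply shrink the noise each step

cond = torch.randn(1, 4096)                      # hypothetical cognition feature
actions = torch.randn(1, CHUNK_LEN, ACTION_DIM)  # initial Gaussian noise
for t in reversed(range(NUM_STEPS)):
    actions = denoise_step(actions, cond, t)

print(actions.shape)  # torch.Size([1, 16, 7]) -- one 16-step, 7-DoF chunk
```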