CogACT-Base
CogACT is a novel Vision-Language-Action (VLA) architecture that combines vision-language models with specialized action modules for robotic manipulation tasks.
Downloads: 6,589
Release Time: 11/29/2024
Model Overview
CogACT is an advanced Vision-Language-Action (VLA) architecture derived from a Vision-Language Model (VLM): it translates language instructions and visual inputs into robotic actions through a componentized design.
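For orientation, the intended inference flow (camera image plus language instruction in, a chunk of robot actions out) can be sketched as below. The load_vla / predict_action helpers follow the usage pattern published in the CogACT repository, but the exact argument names and default values shown here should be treated as assumptions rather than a verified API.

```python
# Minimal sketch of CogACT inference: instruction + image -> 16-step, 7-DoF action chunk.
# Assumes the load_vla / predict_action helpers from the CogACT codebase; argument
# names and values (action_model_type, unnorm_key, cfg_scale, ...) are illustrative.
import torch
from PIL import Image
from vla import load_vla  # provided by the CogACT codebase (assumption)

model = load_vla(
    "CogACT/CogACT-Base",          # model id on the hub
    load_for_training=False,
    action_model_type="DiT-B",     # diffusion action head used by the Base variant
    future_action_window_size=15,  # 15 future actions + the current one = 16 steps
)
model.to("cuda:0").eval()

image = Image.open("third_person_view.png")      # current camera observation (hypothetical file)
instruction = "move the sponge near the apple"   # natural-language task

# Returns a chunk of future actions; each action is a 7-DoF end-effector command.
actions, _ = model.predict_action(
    image,
    instruction,
    unnorm_key="fractal20220817_data",  # dataset statistics used to un-normalize actions
    cfg_scale=1.5,                      # classifier-free guidance strength
    use_ddim=True,
    num_ddim_steps=10,
)
print(actions.shape)  # expected: (16, 7)
```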
Model Features
Component-based Architecture
Uses dedicated vision, language, and action components rather than repurposing the VLM for action prediction through simple action quantization (see the sketch after this list).
Multimodal Fusion
Integrates vision, language, and action modalities to accomplish complex robotic manipulation tasks.
Zero-shot Transfer Capability
Can be deployed zero-shot to the robot setups represented in the Open X-Embodiment pretraining mixture.
Rapid Adaptation to New Tasks
Can be fine-tuned for new tasks and robotic configurations with minimal demonstration samples.
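To make the componentized split concrete, the toy sketch below wires a stand-in "cognition" module (in place of the VLM) to a stand-in diffusion action head. Every class name, dimension, and the simplistic denoising loop is illustrative only; it mirrors the structure described above, not the actual CogACT implementation.

```python
# Illustrative-only sketch of the componentized design: a VLM "cognition" stage
# feeds a separate diffusion-based action module. These classes are NOT the
# actual CogACT implementation; they only mirror the structure described above.
import torch
import torch.nn as nn


class CognitionStub(nn.Module):
    """Stand-in for the vision-language model: image + instruction -> cognition feature."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, d_model)

    def forward(self, image: torch.Tensor, instruction: str) -> torch.Tensor:
        # A real VLM would fuse visual tokens with the tokenized instruction;
        # here the image is simply flattened to keep the sketch runnable.
        return self.proj(image.flatten())


class DiffusionActionHead(nn.Module):
    """Denoises a chunk of future actions conditioned on the cognition feature."""

    def __init__(self, d_model: int = 256, horizon: int = 16, action_dim: int = 7, steps: int = 10):
        super().__init__()
        self.horizon, self.action_dim, self.steps = horizon, action_dim, steps
        self.denoiser = nn.Sequential(
            nn.Linear(horizon * action_dim + d_model, 512),
            nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )

    @torch.no_grad()
    def sample(self, cond: torch.Tensor) -> torch.Tensor:
        # Start from Gaussian noise and iteratively refine, conditioned on the VLM output.
        actions = torch.randn(self.horizon * self.action_dim)
        for _ in range(self.steps):
            actions = self.denoiser(torch.cat([actions, cond]))
        return actions.view(self.horizon, self.action_dim)  # (16, 7) normalized actions


# Wiring the two components together, mirroring the "cognition then action" split.
cognition, action_head = CognitionStub(), DiffusionActionHead()
feature = cognition(torch.rand(3, 224, 224), "move the sponge near the apple")
print(action_head.sample(feature).shape)  # torch.Size([16, 7])
```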
Model Capabilities
Vision-Language Understanding
Robot Action Prediction
Multimodal Fusion
Zero-shot Transfer Learning
Use Cases
Robot Manipulation
Object Grasping and Placement
Predicts action sequences for grasping and placing objects based on language instructions and visual inputs.
Generates normalized 16-step action sequences with 7 degrees of freedom (7-DoF) per action; see the consumption sketch after this section.
Task-Oriented Manipulation
Executes complex tasks such as 'move the sponge near the apple' based on instructions.
Generates precise action sequences with a diffusion-based action module conditioned on the VLM output.
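A typical way to consume the predicted chunk is receding-horizon execution: run the first few of the 16 actions on the robot, then re-query the model with a fresh observation. The 7-DoF ordering (translation, rotation, gripper) and the robot.apply_delta_action call below are assumptions for illustration, not a documented interface.

```python
# Sketch of consuming the predicted action chunk. Assumes each 7-DoF action is
# ordered as (dx, dy, dz, droll, dpitch, dyaw, gripper) -- a common VLA convention,
# not confirmed here -- and that robot.apply_delta_action is a hypothetical API.
import numpy as np


def execute_chunk(actions: np.ndarray, robot, steps_to_execute: int = 4) -> None:
    """Execute the first few actions of a (16, 7) chunk, then let the caller
    re-query the model (receding-horizon control) with a new observation."""
    assert actions.shape == (16, 7), "CogACT-Base predicts 16 future 7-DoF actions"
    for action in actions[:steps_to_execute]:
        delta_pos = action[:3]   # end-effector translation
        delta_rot = action[3:6]  # end-effector rotation (e.g., roll/pitch/yaw)
        gripper = action[6]      # open/close command
        robot.apply_delta_action(delta_pos, delta_rot, gripper)  # hypothetical robot API
```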