CogACT Large
CogACT is an advanced Vision-Language-Action (VLA) architecture derived from Vision-Language Models (VLMs) and designed specifically for robot manipulation.
Downloads: 122
Release date: 11/30/2024
Model Overview
CogACT is a modular Vision-Language-Action model that predicts robot actions by conditioning a dedicated action module on the output of a vision-language model. It can be applied zero-shot to robot setups covered by its pre-training datasets and adapted to new tasks and robots with minimal fine-tuning.
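A minimal inference sketch is shown below. It assumes the `load_vla` loader and `predict_action` method exposed by the official CogACT codebase; the exact argument names (for example `unnorm_key`) are illustrative and should be verified against the repository.

```python
# Minimal inference sketch. The `load_vla` / `predict_action` interface and the
# argument names below are assumptions based on the official CogACT codebase;
# verify them against the repository before use.
from PIL import Image
from vla import load_vla  # assumed loader from the CogACT repo

# Load the pretrained CogACT-Large checkpoint for inference.
model = load_vla("CogACT/CogACT-Large", load_for_training=False)
model.to("cuda:0").eval()

image = Image.open("observation.png")           # third-person RGB observation
instruction = "move the sponge near the apple"  # language instruction

# Predict a chunk of future actions; `unnorm_key` selects the per-dataset
# statistics used to map normalized outputs back to physical action ranges.
actions, _ = model.predict_action(
    image,
    instruction,
    unnorm_key="fractal20220817_data",  # example key; pick your robot's dataset
)
print(actions.shape)  # expected (16, 7): 16 future steps, 7 DoF each
```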
Model Features
Modular Architecture
Employs separate vision, language, and action modules rather than simply adapting a VLM to predict actions directly.
Adaptive Action Integration
Supports action de-normalization and integration so that predicted actions match the statistics of different datasets (see the sketch after this feature list).
Zero-shot Transfer Capability
Can be applied directly to robot setups included in the Open-X pre-training mixture.
Few-shot Fine-tuning
Adapts to new tasks and robot configurations with very few demonstration samples.
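The de-normalization mentioned under Adaptive Action Integration follows the usual Open-X-style scheme: model outputs in [-1, 1] are mapped back to each dataset's physical action range using per-dataset statistics. The sketch below is illustrative; the dataset key and quantile values are hypothetical placeholders, not CogACT's actual statistics.

```python
import numpy as np

# Hypothetical per-dataset action statistics (quantile bounds); the actual
# values come from the pre-training data and differ per dataset.
ACTION_STATS = {
    "example_dataset": {
        "q01": np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0]),
        "q99": np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0]),
    },
}

def unnormalize(actions: np.ndarray, unnorm_key: str) -> np.ndarray:
    """Map normalized actions in [-1, 1] back to a dataset's physical range."""
    stats = ACTION_STATS[unnorm_key]
    low, high = stats["q01"], stats["q99"]
    return 0.5 * (actions + 1.0) * (high - low) + low

# A predicted 16-step, 7-DoF chunk of normalized actions.
normalized_chunk = np.random.uniform(-1.0, 1.0, size=(16, 7))
physical_chunk = unnormalize(normalized_chunk, "example_dataset")
print(physical_chunk.shape)  # (16, 7)
```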
Model Capabilities
Vision-Language Understanding
Robot Action Prediction
Multimodal Task Processing
Zero-shot Transfer Learning
Use Cases
Robot Manipulation
Object Grasping and Placing
Predicts action sequences for grasping and placing objects based on language instructions and visual input.
Generates normalized action chunks of 16 steps, each a 7-degree-of-freedom robot action.
Task-Oriented Manipulation
Executes specific task instructions such as 'move the sponge near the apple'.
Generates precise action sequences with a diffusion-based action module.
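As a rough illustration of the diffusion-based generation step, the sketch below denoises a 16-step, 7-DoF action chunk conditioned on a feature from the vision-language backbone. The denoising function is a stand-in for the learned action module and the feature dimension is a hypothetical value; this is a conceptual outline, not CogACT's implementation.

```python
import torch

# Conceptual sketch of diffusion-style action generation: start from Gaussian
# noise and iteratively denoise a (16, 7) action chunk conditioned on a
# cognition feature from the VLM. The real action module is a learned network;
# `denoise_step` below is only a placeholder so the sketch runs.
NUM_STEPS, CHUNK_LEN, ACTION_DIM = 10, 16, 7

def denoise_step(noisy_actions: torch.Tensor, cond: torch.Tensor, t: int) -> torch.Tensor:
    """Placeholder for one learned reverse-diffusion step of the action module."""
    return 0.9 * noisy_actions  # stand-in: simply shrink the noise each step

cond = torch.randn(1, 4096)                      # hypothetical cognition feature
actions = torch.randn(1, CHUNK_LEN, ACTION_DIM)  # initial Gaussian noise
for t in reversed(range(NUM_STEPS)):
    actions = denoise_step(actions, cond, t)

print(actions.shape)  # torch.Size([1, 16, 7]) -- one 16-step, 7-DoF chunk
```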