
CogACT-Small

Developed by CogACT
CogACT is an advanced Vision-Language-Action (VLA) architecture derived from Vision-Language Models (VLMs) and designed specifically for robot manipulation.
Downloads 405
Release Time: 11/30/2024

Model Overview

CogACT is a modular vision-language-action model that transforms the output of a vision-language model into robot action predictions through a dedicated action module.
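
The modular layout can be pictured as three stages wired in sequence: a vision module turns the camera image into visual tokens, the VLM backbone fuses them with the tokenized instruction into a condensed "cognition" feature, and a separate action module decodes that feature into an action chunk. The sketch below is a schematic illustration only, not CogACT's implementation; every class name is a hypothetical stand-in, and the real action module is a diffusion model rather than the plain MLP used here.

```python
# Schematic sketch of CogACT's modular layout; all names are hypothetical stand-ins.
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Vision module stand-in: camera image -> sequence of visual tokens."""
    def __init__(self, dim=512):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # ViT-style patch embedding

    def forward(self, image):                        # image: (B, 3, 128, 128)
        feats = self.patchify(image)                 # (B, dim, 8, 8)
        return feats.flatten(2).transpose(1, 2)      # (B, 64, dim)


class ToyLanguageBackbone(nn.Module):
    """VLM backbone stand-in: fuses visual tokens and the instruction into one cognition feature."""
    def __init__(self, vocab_size=1000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vis_tokens, text_ids):
        tokens = torch.cat([vis_tokens, self.embed(text_ids)], dim=1)
        return self.encoder(tokens)[:, -1]           # (B, dim) condensed "cognition" feature


class ToyActionModule(nn.Module):
    """Action module stand-in: cognition feature -> chunk of 16 future 7-DoF actions.
    (In CogACT itself this is a diffusion model, not a plain MLP.)"""
    def __init__(self, dim=512, horizon=16, action_dim=7):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, horizon * action_dim))

    def forward(self, cognition):
        return self.net(cognition).view(-1, self.horizon, self.action_dim)   # (B, 16, 7)


# Wire the three modules together exactly as the overview describes.
vision, language, action = ToyVisionEncoder(), ToyLanguageBackbone(), ToyActionModule()
image = torch.randn(1, 3, 128, 128)            # placeholder camera frame
text_ids = torch.randint(0, 1000, (1, 12))     # placeholder tokenized instruction
action_chunk = action(language(vision(image), text_ids))
print(action_chunk.shape)                      # torch.Size([1, 16, 7])
```

The point of the separation is that the VLM stays a general-purpose visual-language reasoner while the action module specializes in continuous control, which is what distinguishes this design from directly modifying a VLM to emit actions.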

Model Features

Modular Architecture
Employs separate vision, language, and action modules instead of directly modifying VLMs for action prediction.
Multimodal Fusion
Integrates visual and linguistic inputs to predict robot actions.
Zero-shot Transfer Capability
Can be applied zero-shot to robot embodiments covered by the Open X-Embodiment pretraining mixture (see the loading sketch after this list).
Rapid Adaptation to New Tasks
Can be fine-tuned to new tasks and robot setups with a small number of demonstrations.
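
For concreteness, the public CogACT codebase exposes a load_vla helper and a predict_action method for this kind of zero-shot use. The snippet below follows that pattern, but the import path, keyword arguments (load_for_training, unnorm_key), and the example Open-X dataset key are assumptions to be checked against the repository's README; 'CogACT/CogACT-Small' is the Hugging Face id of the checkpoint described on this page.

```python
# Usage sketch following the public CogACT repository's inference pattern.
# Import path, keyword arguments, and the Open-X dataset key are assumptions;
# verify them against the repository's README before running.
import torch
from PIL import Image
from vla import load_vla   # helper provided by the CogACT codebase (assumed import path)

model = load_vla("CogACT/CogACT-Small", load_for_training=False)  # inference only (assumed flag)
model.to("cuda:0").eval()

image = Image.open("observation.png")             # current camera frame
instruction = "move the sponge near the apple"    # free-form language command

with torch.no_grad():
    # unnorm_key selects per-robot action statistics from the Open X-Embodiment
    # pretraining mixture; this is what allows zero-shot use on those embodiments.
    actions, _ = model.predict_action(
        image,
        instruction,
        unnorm_key="fractal20220817_data",        # example Open-X sub-dataset key (assumed)
    )

print(actions.shape)   # expected: a chunk of 16 future 7-DoF actions, e.g. (16, 7)
```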

Model Capabilities

Vision-language understanding
Robot action prediction
Multimodal information processing
Zero-shot task execution

Use Cases

Robot Manipulation
Object Grasping and Placement
Predicts action sequences for grasping and placing objects from a language instruction and visual input.
Generates chunks of 16 normalized robot actions, each with 7 degrees of freedom (end-effector translation, rotation, and gripper state).
Task-Oriented Manipulation
Executes specific task instructions such as 'move the sponge near the apple'.
Predicts precise action trajectories with a diffusion-based action module; a toy denoising sketch follows below.
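
The "diffusion" part means the 16-step action chunk is produced by iterative denoising rather than a single regression pass: sampling starts from Gaussian noise of shape (16, 7), and a learned denoiser, conditioned on the VLM's cognition feature, refines it over many steps. The toy sketch below illustrates that loop in plain DDPM form; the network, noise schedule, and conditioning are placeholders, not CogACT's actual diffusion transformer.

```python
# Toy illustration of diffusion-style action sampling (not CogACT's real diffusion module).
import torch
import torch.nn as nn

HORIZON, ACTION_DIM, STEPS = 16, 7, 50   # 16 future actions, 7 DoF each, 50 denoising steps


class ToyDenoiser(nn.Module):
    """Predicts the noise in a noisy action chunk, conditioned on the cognition feature."""
    def __init__(self, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HORIZON * ACTION_DIM + cond_dim + 1, 256), nn.GELU(),
            nn.Linear(256, HORIZON * ACTION_DIM),
        )

    def forward(self, noisy_actions, cond, t):
        x = torch.cat([noisy_actions.flatten(1), cond, t], dim=1)
        return self.net(x).view(-1, HORIZON, ACTION_DIM)


@torch.no_grad()
def sample_actions(denoiser, cond):
    """Start from Gaussian noise and iteratively denoise into a (16, 7) action chunk."""
    betas = torch.linspace(1e-4, 0.02, STEPS)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(cond.size(0), HORIZON, ACTION_DIM)       # pure noise
    for step in reversed(range(STEPS)):
        t = torch.full((cond.size(0), 1), step / STEPS)
        eps = denoiser(x, cond, t)                            # predicted noise
        # Standard DDPM posterior-mean update.
        x = (x - betas[step] / torch.sqrt(1 - alpha_bars[step]) * eps) / torch.sqrt(alphas[step])
        if step > 0:
            x = x + torch.sqrt(betas[step]) * torch.randn_like(x)
    return x   # normalized actions; un-normalize with per-robot statistics before execution


cond = torch.randn(1, 512)               # placeholder cognition feature from the VLM
actions = sample_actions(ToyDenoiser(), cond)
print(actions.shape)                     # torch.Size([1, 16, 7])
```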