Octo Base 1.5

Developed by rail-berkeley
Octo is a multimodal foundation model for robotics that predicts robot actions from visual and language inputs.
Release Time: 5/21/2024

Model Overview

Octo Base is a Transformer-based model that combines visual and language inputs and is designed specifically for robot control tasks. It takes images from both the main camera and the wrist camera and, together with a language instruction, predicts future actions.

Model Features

Multimodal input processing
Processes visual input (from two cameras) and language input at the same time
Diffusion policy prediction
Uses a diffusion policy to predict 7-dimensional actions for the next 4 steps
Flexible input support
At inference time, any subset of the observation and task keys can be passed in
Large-scale training data
Trained on 25 robot datasets from the Open X-Embodiment dataset
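
The flexible-input and action-prediction features above can be illustrated with a small, library-free sketch. It does not call the Octo library. The key names (image_primary, image_wrist, language_instruction) and the helpers build_inputs and is_valid_action_chunk are assumptions made for illustration and are not confirmed by this page. The sketch only shows the idea of passing a subset of observation and task keys, and the shape of a 4-step, 7-dimensional action prediction.

```python
# Hypothetical key names; this page does not list the real ones.
OBSERVATION_KEYS = {"image_primary", "image_wrist"}
TASK_KEYS = {"language_instruction"}

ACTION_HORIZON = 4  # future steps predicted per call (from the feature list)
ACTION_DIM = 7      # dimensions per action (from the feature list)


def build_inputs(observation, task):
    """Accept any subset of the known observation and task keys.

    Unknown keys raise a ValueError so typos are caught early.
    """
    unknown_obs = set(observation) - OBSERVATION_KEYS
    unknown_task = set(task) - TASK_KEYS
    if unknown_obs or unknown_task:
        raise ValueError(f"unknown keys: {sorted(unknown_obs | unknown_task)}")
    return {"observation": dict(observation), "task": dict(task)}


def is_valid_action_chunk(chunk):
    """Check that a prediction has ACTION_HORIZON rows of ACTION_DIM values."""
    return len(chunk) == ACTION_HORIZON and all(
        len(action) == ACTION_DIM for action in chunk
    )


# Wrist camera omitted: only a main-camera frame and an instruction are passed.
inputs = build_inputs(
    observation={"image_primary": "<main camera frame>"},
    task={"language_instruction": "pick up the spoon"},
)

# Placeholder standing in for a model prediction (all zeros, not real output).
placeholder_chunk = [[0.0] * ACTION_DIM for _ in range(ACTION_HORIZON)]
```

In a real setup the placeholder strings and zero-filled actions would be replaced by camera images and by the model's actual predictions.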

Model Capabilities

Visual information processing
Language instruction understanding
Robot action prediction
Multimodal data fusion

Use Cases

Robot control
Vision-based object manipulation
Performs grasping, placing and other operations based on camera input and language instructions
Task-oriented action sequence generation
Generates action sequences required to complete specific tasks based on language descriptions
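
One common way to turn short multi-step predictions into a longer action sequence is to execute a predicted chunk and then predict again from the new observation. The sketch below illustrates that loop under stated assumptions: predict_chunk is a stand-in that returns zero-filled actions instead of calling the model, and run_episode and its parameters are hypothetical names. It is not a description of how Octo itself is deployed.

```python
ACTION_HORIZON = 4  # steps predicted per call
ACTION_DIM = 7      # dimensions per action


def predict_chunk(observation, instruction):
    """Stand-in for the model: returns 4 placeholder 7-dimensional actions."""
    return [[0.0] * ACTION_DIM for _ in range(ACTION_HORIZON)]


def run_episode(instruction, total_steps, steps_per_chunk=ACTION_HORIZON):
    """Execute actions chunk by chunk, re-predicting after each chunk."""
    if not 1 <= steps_per_chunk <= ACTION_HORIZON:
        raise ValueError("steps_per_chunk must be between 1 and ACTION_HORIZON")
    executed = []
    observation = {"step": 0}  # placeholder for camera images
    while len(executed) < total_steps:
        chunk = predict_chunk(observation, instruction)
        remaining = total_steps - len(executed)
        for action in chunk[: min(steps_per_chunk, remaining)]:
            executed.append(action)  # a real system would send this to the robot
        observation = {"step": len(executed)}
    return executed


actions = run_episode("place the cup on the plate", total_steps=10)
```

Setting steps_per_chunk below 4 makes the loop re-predict more often, which trades extra model calls for fresher observations.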