### Octo-base-1.5 Open-source Robot Multi-modal Model - Predict Actions Based on Visual and Linguistic Inputs

Octo Base 1.5

Developed by rail-berkeley

Octo is a multimodal foundation model for robotics, capable of predicting robot actions through visual and language inputs.

Multimodal Fusion

Transformers

Open Source License: MIT #Multi-camera robot control #Diffusion policy action prediction #Language instruction driven

Release Time: 5/21/2024

Model Overview

The Octo Base model is a Transformer that combines visual and language inputs for robot control. It processes images from a primary camera and a wrist camera and, together with a language instruction, predicts future actions.

Model Features

Multimodal input processing

Processes visual input (primary and wrist cameras) and language input together

Diffusion policy prediction

Uses a diffusion policy to predict 7-dimensional actions 4 steps into the future

Flexible input support

During inference, any subset of observations and task keys can be passed in

Large-scale training data

Trained on 25 different robot datasets from the Open X-Embodiment dataset

Model Capabilities

Visual information processing

Language instruction understanding

Robot action prediction

Multimodal data fusion

Use Cases

Robot control

Vision-based object manipulation

Performs grasping, placing and other operations based on camera input and language instructions

Task-oriented action sequence generation

Generates action sequences required to complete specific tasks based on language descriptions

🚀 Octo Base

Octo Base is a robotics model that predicts 7-dimensional actions 4 steps into the future using a diffusion policy. It is trained with a window size of 2 on a mix of datasets from the Open X-Embodiment dataset.
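
To make the phrase "diffusion policy" more concrete, the sketch below runs a generic DDPM-style reverse process that produces one action chunk of shape (4, 7). It is not Octo's diffusion head: the noise predictor `toy_noise_model`, the linear noise schedule, and the step count of 20 are all assumptions chosen for illustration. In a real policy the noise predictor would be a learned network conditioned on the Transformer's output.

```python
import numpy as np

ACTION_HORIZON = 4  # actions predicted per chunk (from the model card)
ACTION_DIM = 7      # dimensions per action (from the model card)


def toy_noise_model(noisy_actions, step):
    """Hypothetical stand-in for a learned noise predictor.

    A real policy would condition on observation and task embeddings.
    This toy returns its input, i.e. it treats the whole sample as noise.
    """
    return noisy_actions


def sample_action_chunk(noise_model, rng, num_steps=20):
    """Generic DDPM ancestral sampling over a (4, 7) action chunk."""
    betas = np.linspace(1e-4, 0.2, num_steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from pure Gaussian noise and denoise step by step.
    x = rng.standard_normal((ACTION_HORIZON, ACTION_DIM))
    for t in reversed(range(num_steps)):
        eps = noise_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
    return x


chunk = sample_action_chunk(toy_noise_model, np.random.default_rng(0))
print(chunk.shape)  # (4, 7)
```

Each row of `chunk` corresponds to one future timestep, so a controller could execute the first row and re-plan, or execute several rows before querying the model again.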

🚀 Quick Start

See https://github.com/octo-models/octo for instructions for using this model.

✨ Features

Octo Base is trained with a window size of 2, predicting 7-dimensional actions 4 steps into the future using a diffusion policy.
The model is a Transformer with 93M parameters (equivalent to a ViT-B).
Images are tokenized by preprocessing with a lightweight convolutional encoder, then grouped into 16x16 patches.
Language is tokenized by applying the T5 tokenizer, and then applying the T5-Base language encoder.
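
As a rough sanity check on sequence length, the arithmetic below counts image tokens per camera. It assumes that "16x16 patches" means patches of 16 by 16 pixels and uses the input resolutions from the observation spec (256x256 primary, 128x128 wrist). The helper `num_patches` is illustrative, and the count ignores any extra tokens the model may add.

```python
PATCH_SIZE = 16  # assumed: patch edge length in pixels


def num_patches(height, width, patch_size=PATCH_SIZE):
    """Number of non-overlapping patch tokens for one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch_size) * (width // patch_size)


primary_tokens = num_patches(256, 256)  # 16 * 16 patches
wrist_tokens = num_patches(128, 128)    # 8 * 8 patches
per_timestep = primary_tokens + wrist_tokens

print(primary_tokens, wrist_tokens, per_timestep)  # 256 64 320
```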

🔧 Technical Details

Observation and Task Specification

Observations and tasks conform to the following spec:

Observations:

{
    image_primary: ('batch', 'history_window', 256, 256, 3),
    image_wrist: ('batch', 'history_window', 128, 128, 3),
}

Tasks:

{
    image_primary: ('batch', 256, 256, 3),
    image_wrist: ('batch', 128, 128, 3),
    language_instruction: {
        attention_mask: ('batch', 16),
        input_ids: ('batch', 16),
    },
}
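
The spec can be made concrete with placeholder arrays. The sketch below builds zero-filled NumPy arrays with the listed shapes and checks the trailing image dimensions of whichever keys are present. The `uint8` image dtype, the `int32` token dtype, and the helper `check_trailing_shape` are assumptions for illustration rather than part of the Octo API.

```python
import numpy as np

BATCH, WINDOW = 1, 2

observation = {
    "image_primary": np.zeros((BATCH, WINDOW, 256, 256, 3), dtype=np.uint8),
    "image_wrist": np.zeros((BATCH, WINDOW, 128, 128, 3), dtype=np.uint8),
}

task = {
    "image_primary": np.zeros((BATCH, 256, 256, 3), dtype=np.uint8),
    "image_wrist": np.zeros((BATCH, 128, 128, 3), dtype=np.uint8),
    "language_instruction": {
        "attention_mask": np.zeros((BATCH, 16), dtype=np.int32),
        "input_ids": np.zeros((BATCH, 16), dtype=np.int32),
    },
}

# Expected (height, width, channels) per image key, taken from the spec.
EXPECTED_IMAGE_SHAPES = {
    "image_primary": (256, 256, 3),
    "image_wrist": (128, 128, 3),
}


def check_trailing_shape(arrays, expected):
    """Raise if any provided key has an unexpected (H, W, C) shape."""
    for key, array in arrays.items():
        if key not in expected:
            raise KeyError(f"unexpected key: {key}")
        if array.shape[-3:] != expected[key]:
            raise ValueError(f"{key}: got {array.shape[-3:]}, expected {expected[key]}")


check_trailing_shape(observation, EXPECTED_IMAGE_SHAPES)
# Only the keys that are present are checked, so a wrist-only dict also passes.
check_trailing_shape({"image_wrist": observation["image_wrist"]}, EXPECTED_IMAGE_SHAPES)
```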

At inference, you may pass in any subset of these observation and task keys, with a history window of up to 2 timesteps.
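
One way to supply a history window of up to 2 timesteps during a rollout is to keep a small rolling buffer of recent frames. The sketch below is a hypothetical helper (`HistoryBuffer` is not an Octo class) that stacks the latest frames into a `(1, history_window, H, W, 3)` array and, at the first step, repeats the only available frame as padding. The `pad_mask` that marks which slots hold real frames is an assumption about how padding might be signalled.

```python
from collections import deque

import numpy as np


class HistoryBuffer:
    """Hypothetical rolling window over the most recent camera frames."""

    def __init__(self, window=2):
        self.window = window
        self.frames = deque(maxlen=window)  # oldest frame is dropped automatically

    def add(self, frame):
        self.frames.append(frame)

    def stacked(self):
        """Return (frames, pad_mask), each with a leading batch dimension of 1."""
        if not self.frames:
            raise ValueError("add at least one frame first")
        num_pad = self.window - len(self.frames)
        # Pad on the left by repeating the oldest real frame.
        padded = [self.frames[0]] * num_pad + list(self.frames)
        pad_mask = np.array([False] * num_pad + [True] * len(self.frames))
        return np.stack(padded)[None], pad_mask[None]


buffer = HistoryBuffer(window=2)
buffer.add(np.zeros((256, 256, 3), dtype=np.uint8))
images, mask = buffer.stacked()
print(images.shape, mask.tolist())  # (1, 2, 256, 256, 3) [[False, True]]
```

After a second call to `add`, both slots hold real frames and the mask becomes all `True`.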

Training Datasets

This model was trained on a mix of datasets from the Open X-Embodiment dataset.

Dataset	Share of training mix
Fractal (Brohan et al., 2022)	17.0%
Kuka (Kalashnikov et al., 2018)	17.0%
Bridge (Walke et al., 2023)	17.0%
BC-Z (Jang et al., 2022)	9.1%
Stanford Hydra Dataset (Belkhale et al., 2023)	6.0%
Language Table (Lynch et al., 2023)	5.9%
Taco Play (Rosete-Beas et al., 2022; Mees et al., 2023)	3.6%
Furniture Bench Dataset (Heo et al., 2023)	3.3%
UTAustin Mutex (Shah et al., 2023)	3.0%
Austin Sailor Dataset (Nasiriany et al., 2022)	2.9%
Roboturk (Mandlekar et al., 2018)	2.8%
Toto (Zhou et al., 2023)	2.4%
Austin Sirius Dataset (Liu et al., 2023)	2.3%
Berkeley Autolab UR5 (Chen et al.)	1.5%
IAMLab CMU Pickup Insert (Saxena et al., 2023)	1.2%
Viola (Zhu et al., 2023)	1.2%
Berkeley Fanuc Manipulation (Zhu et al., 2023)	1.0%
NYU Franka Play Dataset (Cui et al., 2022)	0.9%
UCSD Kitchen Dataset (Ge Yan and Wang, 2023)	<0.1%
Jaco Play (Dass et al., 2023)	0.6%
Berkeley Cable Routing (Luo et al., 2023)	0.3%
Austin Buds Dataset (Zhu et al., 2022)	0.3%
CMU Stretch (Mendonca et al., 2023)	0.2%
NYU Door Opening (Pari et al., 2021)	0.1%
DLR EDAN Shared Control (Quere et al., 2020)	0.1%
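
If the percentages are read as per-example sampling weights (an assumption; the card only describes them as a mix), choosing the source dataset for each element of a training batch could look like the sketch below. The `<0.1%` entry is entered as its upper bound of 0.1, and `sample_sources` is an illustrative helper, not part of the Octo codebase.

```python
import random

# Percentages copied from the table above; "<0.1%" is entered as 0.1 (upper bound).
MIX = {
    "Fractal": 17.0,
    "Kuka": 17.0,
    "Bridge": 17.0,
    "BC-Z": 9.1,
    "Stanford Hydra Dataset": 6.0,
    "Language Table": 5.9,
    "Taco Play": 3.6,
    "Furniture Bench Dataset": 3.3,
    "UTAustin Mutex": 3.0,
    "Austin Sailor Dataset": 2.9,
    "Roboturk": 2.8,
    "Toto": 2.4,
    "Austin Sirius Dataset": 2.3,
    "Berkeley Autolab UR5": 1.5,
    "IAMLab CMU Pickup Insert": 1.2,
    "Viola": 1.2,
    "Berkeley Fanuc Manipulation": 1.0,
    "NYU Franka Play Dataset": 0.9,
    "UCSD Kitchen Dataset": 0.1,
    "Jaco Play": 0.6,
    "Berkeley Cable Routing": 0.3,
    "Austin Buds Dataset": 0.3,
    "CMU Stretch": 0.2,
    "NYU Door Opening": 0.1,
    "DLR EDAN Shared Control": 0.1,
}


def sample_sources(mix, batch_size, seed=0):
    """Draw one source dataset name per batch element, weighted by share."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    # random.choices normalizes the weights, so they need not sum to 100.
    return rng.choices(names, weights=weights, k=batch_size)


batch_sources = sample_sources(MIX, batch_size=8)
print(batch_sources)
```

Under this reading, the three largest datasets together account for roughly half of the sampled examples.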

📚 Documentation

Updates for Version 1.5

Language task tokens are now repeated at every timestep in the context window.
Augmented the language instructions in the data with rephrasings from GPT-3.5.
Bug fixes:
- Turned off dropout in the diffusion head due to incompatibility with layer norm.
- Fixed an off-by-one error with the attention mask.
- Fixed an issue where different image augmentations did not get fresh random seeds.

📄 License

This project is licensed under the MIT license.
