Octo-small-1.5 Open-source Robot Control Model - Predicting Actions Based on Visual and Language Instructions

Octo Small 1.5

Developed by rail-berkeley

Octo Small is a Transformer-based diffusion policy for robot control that predicts robot actions from visual inputs and language instructions.

Multimodal Fusion

Transformers

Open Source License: MIT
Tags: multi-view robot control, diffusion policy prediction, lightweight Transformer


Release Time: 5/21/2024

Model Overview

This model is a 27-million-parameter Transformer designed for robot control tasks. It takes visual inputs (primary camera and wrist camera images) and language instructions, and predicts 7-dimensional actions 4 steps into the future. It is trained as a diffusion policy with a window size of 2.

Model Features

Multimodal input processing

Capable of processing both visual inputs (camera images) and language instructions

Diffusion policy

Trained as a diffusion policy to predict 4-step sequences of 7-dimensional actions

Lightweight architecture

27-million-parameter Transformer architecture suitable for real-time robot control

Extensive dataset training

Trained on a mixture of 25 robot datasets from the Open X-Embodiment dataset

Model Capabilities

Vision-language multimodal processing

Robot action prediction

Real-time control

Multi-task learning

Use Cases

Robot control

Vision-based object grasping

Controls a robot to grasp specific objects based on camera input and language instructions

Tabletop manipulation tasks

Performs manipulation tasks in tabletop environments, such as pushing, pulling, and rotating objects

Industrial automation

Assembly line operations

Performs precise assembly tasks in industrial environments

🚀 Octo Small

Octo Small is a robot control model. It is trained with a window size of 2 and uses a diffusion policy to predict 7-dimensional actions 4 steps into the future. Observations are camera images, and tasks can be specified as images or as language instructions.

🚀 Quick Start

See https://github.com/octo-models/octo for instructions on using this model.

✨ Features

Octo Small is trained with a window size of 2, predicting 7-dimensional actions 4 steps into the future using a diffusion policy.
The model is a Transformer with 27M parameters (equivalent to a ViT-S).
Images are tokenized by preprocessing with a lightweight convolutional encoder, then grouped into 16x16 patches.
Language is tokenized with the T5 tokenizer, and the tokens are then passed through the T5-Base language encoder.
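As a rough illustration of the patching step, the sketch below counts image tokens under the assumption that each token corresponds to one non-overlapping 16x16 patch at the input resolutions given in the observation spec later in this document (256x256 for the primary camera, 128x128 for the wrist camera). The function name `num_patch_tokens` is hypothetical and is not part of the Octo codebase; the actual encoder may produce a different token count.

```python
PATCH = 16  # patch side length in pixels, from the feature list above


def num_patch_tokens(height, width, patch=PATCH):
    """Count non-overlapping patch x patch tiles in a height x width image.

    Assumption: one token per tile, and the image divides evenly into tiles.
    """
    if height % patch or width % patch:
        raise ValueError(f"{height}x{width} is not divisible by patch size {patch}")
    return (height // patch) * (width // patch)


# Per-image, per-timestep token counts under the assumption stated above.
primary_tokens = num_patch_tokens(256, 256)  # 16 tiles per side
wrist_tokens = num_patch_tokens(128, 128)    # 8 tiles per side
```

Under this assumption the wrist camera contributes a quarter as many tokens as the primary camera, since it has half the resolution along each axis.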

🔧 Technical Details

Observations and Tasks Specification

Observations and tasks conform to the following spec:

Observations

{
    image_primary: ('batch', 'history_window', 256, 256, 3),
    image_wrist: ('batch', 'history_window', 128, 128, 3),
}

Tasks

{
    image_primary: ('batch', 256, 256, 3),
    image_wrist: ('batch', 128, 128, 3),
    language_instruction: {
        attention_mask: ('batch', 16),
        input_ids: ('batch', 16),
    },
}

At inference, you may pass in any subset of these observation and task keys, with a history window up to 2 timesteps.
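The spec above can be checked before calling the model. The following is a minimal sketch in plain Python, not the Octo API: `check_observation` and `pad_instruction` are hypothetical helpers, shapes are passed as plain tuples rather than arrays, and the pad token id of 0 is an assumption based on the standard T5 vocabulary. See the repository linked in Quick Start for the actual loading and inference code.

```python
# Trailing dimensions (height, width, channels) per key, copied from the spec above.
IMAGE_SPEC = {"image_primary": (256, 256, 3), "image_wrist": (128, 128, 3)}
MAX_HISTORY = 2   # longest history window accepted at inference
MAX_TOKENS = 16   # language_instruction length from the task spec
PAD_ID = 0        # assumption: pad token id of the standard T5 vocabulary


def check_observation(shapes):
    """Check a dict of {observation key: full array shape} against the spec.

    Any subset of keys is allowed. Each shape must be
    (batch, history_window, height, width, 3) with 1 <= history_window <= 2.
    """
    for key, shape in shapes.items():
        if key not in IMAGE_SPEC:
            raise KeyError(f"unknown observation key: {key!r}")
        if len(shape) != 5:
            raise ValueError(f"{key}: expected 5 dimensions, got {len(shape)}")
        _batch, history, *image_dims = shape
        if not 1 <= history <= MAX_HISTORY:
            raise ValueError(f"{key}: history window {history} is outside 1..{MAX_HISTORY}")
        if tuple(image_dims) != IMAGE_SPEC[key]:
            raise ValueError(f"{key}: expected image dims {IMAGE_SPEC[key]}, got {tuple(image_dims)}")
    return True


def pad_instruction(token_ids):
    """Truncate or pad one list of token ids to MAX_TOKENS and build its mask."""
    ids = list(token_ids)[:MAX_TOKENS]
    padding = MAX_TOKENS - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


# Example: a primary-camera-only observation with a 2-step history,
# and an instruction of three arbitrary placeholder token ids.
check_observation({"image_primary": (1, 2, 256, 256, 3)})
task = pad_instruction([100, 200, 1])
```

The attention mask marks real tokens with 1 and padding with 0, matching the `('batch', 16)` shape of both `input_ids` and `attention_mask` once a batch axis is added.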

Training Datasets

This model was trained on a mix of datasets from the Open X-Embodiment dataset.

Dataset	Proportion of batch
Fractal (Brohan et al., 2022)	17.0%
Kuka (Kalashnikov et al., 2018)	17.0%
Bridge (Walke et al., 2023)	17.0%
BC-Z (Jang et al., 2022)	9.1%
Stanford Hydra Dataset (Belkhale et al., 2023)	6.0%
Language Table (Lynch et al., 2023)	5.9%
Taco Play (Rosete-Beas et al., 2022; Mees et al., 2023)	3.6%
Furniture Bench Dataset (Heo et al., 2023)	3.3%
UTAustin Mutex (Shah et al., 2023)	3.0%
Austin Sailor Dataset (Nasiriany et al., 2022)	2.9%
Roboturk (Mandlekar et al., 2018)	2.8%
Toto (Zhou et al., 2023)	2.4%
Austin Sirius Dataset (Liu et al., 2023)	2.3%
Berkeley Autolab UR5 (Chen et al.)	1.5%
IAMLab CMU Pickup Insert (Saxena et al., 2023)	1.2%
Viola (Zhu et al., 2023)	1.2%
Berkeley Fanuc Manipulation (Zhu et al., 2023)	1.0%
NYU Franka Play Dataset (Cui et al., 2022)	0.9%
UCSD Kitchen Dataset (Ge Yan and Wang, 2023)	<0.1%
Jaco Play (Dass et al., 2023)	0.6%
Berkeley Cable Routing (Luo et al., 2023)	0.3%
Austin Buds Dataset (Zhu et al., 2022)	0.3%
CMU Stretch (Mendonca et al., 2023)	0.2%
NYU Door Opening (Pari et al., 2021)	0.1%
DLR EDAN Shared Control (Quere et al., 2020)	0.1%
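To make "proportion of batch" concrete, the sketch below draws the source dataset for each example in a batch with probability proportional to its table weight. This only illustrates weighted mixing and is not the Octo data loader, which this page does not describe. Only the six largest rows of the table are included, so the weights are implicitly renormalized over that subset; `MIX`, `sample_sources`, and `expected_counts` are hypothetical names.

```python
import random

# Six largest rows of the table above, in percent of batch.
# The remaining rows are omitted here to keep the sketch short.
MIX = {
    "Fractal": 17.0,
    "Kuka": 17.0,
    "Bridge": 17.0,
    "BC-Z": 9.1,
    "Stanford Hydra Dataset": 6.0,
    "Language Table": 5.9,
}


def sample_sources(batch_size, mix=MIX, seed=None):
    """Pick a source dataset for each batch slot, weighted by its proportion."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


def expected_counts(batch_size, mix=MIX):
    """Expected number of examples per dataset after renormalizing the weights."""
    total = sum(mix.values())
    return {name: batch_size * weight / total for name, weight in mix.items()}


batch_sources = sample_sources(256, seed=0)
```

Because `random.choices` normalizes the weights internally, the percentages can be passed as they appear in the table without dividing by their sum first.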

📚 Documentation

Updates for Version 1.5

Language task tokens are now repeated at every timestep in the context window.
Augmented the language instructions in the data with rephrasings from GPT-3.5.
Bug fixes:
- Turned off dropout in the diffusion head due to incompatibility with layer norm.
- Fixed an off-by-one error with the attention mask.
- Fixed an issue where different image augmentations did not get fresh random seeds.
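The last fix concerns a general class of bug that is easy to reproduce: when several augmentations are seeded identically, they draw identical random values. The sketch below illustrates that class of bug with Python's standard `random` module. It is an assumption-laden illustration, not the Octo code or its actual fix, and `derive_seeds` is a hypothetical helper.

```python
import random


def derive_seeds(base_seed, names):
    """Derive a separate seed for each named augmentation from one base seed."""
    rng = random.Random(base_seed)
    return {name: rng.getrandbits(32) for name in names}


cameras = ("primary", "wrist")

# Buggy pattern: every augmentation is seeded with the same value,
# so both cameras receive exactly the same random draw.
shared = [random.Random(0).random() for _ in cameras]

# Fixed pattern: each augmentation gets its own seed derived from the base seed,
# so the draws are reproducible but no longer tied to each other.
seeds = derive_seeds(0, cameras)
fresh = [random.Random(seeds[name]).random() for name in cameras]
```

Deriving the per-augmentation seeds from a single base seed keeps the whole pipeline reproducible while still decorrelating the individual augmentations.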

📄 License

This project is licensed under the MIT license.
