Open-source RDT-1B model: pretrained on 1M+ multi-robot episodes, supporting multi-view vision-language-action prediction.

RDT-1B

Developed by robotics-diffusion-transformer

A 1-billion-parameter imitation-learning Diffusion Transformer pretrained on 1M+ multi-robot episodes, supporting multi-view vision-language-action prediction

Multimodal Fusion

Transformers

Language: English
Open Source License: MIT
Tags: #Multimodal Robot Control #Diffusion Transformer #Multi-view Visual Instruction

Downloads 2,644

Release Time: 8/27/2024

Model Overview

This model predicts the next 64 robot actions from a language instruction and multi-view RGB images, and is compatible with a wide range of modern mobile manipulators

Model Features

Multimodal Input Support

Processes a language instruction together with RGB images from up to three views

Universal Robot Compatibility

Supports various robotic platforms including single/dual arms, joint/end-effector space, position/velocity control

Large-scale Pretraining

Pretrained on 1M+ multi-robot episodes drawn from 46 datasets

Long-sequence Action Prediction

Predicts the next 64 consecutive robot actions in a single inference step

Model Capabilities

Vision-language understanding

Robot action sequence prediction

Multi-view image processing

Cross-platform robot control

Use Cases

Industrial Automation

Assembly Line Operation

Complete part grasping and assembly tasks based on language instructions

Achieves precise continuous motion control

Service Robots

Home Organization

Identify and organize household items based on language instructions

Completes complex multi-step operation sequences

🚀 RDT-1B

RDT-1B is a 1B-parameter imitation-learning Diffusion Transformer pre-trained on 1M+ multi-robot episodes. Given a language instruction and RGB images from up to three views, it predicts the next 64 robot actions. RDT is compatible with almost all modern mobile manipulators, covering single-arm and dual-arm setups, joint and end-effector (EEF) spaces, position and velocity control, and even wheeled locomotion.

All the code, pre-trained model weights, and data are licensed under the MIT license.

Please refer to our project page and paper for more information.

🚀 Quick Start

RDT-1B is a powerful model for robotics. It predicts robot actions from language instructions and RGB images. To get started, you can access the code, pre-trained model weights, and data from the provided links.

✨ Features

Powerful Prediction: Given a language instruction and RGB images from up to three views, RDT predicts the next 64 robot actions.
Wide Compatibility: Compatible with almost all modern mobile manipulators, including single-arm and dual-arm setups, joint and EEF spaces, position and velocity control, and wheeled locomotion.

📚 Documentation

Model Details

Property	Details
Developed by	The RDT team consisting of researchers from the TSAIL group at Tsinghua University
Task Type	Vision-Language-Action (language, image => robot actions)
Model Type	Diffusion Policy with Transformers
License	MIT
Language(s) (NLP)	en
Vision Backbone	[siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
Language Model	[t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl)
Pre-Training Datasets	46 datasets including [RT-1 Dataset](https://robotics-transformer1.github.io/), RH20T, [DROID](https://droid-dataset.github.io/), [BridgeData V2](https://rail-berkeley.github.io/bridgedata/), RoboSet, and a subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/). See [this link](https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md#download-and-prepare-datasets) for a detailed list.
Repository	https://github.com/thu-ml/RoboticsDiffusionTransformer
Paper	https://arxiv.org/pdf/2410.07864
Project Page	https://rdt-robotics.github.io/rdt-robotics/

Uses

RDT takes a language instruction, RGB images (from up to three views), the control frequency (if any), and proprioception as input, and predicts the next 64 robot actions. It supports the control of almost all robot manipulators through a unified action space that includes all the main physical quantities of a robot manipulator. To deploy on your robot platform, you need to fill the relevant quantities of the raw action vector into the unified space vector. See [our repository](https://github.com/thu-ml/RoboticsDiffusionTransformer) for more information.
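As a rough illustration of filling a raw action into the unified space, the sketch below scatters a 14-dimensional dual-arm action into a fixed-length vector and records which slots were filled. The vector length (`UNIFIED_DIM = 128`), the slot names, the index layout in `HYPOTHETICAL_INDEX`, and the raw ordering in `RAW_LAYOUT` are all assumptions made for illustration; the actual mapping is defined in the repository.

```python
from typing import Dict, List, Sequence, Tuple

# ASSUMPTION: length of the unified vector; check the repository for the real value.
UNIFIED_DIM = 128

# ASSUMPTION: hypothetical slot names and indices, not the repository's mapping.
HYPOTHETICAL_INDEX: Dict[str, int] = {}
for i in range(6):
    HYPOTHETICAL_INDEX[f"right_arm_joint_{i}_pos"] = i
    HYPOTHETICAL_INDEX[f"left_arm_joint_{i}_pos"] = 50 + i
HYPOTHETICAL_INDEX["right_gripper_open"] = 10
HYPOTHETICAL_INDEX["left_gripper_open"] = 60

# ASSUMPTION: order of quantities in the robot's raw 14-dim action.
RAW_LAYOUT: List[str] = (
    [f"left_arm_joint_{i}_pos" for i in range(6)] + ["left_gripper_open"]
    + [f"right_arm_joint_{i}_pos" for i in range(6)] + ["right_gripper_open"]
)


def fill_unified_vector(
    raw: Sequence[float], layout: Sequence[str] = RAW_LAYOUT
) -> Tuple[List[float], List[float]]:
    """Scatter a raw action into the unified vector; return (vector, mask)."""
    if len(raw) != len(layout):
        raise ValueError(f"expected {len(layout)} values, got {len(raw)}")
    vector = [0.0] * UNIFIED_DIM
    mask = [0.0] * UNIFIED_DIM  # 1.0 marks slots this robot actually provides
    for value, name in zip(raw, layout):
        index = HYPOTHETICAL_INDEX[name]
        vector[index] = float(value)
        mask[index] = 1.0
    return vector, mask


raw_action = [0.1 * k for k in range(14)]
unified, mask = fill_unified_vector(raw_action)
```

Slots the robot does not provide stay at zero, and the mask lets downstream code tell a real zero from an unfilled slot.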

⚠️ Important Note

Due to the embodiment gap, RDT cannot yet generalize to new robot platforms (not seen in the pre-training datasets). In this case, we recommend collecting a small dataset of the target robot and then using it to fine-tune RDT. See [our repository](https://github.com/thu-ml/RoboticsDiffusionTransformer) for a tutorial.

💻 Usage Examples

Basic Usage

# First clone the repository and install its dependencies,
# then switch to the repository root: cd RoboticsDiffusionTransformer

from typing import List

import torch
from PIL import Image

# Import the model factory from the code base
from scripts.agilex_model import create_model

# Names of cameras used for visual input
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']
config = {
    'episode_len': 1000,  # Max length of one episode
    'state_dim': 14,      # Dimension of the robot's state
    'chunk_size': 64,     # Number of actions to predict in one step
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"
# Create the model with the specified configuration
model = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    pretrained='robotics-diffusion-transformer/rdt-1b',
    control_frequency=25,
)

# Start inference process
# Load the pre-computed language embeddings
# Refer to scripts/encode_lang.py for how to encode the language instruction
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']
images: List[Image.Image] = ...  # The images from the last 2 frames
proprio = ...  # The current robot state
# Perform inference to predict the next `chunk_size` actions
actions = model.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding,
)
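
Each call returns a chunk of 64 actions, and the example sets `control_frequency=25`, so one chunk covers 64 / 25 = 2.56 seconds of motion. A minimal sketch of replaying such a chunk at a fixed rate follows. `chunk_duration`, `execute_chunk`, the `send_action` callback, and the `max_steps` option are hypothetical helpers, not part of the repository, and the chunk is assumed to have already been converted to a plain sequence of per-step action vectors.

```python
import time
from typing import Callable, Optional, Sequence


def chunk_duration(chunk_size: int, control_frequency: float) -> float:
    """Seconds of motion covered by one predicted chunk."""
    return chunk_size / control_frequency


def execute_chunk(
    actions: Sequence[Sequence[float]],
    send_action: Callable[[Sequence[float]], None],
    control_frequency: float = 25.0,
    max_steps: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send actions one by one at the control rate; return how many were sent.

    `send_action` is a hypothetical callback that forwards one action to the
    robot. `max_steps` optionally truncates the chunk so the caller can
    re-plan before the whole chunk has been executed.
    """
    if control_frequency <= 0:
        raise ValueError("control_frequency must be positive")
    period = 1.0 / control_frequency
    steps = actions if max_steps is None else actions[:max_steps]
    for action in steps:
        send_action(action)
        sleep(period)  # naive pacing; a real loop would compensate for drift
    return len(steps)


# Dry run with a dummy chunk and no real sleeping
dummy_chunk = [[0.0] * 14 for _ in range(64)]
sent = []
executed = execute_chunk(dummy_chunk, sent.append, sleep=lambda _: None)
```

Whether to run the full chunk or only its first part before querying the model again is a deployment choice; `max_steps` is included only to make that trade-off explicit.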

📄 License

All the code, pre-trained model weights, and data are licensed under the MIT license.

📖 Citation

If you find our work helpful, please cite us:

@article{liu2024rdt,
  title={RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation},
  author={Liu, Songming and Wu, Lingxuan and Li, Bangguo and Tan, Hengkai and Chen, Huayu and Wang, Zhengyi and Xu, Ke and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2410.07864},
  year={2024}
}

Thank you!
