RDT-170M
RDT-170M is a 170M-parameter imitation-learning Diffusion Transformer (referred to as RDT (small) in the ablation study). It has a hidden size of 1024 and a depth of 14, half of those in RDT-1B. Given a language instruction and RGB images from up to three views, RDT predicts the next 64 robot actions. It is compatible with almost all modern mobile manipulators, covering single-arm to dual-arm, joint to EEF, position to velocity, and even wheeled locomotion.
All the code, pre-trained model weights, and data are licensed under the MIT license.
For more information, please refer to our project page and paper.
Features
- Versatile Compatibility: Compatible with a wide range of modern mobile manipulators, including single-arm, dual-arm, joint-based, EEF-based, position-controlled, velocity-controlled, and wheeled-locomotion robots.
- Multi-Modal Input: Accepts language instructions and RGB images from up to three views to predict robot actions.
- Open-Source and Licensed: All code, pre-trained model weights, and data are available under the MIT license.
Documentation
Model Details
| Property | Details |
|---|---|
| Developed by | The RDT team, consisting of researchers from the TSAIL group at Tsinghua University |
| Task Type | Vision-Language-Action (language, image => robot actions) |
| Model Type | Diffusion Policy with Transformers |
| License | MIT |
| Language(s) (NLP) | en |
| Multi-Modal Encoders | Vision Backbone: [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384); Language Model: [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) |
| Pre-Training Datasets | 46 datasets including [RT-1 Dataset](https://robotics-transformer1.github.io/), RH20T, [DROID](https://droid-dataset.github.io/), [BridgeData V2](https://rail-berkeley.github.io/bridgedata/), RoboSet, and a subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/). See [this link](https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md#download-and-prepare-datasets) for a detailed list. |
| Repository | https://github.com/thu-ml/RoboticsDiffusionTransformer |
| Paper | https://arxiv.org/pdf/2410.07864 |
| Project Page | https://rdt-robotics.github.io/rdt-robotics/ |
Uses
RDT takes language instructions, RGB images (up to three views), control frequency (if any), and proprioception as input and predicts the next 64 robot actions. It supports the control of almost all robot manipulators through a unified action space, which includes all the main physical quantities of the robot manipulator (e.g., end-effector and joint, position and velocity, and wheeled locomotion). To deploy it on your robot platform, you need to fill the relevant quantities of the raw action vector into the unified space vector, as illustrated in the sketch below. For more information, see [our repository](https://github.com/thu-ml/RoboticsDiffusionTransformer).
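As an illustration of what filling the unified space vector looks like in practice, here is a minimal sketch that pads raw joint positions into a fixed-size state vector. The dimensionality and index layout used here (`UNIFIED_DIM`, `RIGHT_ARM_JOINT_IDX`, `RIGHT_GRIPPER_IDX`) are hypothetical placeholders; the actual unified-space definition lives in the repository and should be used instead.

```python
import numpy as np

# Hypothetical layout: the real unified-space dimensionality and index
# mapping are defined in the RDT repository; these names are placeholders.
UNIFIED_DIM = 128
RIGHT_ARM_JOINT_IDX = list(range(0, 6))   # slots reserved for right-arm joint positions
RIGHT_GRIPPER_IDX = 6                     # slot reserved for the right gripper opening

def fill_unified_vector(joint_pos, gripper_open):
    """Place raw robot quantities into their reserved slots; unused slots stay zero-padded."""
    state = np.zeros(UNIFIED_DIM, dtype=np.float32)
    state[RIGHT_ARM_JOINT_IDX] = joint_pos    # 6 joint positions (rad)
    state[RIGHT_GRIPPER_IDX] = gripper_open   # normalized gripper opening in [0, 1]
    return state

# Example: a single-arm, joint-position-controlled robot
state_vec = fill_unified_vector(joint_pos=np.zeros(6), gripper_open=1.0)
```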
Important Note
Due to the embodiment gap, RDT cannot yet generalize to new robot platforms (not seen in the pre-training datasets). In this case, we recommend collecting a small dataset of the target robot and then using it to fine-tune RDT. See [our repository](https://github.com/thu-ml/RoboticsDiffusionTransformer) for a tutorial.
Usage Examples
Basic Usage
```python
import torch
from PIL import Image
from typing import List

from scripts.agilex_model import create_model

# Names of the cameras used as visual input (up to three views).
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']

config = {
    'episode_len': 1000,   # maximum episode length
    'state_dim': 14,       # dimension of the proprioceptive state
    'chunk_size': 64,      # number of future actions predicted per inference
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"

model = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    pretrained='robotics-diffusion-transformer/rdt-170m',
    control_frequency=25,
)

# Load pre-computed language embeddings (see the repository for how to encode instructions).
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']

images: List[Image.Image] = ...  # RGB images from the cameras listed above
proprio = ...                    # current proprioceptive state (shape: [state_dim])

# Predict the next `chunk_size` actions.
actions = model.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding
)
```
License
All the code, pre-trained model weights, and data of RDT-170M are licensed under the MIT license.
Citation
If you find our work helpful, please cite us:
```bibtex
@article{liu2024rdt,
  title={RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation},
  author={Liu, Songming and Wu, Lingxuan and Li, Bangguo and Tan, Hengkai and Chen, Huayu and Wang, Zhengyi and Xu, Ke and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2410.07864},
  year={2024}
}
```
Thank you!