RDT-170M
RDT-170M is a 170M-parameter imitation-learning Diffusion Transformer (referred to as RDT (small) in the ablation study). It has a hidden size of 1024 and a depth of 14, half of those in RDT-1B. Given a language instruction and RGB images from up to three views, RDT predicts the next 64 robot actions. It is compatible with almost all modern mobile manipulators, covering single-arm to dual-arm, joint to EEF, position to velocity, and even wheeled locomotion.
All the code, pre-trained model weights, and data are licensed under the MIT license.
For more information, please refer to our project page and paper.
Features
- Versatile Compatibility: Compatible with a wide range of modern mobile manipulators, including single-arm, dual-arm, joint-based, EEF-based, position-controlled, velocity-controlled, and wheeled-locomotion robots.
- Multi-Modal Input: Accepts language instructions and RGB images from up to three views to predict robot actions.
- Open-Source and Licensed: All code, pre-trained model weights, and data are available under the MIT license.
Documentation
Model Details
| Property | Details |
|---|---|
| Developed by | The RDT team, consisting of researchers from the TSAIL group at Tsinghua University |
| Task Type | Vision-Language-Action (language, image => robot actions) |
| Model Type | Diffusion Policy with Transformers |
| License | MIT |
| Language(s) (NLP) | en |
| Multi-Modal Encoders | Vision Backbone: [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384); Language Model: [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) |
| Pre-Training Datasets | 46 datasets including [RT-1 Dataset](https://robotics-transformer1.github.io/), RH20T, [DROID](https://droid-dataset.github.io/), [BridgeData V2](https://rail-berkeley.github.io/bridgedata/), RoboSet, and a subset of [Open X-Embodiment](https://robotics-transformer-x.github.io/). See [this link](https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md#download-and-prepare-datasets) for a detailed list. |
| Repository | https://github.com/thu-ml/RoboticsDiffusionTransformer |
| Paper | https://arxiv.org/pdf/2410.07864 |
| Project Page | https://rdt-robotics.github.io/rdt-robotics/ |
Uses
RDT takes language instructions, RGB images (up to three views), control frequency (if any), and proprioception as input and predicts the next 64 robot actions. It supports the control of almost all robot manipulators through a unified action space, which includes all the main physical quantities of the robot manipulator (e.g., end-effector and joint, position and velocity, and wheeled locomotion). To deploy it on your robot platform, you need to fill the relevant quantities of the raw action vector into the unified space vector, as illustrated in the sketch below. For more information, see [our repository](https://github.com/thu-ml/RoboticsDiffusionTransformer).
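As an illustration of what filling the unified space vector looks like in practice, here is a minimal sketch that pads raw joint positions into a fixed-size state vector. The dimensionality and index layout used here (`UNIFIED_DIM`, `RIGHT_ARM_JOINT_IDX`, `RIGHT_GRIPPER_IDX`) are hypothetical placeholders; the actual unified-space definition lives in the repository and should be used instead.

```python
import numpy as np

# Hypothetical layout: the real unified-space dimensionality and index
# mapping are defined in the RDT repository; these names are placeholders.
UNIFIED_DIM = 128
RIGHT_ARM_JOINT_IDX = list(range(0, 6))   # slots reserved for right-arm joint positions
RIGHT_GRIPPER_IDX = 6                     # slot reserved for the right gripper opening

def fill_unified_vector(joint_pos, gripper_open):
    """Place raw robot quantities into their reserved slots; unused slots stay zero-padded."""
    state = np.zeros(UNIFIED_DIM, dtype=np.float32)
    state[RIGHT_ARM_JOINT_IDX] = joint_pos    # 6 joint positions (rad)
    state[RIGHT_GRIPPER_IDX] = gripper_open   # normalized gripper opening in [0, 1]
    return state

# Example: a single-arm, joint-position-controlled robot
state_vec = fill_unified_vector(joint_pos=np.zeros(6), gripper_open=1.0)
```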
Important Note
Due to the embodiment gap, RDT cannot yet generalize to new robot platforms (not seen in the pre-training datasets). In this case, we recommend collecting a small dataset of the target robot and then using it to fine-tune RDT. See [our repository](https://github.com/thu-ml/RoboticsDiffusionTransformer) for a tutorial.
Usage Examples
Basic Usage
```python
import torch
from PIL import Image
from typing import List

from scripts.agilex_model import create_model

# Names of the cameras used as visual input (up to three views).
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']

config = {
    'episode_len': 1000,   # maximum episode length
    'state_dim': 14,       # dimension of the proprioceptive state
    'chunk_size': 64,      # number of future actions predicted per inference
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"

model = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    pretrained='robotics-diffusion-transformer/rdt-170m',
    control_frequency=25,
)

# Load pre-computed language embeddings (see the repository for how to encode instructions).
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']

images: List[Image.Image] = ...  # RGB images from the cameras listed above
proprio = ...                    # current proprioceptive state (shape: [state_dim])

# Predict the next `chunk_size` actions.
actions = model.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding
)
```
License
All the code, pre-trained model weights, and data of RDT-170M are licensed under the MIT license.
Citation
If you find our work helpful, please cite us:
```bibtex
@article{liu2024rdt,
  title={RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation},
  author={Liu, Songming and Wu, Lingxuan and Li, Bangguo and Tan, Hengkai and Chen, Huayu and Wang, Zhengyi and Xu, Ke and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2410.07864},
  year={2024}
}
```
Thank you!