RDT-170M Open-Source Model - Free Imitation Learning for Robot Vision-Language-Action Tasks

Developed by robotics-diffusion-transformer

RDT-170M is a 170M-parameter imitation learning Diffusion Transformer model for robot vision-language-action tasks.

Multimodal Fusion

Transformers

Language: English  License: MIT  #multimodal-robot-control #diffusion-transformer #vision-language-action

Downloads: 278

Release date: 10/23/2024

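Model Overview

RDT-170M is a Transformer-based diffusion policy model that predicts the next 64 robot actions from a language instruction and multi-view RGB images. It is compatible with a wide range of mobile manipulator platforms.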

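Model Features

Multimodal input

Accepts a language instruction and RGB images from up to three views

Broad compatibility

Works with single-arm and dual-arm robots, joint-space and end-effector-space control, and position and velocity control

Unified action space

Supports multiple robot control modes through a single unified action space

Large-scale pre-training

Pre-trained on 46 robot datasets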

Model Capabilities

Vision-language understanding

Robot action prediction

Multimodal fusion

Diffusion model inference

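Use Cases

Robot Control

Mobile manipulator control

Controls a mobile manipulator to carry out tasks from language instructions and visual input

Predicts the next 64 robot actions

Coordinated dual-arm manipulation

Controls a dual-arm robot to complete coordinated manipulation tasks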

🚀 RDT-170M

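RDT-170M is a 170M-parameter imitation learning Diffusion Transformer (RDT(small) in the ablation studies). It has a hidden size of 1024 and a depth of 14, each half that of RDT-1B. Given a language instruction and RGB images from up to three views, RDT predicts the next 64 robot actions. RDT is compatible with almost all modern mobile manipulators, from single-arm to dual-arm, joint to end-effector, position to velocity, and even wheeled locomotion.

All the code, pre-trained model weights, and data are licensed under the MIT license.

For more information, see our project page and paper.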

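📚 Documentation

Model Details

Property	Details
Developed by	The RDT team, consisting of researchers from the TSAIL group at Tsinghua University
Task type	Vision-Language-Action (language, image => robot actions)
Model type	Diffusion policy with Transformers
License	MIT
Language (NLP)	English
Multimodal encoders	Vision backbone: siglip-so400m-patch14-384; language model: t5-v1_1-xxl
Pre-training datasets	46 datasets, including the RT-1 dataset, RH20T, DROID, BridgeData V2, RoboSet, and a subset of Open X-Embodiment. See the code repository for the detailed list.
Code repository	https://github.com/thu-ml/RoboticsDiffusionTransformer
Paper	https://arxiv.org/pdf/2410.07864
Project page	https://rdt-robotics.github.io/rdt-robotics/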

Uses

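RDT takes a language instruction, RGB images (up to three views), the control frequency (if any), and proprioception as input, and predicts the next 64 robot actions.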
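
As a rough illustration of what a 64-action prediction means in time, the sketch below converts a chunk size and control frequency into a horizon in seconds. It assumes each predicted action corresponds to exactly one control tick, which this page does not state explicitly; the helper name `chunk_horizon_seconds` is hypothetical and not part of the RDT code base.

```python
def chunk_horizon_seconds(chunk_size: int, control_frequency_hz: float) -> float:
    """Return the time span covered by one action chunk.

    Assumption: one predicted action per control tick.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if control_frequency_hz <= 0:
        raise ValueError("control_frequency_hz must be positive")
    return chunk_size / control_frequency_hz


# Values taken from the usage example on this page: 64 actions at 25 Hz.
horizon = chunk_horizon_seconds(64, 25)
```

Under that assumption, 64 actions at 25 Hz cover 2.56 seconds of motion per prediction.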

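RDT supports control of almost all robot manipulators through a unified action space, which includes all the main physical quantities of a robot manipulator (e.g., end-effector and joint, position and velocity, and wheeled locomotion). To deploy on your robot platform, you need to fill the relevant quantities of the raw action vector into the unified space vector. See our code repository for more information.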
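
The idea of filling a robot's raw quantities into a larger unified vector can be sketched as follows. This is a minimal illustration, not the layout used by RDT: the vector size `UNIFIED_DIM`, the slot names, and the indices in `SLOT_INDEX` are all hypothetical, and the availability mask is an assumption about how unused slots might be marked. The real index mapping is defined in the code repository.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout for illustration only; the real mapping lives in the RDT repository.
UNIFIED_DIM = 16
SLOT_INDEX: Dict[str, int] = {
    "right_arm_joint_0_pos": 0,
    "right_arm_joint_1_pos": 1,
    "right_gripper_open": 2,
    "base_vel_x": 3,
}


def to_unified(raw: Dict[str, float]) -> Tuple[List[float], List[int]]:
    """Scatter named raw quantities into a zero-padded unified vector.

    Returns the vector and a 0/1 mask marking which slots were filled.
    """
    vec = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for name, value in raw.items():
        idx = SLOT_INDEX[name]  # raises KeyError for quantities the layout does not know
        vec[idx] = float(value)
        mask[idx] = 1
    return vec, mask


def from_unified(vec: Sequence[float], names: Sequence[str]) -> Dict[str, float]:
    """Gather the quantities this robot actually uses back out of a unified vector."""
    return {name: vec[SLOT_INDEX[name]] for name in names}


vec, mask = to_unified({"right_arm_joint_1_pos": 0.5, "base_vel_x": -1.0})
recovered = from_unified(vec, ["right_arm_joint_1_pos", "base_vel_x"])
```

The same mapping is used in both directions: proprioception is scattered into the unified vector before inference, and predicted actions are gathered back into the robot's own ordering afterwards.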

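⚠️ Important Note

Because of the embodiment gap, RDT cannot yet generalize to new robot platforms (those not seen in the pre-training datasets). In this case, we recommend collecting a small dataset on the target robot and using it to fine-tune RDT. See our code repository for tutorials.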

💻 Usage Example

Basic Usage

# Please first clone the repository and install dependencies
# Then switch to the root directory of the repository by "cd RoboticsDiffusionTransformer"
from typing import List

import torch
from PIL import Image

# Import a create function from the code base
from scripts.agilex_model import create_model

# Names of cameras used for visual input
CAMERA_NAMES = ['cam_high', 'cam_right_wrist', 'cam_left_wrist']
config = {
    'episode_len': 1000,  # Max length of one episode
    'state_dim': 14,      # Dimension of the robot's state
    'chunk_size': 64,     # Number of actions to predict in one step
    'camera_names': CAMERA_NAMES,
}
pretrained_vision_encoder_name_or_path = "google/siglip-so400m-patch14-384"
# Create the model with the specified configuration
model = create_model(
    args=config,
    dtype=torch.bfloat16,
    pretrained_vision_encoder_name_or_path=pretrained_vision_encoder_name_or_path,
    pretrained='robotics-diffusion-transformer/rdt-170m',  # RDT-170M checkpoint
    control_frequency=25,
)

# Start inference process
# Load the pre-computed language embeddings
# Refer to scripts/encode_lang.py for how to encode the language instruction
lang_embeddings_path = 'your/language/embedding/path'
text_embedding = torch.load(lang_embeddings_path)['embeddings']
images: List[Image.Image] = ...  # The images from the last 2 frames
proprio = ...  # The current robot state
# Perform inference to predict the next `chunk_size` actions
actions = model.step(
    proprio=proprio,
    images=images,
    text_embeds=text_embedding
)
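
Because each call predicts a chunk of future actions, a controller has to decide how many of them to execute before querying the policy again. The sketch below shows one common pattern, a receding-horizon loop, under stated assumptions: `predict_chunk`, `read_state`, `send_action`, and `execute_per_chunk` are hypothetical names, not part of the RDT code base, and the dummy policy stands in for a wrapper around the model's `step` call. How many actions to execute per chunk is a deployment choice that this page does not specify.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Action = Sequence[float]


def run_chunked_control(
    predict_chunk: Callable[[Sequence[float]], List[Action]],
    read_state: Callable[[], Sequence[float]],
    send_action: Callable[[Action], None],
    total_steps: int,
    execute_per_chunk: int,
) -> int:
    """Execute actions chunk by chunk and return the number of policy queries."""
    if execute_per_chunk <= 0:
        raise ValueError("execute_per_chunk must be positive")
    pending: Deque[Action] = deque()
    queries = 0
    for _ in range(total_steps):
        if not pending:
            # Re-plan from the latest observation once the queued actions run out.
            chunk = predict_chunk(read_state())
            if not chunk:
                raise ValueError("policy returned an empty chunk")
            queries += 1
            pending.extend(chunk[:execute_per_chunk])
        send_action(pending.popleft())
    return queries


# Dummy stand-ins so the loop can be exercised without a robot or a model.
sent: List[Action] = []
dummy_policy = lambda state: [[float(i)] * 14 for i in range(64)]  # 64 actions, 14-dim
num_queries = run_chunked_control(
    predict_chunk=dummy_policy,
    read_state=lambda: [0.0] * 14,
    send_action=sent.append,
    total_steps=100,
    execute_per_chunk=32,
)
```

Executing fewer actions per chunk re-plans more often and reacts faster to new observations, at the cost of more inference calls.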

📄 License

This project is released under the MIT license.

📖 Citation

If you find our work helpful, please cite our paper:

@article{liu2024rdt,
  title={RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation},
  author={Liu, Songming and Wu, Lingxuan and Li, Bangguo and Tan, Hengkai and Chen, Huayu and Wang, Zhengyi and Xu, Ke and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2410.07864},
  year={2024}
}

Thank you for your support!