# 🚀 Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
This repository contains the OpenVLA-OFT checkpoint for LIBERO-Goal, as described in the paper *Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success*. OpenVLA-OFT achieves significant improvements over the base OpenVLA model by applying an optimized fine-tuning recipe.
Project page: https://openvla-oft.github.io/

Code repository: https://github.com/openvla-oft/openvla-oft

Other OpenVLA-OFT checkpoints: https://huggingface.co/moojink?search_models=oft
## 🚀 Quick Start
This example shows how to generate an action chunk with a pretrained OpenVLA-OFT checkpoint. Make sure you have set up the conda environment following the instructions in the GitHub README.
### Basic Usage
```python
import pickle
from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import get_action_head, get_processor, get_proprio_projector, get_vla, get_vla_action
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM

# Instantiate config (see GenerateConfig in experiments/robot/libero/run_libero_eval.py for definitions)
cfg = GenerateConfig(
    pretrained_checkpoint="moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression=True,
    use_diffusion=False,
    use_film=False,
    num_images_in_input=2,
    use_proprio=True,
    load_in_8bit=False,
    load_in_4bit=False,
    center_crop=True,
    num_open_loop_steps=NUM_ACTIONS_CHUNK,
    unnorm_key="libero_spatial_no_noops",
)

# Load the OpenVLA-OFT policy and its input processor
vla = get_vla(cfg)
processor = get_processor(cfg)

# Load the MLP action head, which generates continuous actions (via L1 regression)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)

# Load the projector that maps proprioceptive state into the language embedding space
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# Load a sample LIBERO-Spatial observation (a dict containing camera image(s),
# proprioceptive state, and the task description)
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)

# Generate a chunk of NUM_ACTIONS_CHUNK future robot actions
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
```
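The config above sets `num_open_loop_steps=NUM_ACTIONS_CHUNK`, meaning the policy executes an entire predicted action chunk before querying the model again. The sketch below shows what that open-loop pattern might look like in a control loop. It is only an illustration: `env` (a Gym-style environment) and `build_observation()` (a helper that packs the current camera images, proprioceptive state, and task description into the observation dict shown above) are hypothetical placeholders, not part of this repository's documented API.

```python
import numpy as np

def run_episode(env, max_steps=300):
    """Minimal open-loop rollout sketch (hypothetical env and helper; see note above)."""
    obs = env.reset()
    for _ in range(max_steps // NUM_ACTIONS_CHUNK):
        # build_observation() is a hypothetical helper returning the same dict
        # format used above (images, proprio state, task description).
        observation = build_observation(obs)
        actions = get_vla_action(
            cfg, vla, processor, observation, observation["task_description"],
            action_head, proprio_projector,
        )
        # Execute all NUM_ACTIONS_CHUNK actions before re-querying the model,
        # matching num_open_loop_steps=NUM_ACTIONS_CHUNK in the config.
        for act in actions:
            obs, reward, done, info = env.step(np.asarray(act))
            if done:
                return info
    return None
```

Querying the model only once per chunk amortizes inference cost across many control steps, which is part of how OpenVLA-OFT improves control speed relative to single-step action prediction.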
## 📄 License

This project is released under the MIT license.
## 📚 Citation
```bibtex
@article{kim2025fine,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}
```
| Property | Details |
| --- | --- |
| Model type | Vision-Language-Action model |
| Tags | Robotics |
| Library name | transformers |
| License | MIT |