OpenVLA-OFT: Open-Source Vision-Language-Action Model Fine-Tuned for Higher Speed and Task Success Rate

Openvla 7b Oft Finetuned Libero Spatial

Developed by moojink

OpenVLA-OFT is a vision-language-action model that applies an optimized fine-tuning recipe to the base OpenVLA model, significantly improving both its running speed and its task success rate.

Multimodal Fusion

Transformers

Open Source License: MIT #Vision-Language-Action Fine-tuning #Robot Control Optimization #Efficient Action Generation

Downloads 2,513

Release Time: 2/25/2025

Model Overview

This project fine-tunes vision-language-action models to increase their running speed and task success rate. This checkpoint targets the LIBERO-Spatial tasks.

Model Features

Optimized Fine-tuning Technology

Applies optimized fine-tuning techniques that significantly improve on the performance of the base OpenVLA model

Efficient Action Generation

Generates chunks of continuous actions (sequences of future actions), suitable for robot control tasks

Multimodal Input Processing

Supports multimodal inputs: vision (images), language (task descriptions), and proprioceptive state
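
The three input modalities correspond to the observation dictionary used in the usage example later on this page, with the keys "full_image", "wrist_image", "state", and "task_description". The following is a minimal sketch of a layout check for such a dictionary. The function name check_observation, the expected array shapes, the 224 x 224 image size, and the 8-dimensional state are assumptions made for illustration and are not part of the OpenVLA-OFT code.

```python
from types import SimpleNamespace

# Keys listed in the sample observation of this card's usage example.
REQUIRED_KEYS = ("full_image", "wrist_image", "state", "task_description")


def check_observation(observation):
    """Return a list of layout problems; an empty list means none were found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in observation]

    # Assumption: images are array-like objects with shape (height, width, 3).
    for key in ("full_image", "wrist_image"):
        shape = getattr(observation.get(key), "shape", None)
        if key in observation and (shape is None or len(shape) != 3 or shape[-1] != 3):
            problems.append(f"{key} should be an H x W x 3 array")

    # Assumption: the proprioceptive state is a flat 1-D vector.
    shape = getattr(observation.get("state"), "shape", None)
    if "state" in observation and (shape is None or len(shape) != 1):
        problems.append("state should be a 1-D array")

    if "task_description" in observation and not isinstance(observation["task_description"], str):
        problems.append("task_description should be a string")
    return problems


# Stand-ins exposing only `.shape`, so the sketch runs without NumPy installed.
# The 224 x 224 image size and the 8-dim state are placeholder values.
example_observation = {
    "full_image": SimpleNamespace(shape=(224, 224, 3)),
    "wrist_image": SimpleNamespace(shape=(224, 224, 3)),
    "state": SimpleNamespace(shape=(8,)),
    "task_description": "pick up the black bowl and place it on the plate",
}
print(check_observation(example_observation))
```

In practice the image and state entries would be NumPy arrays, which expose the same `.shape` attribute, so the check applies to them unchanged.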

Model Capabilities

Vision-Language-Action Multimodal Processing

Robot Action Sequence Generation

Continuous Action Prediction

Task-oriented Control

Use Cases

Robot Control

LIBERO-Spatial Task Execution

Generate robot action sequences based on visual and language inputs

Improve task execution speed and success rate

🚀 Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

This repository contains the OpenVLA-OFT checkpoint for LIBERO-Spatial, as described in Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. OpenVLA-OFT significantly improves upon the base OpenVLA model by incorporating optimized fine-tuning techniques.

Project Page: https://openvla-oft.github.io/
Code: https://github.com/openvla-oft/openvla-oft
Other OpenVLA-OFT checkpoints: https://huggingface.co/moojink?search_models=oft

🚀 Quick Start

This example demonstrates generating an action chunk using a pretrained OpenVLA-OFT checkpoint. Ensure you have set up the conda environment as described in the GitHub README.

💻 Usage Examples

🔍 Basic Usage

import pickle
from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import get_action_head, get_processor, get_proprio_projector, get_vla, get_vla_action
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM

# Instantiate config (see class GenerateConfig in experiments/robot/libero/run_libero_eval.py for definitions)
cfg = GenerateConfig(
    pretrained_checkpoint = "moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression = True,
    use_diffusion = False,
    use_film = False,
    num_images_in_input = 2,
    use_proprio = True,
    load_in_8bit = False,
    load_in_4bit = False,
    center_crop = True,
    num_open_loop_steps = NUM_ACTIONS_CHUNK,
    unnorm_key = "libero_spatial_no_noops",
)

# Load OpenVLA-OFT policy and inputs processor
vla = get_vla(cfg)
processor = get_processor(cfg)

# Load MLP action head to generate continuous actions (via L1 regression)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)

# Load proprio projector to map proprio to language embedding space
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# Load sample observation:
#   observation (dict): {
#     "full_image": primary third-person image,
#     "wrist_image": wrist-mounted camera image,
#     "state": robot proprioceptive state,
#     "task_description": task description,
#   }
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)

# Generate robot action chunk (sequence of future actions)
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
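
The config above sets num_open_loop_steps to NUM_ACTIONS_CHUNK, which suggests that a full chunk is executed before the policy is queried again. A minimal sketch of that open-loop pattern follows. It runs without the model: dummy_policy is a hypothetical stand-in for get_vla_action, and the chunk length of 8 and the 7-dimensional actions are placeholder values, not values read from the repository.

```python
from collections import deque

NUM_ACTIONS_CHUNK = 8  # placeholder; the real value comes from prismatic.vla.constants


def dummy_policy(observation):
    """Hypothetical stand-in for get_vla_action: returns one chunk of 7-dim actions."""
    return [[0.0] * 7 for _ in range(NUM_ACTIONS_CHUNK)]


def run_open_loop(policy, get_observation, apply_action, num_steps):
    """Execute num_steps actions, querying the policy only when the queue is empty."""
    action_queue = deque()
    policy_calls = 0
    for _ in range(num_steps):
        if not action_queue:
            # Refill the queue with a fresh chunk predicted from the latest observation.
            action_queue.extend(policy(get_observation()))
            policy_calls += 1
        apply_action(action_queue.popleft())
    return policy_calls


executed = []
calls = run_open_loop(dummy_policy, lambda: {}, executed.append, num_steps=20)
print(calls, len(executed))
```

With a real setup, get_observation would read the cameras and robot state, and apply_action would step the environment. The point of the pattern is that the expensive model call happens once per chunk rather than once per control step.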

📄 License

This project is licensed under the MIT license.

📚 Citation

@article{kim2025fine,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}
