Open-Source OpenVLA-OFT Model: Fine-Tuned Vision-Language-Action Policy with Improved Performance and Speed

OpenVLA 7B OFT Finetuned LIBERO Goal

Developed by moojink

OpenVLA-OFT is a vision-language-action model that applies an optimized fine-tuning recipe to the base OpenVLA model, significantly improving both task performance and inference speed.

Multimodal Fusion

Transformers

Open Source License: MIT

Tags: #Robot action generation #Vision-language-action model #Optimized fine-tuning technology

Downloads: 579

Release Time: 2/25/2025

Model Overview

This model combines vision, language, and action generation capabilities, and is specifically optimized for robot tasks. It can generate continuous action sequences based on visual input and task descriptions.

Model Features

Optimized fine-tuning technology

Adopts the OFT (Optimized Fine-Tuning) recipe, which significantly improves performance over the base OpenVLA model

Multimodal input processing

Can simultaneously process visual images, language descriptions, and proprioceptive state inputs

Continuous action generation

Generates continuous robot action sequences through an MLP action head
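
To make the idea of an MLP action head concrete, here is a minimal pure-Python sketch, not the project's implementation. It assumes the head receives one hidden-state vector and maps it to a chunk of continuous actions, and that training minimizes an L1 (mean absolute error) loss. All names and dimensions below (HIDDEN_DIM, MLP_DIM, CHUNK_LEN, ACTION_DIM, action_head, l1_loss) are hypothetical and chosen only for illustration.

```python
import random


def relu(xs):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in xs]


def linear(xs, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, xs)) + b for row, b in zip(weights, bias)]


def init_linear(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def action_head(hidden, params, chunk_len, action_dim):
    """Two-layer MLP: hidden state -> flat vector -> chunk_len actions of action_dim values."""
    (w1, b1), (w2, b2) = params
    flat = linear(relu(linear(hidden, w1, b1)), w2, b2)
    return [flat[i * action_dim:(i + 1) * action_dim] for i in range(chunk_len)]


def l1_loss(pred, target):
    """Mean absolute error over every value in the action chunk."""
    diffs = [abs(p - t) for pa, ta in zip(pred, target) for p, t in zip(pa, ta)]
    return sum(diffs) / len(diffs)


# Illustrative sizes only (assumed, not taken from the checkpoint).
HIDDEN_DIM, MLP_DIM, CHUNK_LEN, ACTION_DIM = 16, 32, 8, 7

rng = random.Random(0)
params = (
    init_linear(HIDDEN_DIM, MLP_DIM, rng),
    init_linear(MLP_DIM, CHUNK_LEN * ACTION_DIM, rng),
)
hidden = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]
chunk = action_head(hidden, params, CHUNK_LEN, ACTION_DIM)
print(len(chunk), len(chunk[0]))
```

Because the head outputs real-valued vectors directly, no discretization into action tokens is needed in this sketch; the loss simply compares predicted and target values.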

Model Capabilities

Vision-language understanding

Continuous action prediction

Robot task execution

Multimodal data fusion

Use Cases

Robot control

Spatial task execution

Completes manipulation tasks that require spatial reasoning, based on visual input and task descriptions

Outperforms the base OpenVLA model on the LIBERO-Goal task suite

🚀 Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

This repository contains the OpenVLA-OFT checkpoint for LIBERO-Goal, as described in Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. OpenVLA-OFT significantly improves upon the base OpenVLA model in both speed and task success by incorporating optimized fine-tuning techniques.

Project Page: https://openvla-oft.github.io/
Code: https://github.com/openvla-oft/openvla-oft
Other OpenVLA-OFT checkpoints: https://huggingface.co/moojink?search_models=oft

🚀 Quick Start

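This example demonstrates generating an action chunk using a pretrained OpenVLA-OFT checkpoint. Ensure you have set up the conda environment as described in the GitHub README. Note that the snippet as written loads the LIBERO-Spatial checkpoint, normalization key, and sample observation; to run this repository's LIBERO-Goal checkpoint, change pretrained_checkpoint and unnorm_key to their LIBERO-Goal counterparts.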

import pickle
from experiments.robot.libero.run_libero_eval import GenerateConfig
from experiments.robot.openvla_utils import get_action_head, get_processor, get_proprio_projector, get_vla, get_vla_action
from prismatic.vla.constants import NUM_ACTIONS_CHUNK, PROPRIO_DIM
# Instantiate config (see class GenerateConfig in experiments/robot/libero/run_libero_eval.py for definitions)
cfg = GenerateConfig(
    pretrained_checkpoint = "moojink/openvla-7b-oft-finetuned-libero-spatial",
    use_l1_regression = True,
    use_diffusion = False,
    use_film = False,
    num_images_in_input = 2,
    use_proprio = True,
    load_in_8bit = False,
    load_in_4bit = False,
    center_crop = True,
    num_open_loop_steps = NUM_ACTIONS_CHUNK,
    unnorm_key = "libero_spatial_no_noops",
)
# Load OpenVLA-OFT policy and inputs processor
vla = get_vla(cfg)
processor = get_processor(cfg)
# Load MLP action head to generate continuous actions (via L1 regression)
action_head = get_action_head(cfg, llm_dim=vla.llm_dim)
# Load proprio projector to map proprio to language embedding space
proprio_projector = get_proprio_projector(cfg, llm_dim=vla.llm_dim, proprio_dim=PROPRIO_DIM)

# Load sample observation:
#   observation (dict): {
#     "full_image": primary third-person image,
#     "wrist_image": wrist-mounted camera image,
#     "state": robot proprioceptive state,
#     "task_description": task description,
#   }
with open("experiments/robot/libero/sample_libero_spatial_observation.pkl", "rb") as file:
    observation = pickle.load(file)
# Generate robot action chunk (sequence of future actions)
actions = get_vla_action(cfg, vla, processor, observation, observation["task_description"], action_head, proprio_projector)
print("Generated action chunk:")
for act in actions:
    print(act)
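
The config above sets num_open_loop_steps to the chunk length, which suggests that each generated chunk is executed open-loop before the policy is queried again. The following is a minimal sketch of such a control loop under that assumption. The names run_episode, get_observation, query_policy, and apply_action are hypothetical placeholders rather than functions from the OpenVLA-OFT codebase; in practice query_policy would wrap a call like get_vla_action.

```python
from collections import deque


def run_episode(get_observation, query_policy, apply_action, max_steps, num_open_loop_steps):
    """Execute actions chunk by chunk; re-query the policy only when the queue is empty.

    Returns the number of policy queries that were made.
    """
    queue = deque()
    queries = 0
    for _ in range(max_steps):
        if not queue:
            chunk = query_policy(get_observation())
            if len(chunk) == 0:
                raise ValueError("policy returned an empty action chunk")
            # Keep at most num_open_loop_steps actions from the new chunk.
            queue.extend(chunk[:num_open_loop_steps])
            queries += 1
        apply_action(queue.popleft())
    return queries


# Stub environment and policy, for illustration only.
executed = []
stub_observation = {"task_description": "stub task"}


def stub_policy(observation):
    # Pretend the model predicts 8 future actions of 7 values each.
    return [[0.0] * 7 for _ in range(8)]


num_queries = run_episode(
    get_observation=lambda: stub_observation,
    query_policy=stub_policy,
    apply_action=executed.append,
    max_steps=20,
    num_open_loop_steps=8,
)
print(num_queries, len(executed))
```

With 8-step chunks and 20 control steps, the stub policy is queried three times (at steps 0, 8, and 16), which illustrates why predicting chunks reduces the number of model forward passes per episode.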

📄 License

The project is under the MIT license.

📚 Citation

@article{kim2025fine,
  title={Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success},
  author={Kim, Moo Jin and Finn, Chelsea and Liang, Percy},
  journal={arXiv preprint arXiv:2502.19645},
  year={2025}
}

Property	Details
Pipeline Tag	robotics
Library Name	transformers
License	mit
