
MiniVLA History2 VQ Libero90 Prismatic

Developed by Stanford-ILIAD
MiniVLA is a compact yet high-performing vision-language-action model that is compatible with the Prismatic VLMs training scripts, making it suitable for robotics and multimodal tasks.
Downloads: 22
Release Date: 12/11/2024

Model Overview

MiniVLA is a vision-language-action model that performs image-text-to-text generation with multimodal processing. It is compatible with the Prismatic VLMs project codebase and supports both full fine-tuning and parameter-efficient fine-tuning via LoRA.
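As a rough, unverified sketch, the checkpoint can presumably be loaded through the Hugging Face transformers remote-code path used by related OpenVLA-style Prismatic models; the repository id, prompt format, and processor call below are assumptions, not documented APIs of this specific checkpoint.

```python
# Minimal loading sketch, assuming this checkpoint follows the OpenVLA-style
# remote-code convention on the Hugging Face Hub (unverified assumption).
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "Stanford-ILIAD/minivla-history2-vq-libero90-prismatic"  # assumed repo id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

# Joint image + text input: a camera frame plus a language instruction.
image = Image.open("frame.png")
prompt = "In: What action should the robot take to pick up the mug?\nOut:"
inputs = processor(prompt, image, return_tensors="pt").to(
    "cuda", dtype=torch.bfloat16
)

# Generate output tokens (decoded downstream into a robot action).
output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```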

Model Features

Compatible with Prismatic Training Scripts
Supports native PyTorch FSDP full fine-tuning and plugs into the Prismatic VLMs project codebase (see the fine-tuning sketch after this list).
Parameter-Efficient Fine-Tuning
Supports parameter-efficient fine-tuning via LoRA, which suits limited computational budgets (also sketched after this list).
Multimodal Processing
Processes joint image and text inputs for vision-language-action modeling.
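As a loose illustration of the two fine-tuning paths above, the sketch below wraps the model from the earlier loading example either in PyTorch FSDP (full fine-tuning) or in a LoRA adapter via the peft library; in practice you would pick one path, and the LoRA target-module names are assumptions that must match the checkpoint's actual layer names.

```python
# Sketch of the two fine-tuning paths; `model` is assumed to come from the
# earlier loading example, and the LoRA target modules are assumed names.
import functools

from peft import LoraConfig, get_peft_model
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# --- Path 1: full fine-tuning with native PyTorch FSDP -------------------
# Assumes torch.distributed is already initialized (e.g. via torchrun).
wrap_policy = functools.partial(
    size_based_auto_wrap_policy, min_num_params=1_000_000
)
fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy)

# --- Path 2: parameter-efficient fine-tuning with LoRA -------------------
lora_cfg = LoraConfig(
    r=32,                  # adapter rank
    lora_alpha=16,         # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
)
peft_model = get_peft_model(model, lora_cfg)
peft_model.print_trainable_parameters()  # only adapter weights are trainable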

Model Capabilities

Image-text-to-text generation
Multimodal processing
Vision-language-action modeling

Use Cases

Robotics
Vision-Language-Action Control
Control a robot to perform specific actions from joint image and text inputs (see the control-loop sketch after this list).
Multimodal Interaction
Image Caption Generation
Generate text descriptions for input images.
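To make the control use case concrete, here is a hypothetical closed-loop sketch that reuses the `processor` and `model` from the loading example; `env`, the observation key, and `decode_action` are placeholders invented for illustration, since the card does not specify the action-decoding interface.

```python
# Hypothetical control loop; `env`, obs["camera_rgb"], and decode_action()
# are illustrative placeholders, not APIs documented by this model.
import torch
from PIL import Image

instruction = "pick up the mug"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

obs = env.reset()  # hypothetical robot / simulator interface
for _ in range(200):
    image = Image.fromarray(obs["camera_rgb"])  # assumed observation key
    inputs = processor(prompt, image, return_tensors="pt").to(
        "cuda", dtype=torch.bfloat16
    )
    token_ids = model.generate(**inputs, max_new_tokens=32)
    action = decode_action(token_ids)  # hypothetical: tokens -> robot action
    obs, reward, done, info = env.step(action)
    if done:
        break
```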