
MiniVLA Wrist VQ LIBERO-90 Prismatic

Developed by Stanford-ILIAD
MiniVLA is a vision-language-action model for robotics that supports multimodal image-text-to-text tasks.
Downloads: 18
Release Date: 12/12/2024

Model Overview

MiniVLA is a 1-billion-parameter vision-language-action model designed for robotics. It processes image and text inputs to generate text outputs, and this checkpoint is distributed in a format compatible with the Prismatic VLMs training scripts, making it suitable for full fine-tuning. As the name suggests, this variant appears to target the LIBERO-90 benchmark with wrist-camera observations and vector-quantized (VQ) action tokens.
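
As a rough, hedged sketch of getting started, the checkpoint can be fetched from the Hugging Face Hub for use with the Prismatic training scripts. The repository id below is inferred from the developer and model name and should be verified before use.

    # Fetch the prismatic-format checkpoint for use with the Prismatic
    # VLMs training scripts. The repo id is an assumption inferred from
    # the model title -- verify it on the Hugging Face Hub.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="Stanford-ILIAD/minivla-wrist-vq-libero90-prismatic",
    )
    print(f"Checkpoint files downloaded to: {local_dir}")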

Model Features

Prismatic Training Script Compatibility
Uses a checkpoint format compatible with the Prismatic VLMs project codebase, enabling full fine-tuning with native PyTorch FSDP (see the sketch after this list).
Multimodal Processing Capability
Capable of processing both image and text inputs to generate text outputs.
Robotics Optimization
Designed and optimized specifically for robotics applications.
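
As a hedged illustration of the fine-tuning setup named above, the sketch below shows native PyTorch FSDP wrapping around a placeholder network. The real entry point is the Prismatic VLMs training scripts; the toy module here is a stand-in, not the actual MiniVLA architecture.

    # Minimal sketch of full fine-tuning with native PyTorch FSDP.
    # Launch with one process per GPU, e.g. `torchrun --nproc_per_node=N`.
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        dist.init_process_group("nccl")
        torch.cuda.set_device(dist.get_rank())

        # Stand-in for the 1B-parameter VLA model (an assumption).
        model = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096),
            torch.nn.GELU(),
            torch.nn.Linear(4096, 1024),
        ).cuda()

        # FSDP shards parameters, gradients, and optimizer state across ranks.
        fsdp_model = FSDP(model)
        optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=2e-5)

        x = torch.randn(8, 1024, device="cuda")
        loss = fsdp_model(x).pow(2).mean()
        loss.backward()
        optimizer.step()

    if __name__ == "__main__":
        main()

Because each rank holds only a shard of every wrapped module's parameters, FSDP is what makes full fine-tuning of a model this size feasible across modest GPUs.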

Model Capabilities

Image Understanding
Text Generation
Multimodal Processing
Robot Control (illustrated in the sketch below)
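
To make the capabilities above concrete, here is a conceptual, hedged sketch of the vision-language-action loop: a camera frame and a language instruction go in, and a low-level robot action comes out. The VLAPolicy class is a hypothetical stand-in, not MiniVLA's real interface.

    import numpy as np

    class VLAPolicy:
        # Hypothetical stand-in for a vision-language-action model.
        def predict_action(self, image: np.ndarray, instruction: str) -> np.ndarray:
            # A real VLA tokenizes the image and instruction, generates
            # action tokens as text, and decodes them into a continuous
            # command; this stub just returns a zero action.
            return np.zeros(7)  # [dx, dy, dz, droll, dpitch, dyaw, gripper]

    policy = VLAPolicy()
    frame = np.zeros((224, 224, 3), dtype=np.uint8)  # e.g. a wrist-camera image
    action = policy.predict_action(frame, "pick up the red block")
    print(action)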

Use Cases

Robotics
Vision-Language Navigation
Robot navigation driven by combined visual observations and language instructions
Multimodal Interaction
Robots that understand visual and language inputs and respond accordingly