MiniVLA: Open-Source Vision-Language Model for Image-Text-to-Text Multimodal Tasks

Minivla Vq Libero90 Prismatic

Developed by Stanford-ILIAD

MiniVLA is a lightweight vision-language model that is compatible with the Prismatic VLMs training framework and supports image-text-to-text multimodal tasks.

Image-to-Text

Transformers

Language: English

Open Source License: MIT

Tags: Multimodal Pretraining, Vision-Language Understanding, Robotics

Release Date: 12/11/2024

Model Overview

MiniVLA is a pretrained multimodal vision-language model focused on image-text to text tasks. The model is compatible with the Prismatic VLMs training framework and suitable for full fine-tuning.

Model Features

Compatible with Prismatic Training Framework

The model can be fully fine-tuned directly with the Prismatic VLMs project codebase

Lightweight Design

Smaller parameter count than large vision-language models (this checkpoint is the MiniVLA VQ 1B variant) while maintaining strong performance

Multimodal Capability

Capable of handling joint understanding and generation tasks involving both images and text
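Since the model is described as ready for full fine-tuning with the Prismatic VLMs codebase, a first practical step is usually to fetch the checkpoint files locally. The sketch below is one possible way to do that, not the developers' documented procedure. It assumes the checkpoint is hosted on the Hugging Face Hub and that its repository id can be derived from the developer and model names shown on this page; both are assumptions to verify before use. The helper names `hub_repo_id` and `download_checkpoint` are hypothetical and exist only for this illustration.

```python
from pathlib import Path

# Assumption: the Hub repository id follows from the developer and model
# names shown on this page. Verify the id before relying on it.
ORG = "Stanford-ILIAD"
MODEL_NAME = "Minivla Vq Libero90 Prismatic"


def hub_repo_id(org: str, model_name: str) -> str:
    """Build a Hub-style id ("org/model-name") from the display names."""
    slug = "-".join(model_name.lower().split())
    return f"{org}/{slug}"


def download_checkpoint(repo_id: str, target_dir: str = "checkpoints") -> Path:
    """Download every file in a Hub repository into a local directory."""
    # Imported lazily so hub_repo_id works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = Path(target_dir) / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir


REPO_ID = hub_repo_id(ORG, MODEL_NAME)
print(REPO_ID)
```

Calling `download_checkpoint(REPO_ID)` would place the files under `checkpoints/`, which requires network access and the `huggingface_hub` package. Fine-tuning itself would then be run from the Prismatic VLMs project; its training entry points are not described on this page, so they are left out here.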

Model Capabilities

Image Understanding

Text Generation

Multimodal Reasoning

Visual Question Answering

Use Cases

Robotics

Visual Navigation Command Understanding

Assists robots in understanding visual scenes and generating corresponding action commands

Content Generation

Image Caption Generation

Generates natural language descriptions based on input images
