Open-Source Multimodal Chatbot LLaVA - Free Deployment, Support for Image & Text Multimodal Interaction

Llava Llama 2 7b Chat Lightning Lora Preview

Developed by liuhaotian

LLaVA is an open-source multimodal chatbot, fine-tuned based on LLaMA/Vicuna and trained with GPT-generated multimodal instruction-following data.

Text-to-Image

Transformers

#Multimodal Instruction Following #Visual Reasoning QA #GPT-4 Collaborative Evaluation

Downloads 251

Release Time : 7/19/2023

Model Overview

LLaVA is a multimodal model combining vision and language understanding, primarily used for research on large multimodal models and chatbot development.

Model Features

Multimodal Capability

Processes both image and text inputs for cross-modal understanding

Instruction Following

Capable of understanding and executing complex multimodal instructions

Open-source Model

Fully open-source, available for research and commercial use

Model Capabilities

Image caption generation

Visual Question Answering

Multimodal dialogue

Complex reasoning

Detailed description

Use Cases

Research

Multimodal Model Research

Used to study the performance and capability boundaries of vision-language models

Achieved state-of-the-art performance on the ScienceQA dataset

Application Development

Intelligent Chatbot

Develop dialogue systems capable of understanding image content

🚀 LLaVA Model Card

LLaVA is an open - source chatbot that offers valuable insights and capabilities for research in large multimodal models and chatbots. It is trained on specific datasets and provides a new perspective in the field of AI.

🚀 Quick Start

No quick start steps are provided in the original document.

✨ Features

LLaVA is an auto - regressive language model based on the transformer architecture.
It is trained by fine - tuning LLaMA/Vicuna on GPT - generated multimodal instruction - following data.

📦 Installation

No installation steps are provided in the original document.

💻 Usage Examples

No usage examples are provided in the original document.

📚 Documentation

Model details

Property	Details
Model Type	LLaVA is an open - source chatbot trained by fine - tuning LLaMA/Vicuna on GPT - generated multimodal instruction - following data. It is an auto - regressive language model, based on the transformer architecture.
Model Date	LLaVA - LLaMA - 2 - 7B - Chat - LoRA - Preview was trained in July 2023.
Paper or resources for more information	https://llava-vl.github.io/

Intended use

Property	Details
Primary intended uses	The primary use of LLaVA is research on large multimodal models and chatbots.
Primary intended users	The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

558K filtered image - text pairs from LAION/CC/SBU, captioned by BLIP.
80K GPT - generated multimodal instruction - following data.

Evaluation dataset

A preliminary evaluation of the model quality is conducted by creating a set of 90 visual reasoning questions from 30 unique images randomly sampled from COCO val 2014 and each is associated with three types of questions: conversational, detailed description, and complex reasoning. We utilize GPT - 4 to judge the model outputs. We also evaluate our model on the ScienceQA dataset. Our synergy with GPT - 4 sets a new state - of - the - art on the dataset. See https://llava-vl.github.io/ for more details.

🔧 Technical Details

No technical details are provided in the original document.

📄 License

Where to send questions or comments about the model: [https://github.com/haotian - liu/LLaVA/issues](https://github.com/haotian - liu/LLaVA/issues)

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご