Llava-Phi2 Open-Source Multimodal Model - Supports Image-Text to Text Tasks, Excellent for Vision-Language Processing

Llava Phi2

Developed by RaviNaik

Llava-Phi2 is a multimodal implementation based on Phi2, combining vision and language processing capabilities, suitable for image-text-to-text tasks.

Image-to-Text

Transformers

English

Open Source License: MIT

#Multimodal QA #Lightweight LLM #Image-Text Understanding

Downloads 153

Release Time : 1/24/2024

Model Overview

This model integrates the Phi2 language model with a CLIP vision module, so it can handle joint image-text tasks such as visual question answering and image caption generation.
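The way a vision module and a language model are typically joined in LLaVA-style designs can be sketched as follows. This is only an illustration under stated assumptions: the single linear projector, the toy dimensions, and the choice to place image tokens before text tokens are all hypothetical simplifications, and the helper names (`linear`, `build_multimodal_sequence`) do not come from the llava-phi code base.

```python
import random


def linear(vec, weight):
    """Multiply a feature vector (length d_in) by a d_in x d_out weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


def build_multimodal_sequence(patch_features, text_embeddings, projector):
    """Project each image patch feature into the text embedding space,
    then place the projected image tokens in front of the text tokens."""
    image_tokens = [linear(patch, projector) for patch in patch_features]
    return image_tokens + text_embeddings


# Toy sizes chosen only for illustration; real models are far larger.
VISION_DIM, TEXT_DIM = 4, 6
NUM_PATCHES, NUM_TEXT_TOKENS = 3, 2

rng = random.Random(0)
projector = [[rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(VISION_DIM)]
patch_features = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(TEXT_DIM)] for _ in range(NUM_TEXT_TOKENS)]

sequence = build_multimodal_sequence(patch_features, text_embeddings, projector)
print(len(sequence), len(sequence[0]))  # 5 6
```

The language model then processes the combined sequence as if every entry were an ordinary token embedding, which is why the projector must map vision features to the same width as the text embeddings.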

Model Features

Multimodal Capability

Combines vision and language processing to understand and generate text related to images.

Efficient Small Model

Built on Phi2, a 2.7-billion-parameter language model, it is small enough to suit resource-limited environments.

Pre-training and Fine-tuning Integration

Pre-trained on the LAION-CC-SBU dataset with BLIP captions (200k samples) and fine-tuned on the COCO-based Instruct 150k dataset.

Model Capabilities

Visual Question Answering

Image Caption Generation

Multimodal Reasoning

Use Cases

Visual Question Answering

Image Content QA

Answer natural language questions about image content.

The model can answer questions about objects, scenes, and actions in images.

Image Caption Generation

Automatic Image Annotation

Generate natural language descriptions for images.

The model produces fluent natural-language descriptions of images.

🚀 Model Card for Llava-Phi2

This is a multimodal implementation of the Phi2 model, inspired by LLaVA-Phi. It pairs the Phi2 language model with a vision encoder so that it can handle image-text tasks.

✨ Features

Multimodal implementation based on the Phi2 model.
Uses a dedicated vision tower and curated datasets for pre-training and fine-tuning.
Offers a practical way to start using the model with provided code examples.

📦 Installation

Prerequisites

Clone the llava-phi repository and navigate into the llava-phi folder:

git clone https://github.com/zhuyiche/llava-phi.git
cd llava-phi

Install the necessary packages:

conda create -n llava_phi python=3.10 -y
conda activate llava_phi
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
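After installation, a quick sanity check can confirm that the environment looks as expected. The sketch below is an assumption-laden helper, not part of the repository: `check_environment` is a hypothetical name, the module name `llava_phi` is inferred from the script path used in this document, and the minimum Python version simply mirrors the `python=3.10` pin above.

```python
import importlib.util
import sys


def check_environment(module_name="llava_phi", min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ expected, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"module '{module_name}' is not importable")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run it inside the activated `llava_phi` conda environment; no output means both checks passed.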

💻 Usage Examples

Basic Usage

Run the model with the following command:

python llava_phi/eval/run_llava_phi.py --model-path="RaviNaik/Llava-Phi2" \
    --image-file="https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true" \
    --query="How many people are there in the image?"
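To ask several questions about the same image, the command above can be assembled programmatically. The following is a minimal sketch: `build_command` and `ask` are hypothetical helpers, and only the three flags shown above (`--model-path`, `--image-file`, `--query`) are used. With `dry_run=True` the commands are printed rather than executed.

```python
import shlex
import subprocess


def build_command(image_file, query, model_path="RaviNaik/Llava-Phi2",
                  script="llava_phi/eval/run_llava_phi.py"):
    """Build the argument list for one run of the evaluation script."""
    return [
        "python", script,
        f"--model-path={model_path}",
        f"--image-file={image_file}",
        f"--query={query}",
    ]


def ask(image_file, questions, dry_run=True):
    """Print (dry run) or execute one command per question."""
    for question in questions:
        cmd = build_command(image_file, question)
        if dry_run:
            print(shlex.join(cmd))  # shell-quoted form, safe to copy and paste
        else:
            subprocess.run(cmd, check=True)


ask(
    "https://huggingface.co/RaviNaik/Llava-Phi2/resolve/main/people.jpg?download=true",
    ["How many people are there in the image?", "What are the people doing?"],
)
```

Passing the arguments as a list avoids shell-quoting problems with spaces and question marks in the query. Each invocation starts a new process and therefore reloads the model, so this approach suits a handful of questions rather than large batches.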

📚 Documentation

Model Details

Property	Details
LLM Backbone	Phi2
Vision Tower	[clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)
Pretraining Dataset	[LAION-CC-SBU dataset with BLIP captions (200k samples)](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain)
Finetuning Dataset	[Instruct 150k dataset based on COCO](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K)
Finetuned Model	[RaviNaik/Llava-Phi2](https://huggingface.co/RaviNaik/Llava-Phi2)
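The vision tower's name encodes its patch size (14 pixels) and input resolution (336 pixels), which together determine how many image patches the encoder produces. The short calculation below works this out; `num_image_patches` is a hypothetical helper, and how many of these patch features actually reach the language model (for example, whether a class token is kept) depends on the implementation and is not covered here.

```python
def num_image_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# clip-vit-large-patch14-336: 336-pixel input, 14-pixel patches.
patches = num_image_patches(336, 14)
print(patches)  # 576 (a 24 x 24 grid)
```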

Model Sources

Original Repository: [LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
Paper: LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
Demo: [Demo Link](https://huggingface.co/spaces/RaviNaik/MultiModal-Phi2)

Acknowledgement

This implementation is based on the wonderful work done by:

[LLaVA-Phi](https://github.com/zhuyiche/llava-phi)
[LLaVA](https://github.com/haotian-liu/LLaVA)
[Phi2](https://huggingface.co/microsoft/phi-2)

📄 License

This project is licensed under the MIT license.
