Open-source LLaVA-Phi-3-mini-gguf Model - Effortlessly Convert Images to Text for Free!

Llava Phi 3 Mini Gguf

Developed by xtuner

LLaVA-Phi-3-mini is a fine-tuned LLaVA model based on Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, specializing in image-to-text tasks.

Image-to-Text #Multimodal Dialogue #Image-to-Text #Efficient Fine-tuning

Downloads 1,676

Release Time : 4/25/2024

Model Overview

This model combines the language capabilities of Phi-3-mini-4k-instruct with the visual encoding power of CLIP-ViT-Large-patch14-336 for image understanding and text generation tasks.

Model Features

Efficient Fine-tuning

Utilizes the XTuner toolkit for efficient fine-tuning, combining the strengths of Phi-3-mini and CLIP-ViT.

Multimodal Capability

Capable of processing both image and text inputs to generate relevant textual descriptions.

High Performance

Demonstrates excellent performance across multiple benchmarks such as MMBench, MMMU, and SEED-IMG.

Model Capabilities

Image Understanding

Text Generation

Multimodal Reasoning

Use Cases

Image Captioning

Automatic Image Annotation

Generates detailed textual descriptions for images, suitable for content management and retrieval.

Achieved 70.0 accuracy on the SEED-IMG test.

Visual Question Answering

Image Content Q&A

Answers complex questions about image content.

Achieved 69.2 accuracy on the MMBench test.

datasets:

Lin-Chen/ShareGPT4V pipeline_tag: image-to-text

Model

llava-phi-3-mini is a LLaVA model fine-tuned from microsoft/Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.

Note: This model is in GGUF format.

Resources:

GitHub: xtuner
Official LLaVA format model: xtuner/llava-phi-3-mini
HuggingFace LLaVA format model: xtuner/llava-phi-3-mini-hf
XTuner LLaVA format model: xtuner/llava-phi-3-mini-xtuner

Details

Model	Visual Encoder	Projector	Resolution	Pretraining Strategy	Fine-tuning Strategy	Pretrain Dataset	Fine-tune Dataset	Pretrain Epoch	Fine-tune Epoch
LLaVA-v1.5-7B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, Frozen ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)	1	1
LLaVA-Llama-3-8B	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	LLaVA-PT (558K)	LLaVA-Mix (665K)	1	1
LLaVA-Llama-3-8B-v1.1	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, LoRA ViT	ShareGPT4V-PT (1246K)	InternVL-SFT (1268K)	1	1
LLaVA-Phi-3-mini	CLIP-L	MLP	336	Frozen LLM, Frozen ViT	Full LLM, Full ViT	ShareGPT4V-PT (1246K)	InternVL-SFT (1268K)	1	2

Results

Model	MMBench Test (EN)	MMMU Val	SEED-IMG	AI2D Test	ScienceQA Test	HallusionBench aAcc	POPE	GQA	TextVQA	MME	MMStar
LLaVA-v1.5-7B	66.5	35.3	60.5	54.8	70.4	44.9	85.9	62.0	58.2	1511/348	30.3
LLaVA-Llama-3-8B	68.9	36.8	69.8	60.9	73.3	47.3	87.2	63.5	58.0	1506/295	38.2
LLaVA-Llama-3-8B-v1.1	72.3	37.1	70.1	70.0	72.9	47.7	86.4	62.6	59.0	1469/349	45.1
LLaVA-Phi-3-mini	69.2	41.4	70.0	69.3	73.7	49.8	87.3	61.5	57.8	1477/313	43.7

Quickstart

Download models

# mmproj
wget https://huggingface.co/xtuner/llava-phi-3-mini-gguf/resolve/main/llava-phi-3-mini-mmproj-f16.gguf

# fp16 llm
wget https://huggingface.co/xtuner/llava-phi-3-mini-gguf/resolve/main/llava-phi-3-mini-f16.gguf

# int4 llm
wget https://huggingface.co/xtuner/llava-phi-3-mini-gguf/resolve/main/llava-phi-3-mini-int4.gguf

# (optional) ollama fp16 modelfile
wget https://huggingface.co/xtuner/llava-phi-3-mini-gguf/resolve/main/OLLAMA_MODELFILE_F16

# (optional) ollama int4 modelfile
wget https://huggingface.co/xtuner/llava-phi-3-mini-gguf/resolve/main/OLLAMA_MODELFILE_INT4

Chat by `ollama`

Note: llava-phi-3-mini uses the Phi-3-instruct chat template.

# fp16
ollama create llava-phi3-f16 -f ./OLLAMA_MODELFILE_F16
ollama run llava-phi3-f16 "xx.png Describe this image"

# int4
ollama create llava-phi3-int4 -f ./OLLAMA_MODELFILE_INT4
ollama run llava-phi3-int4 "xx.png Describe this image"

Chat by `./llava-cli`

Build llama.cpp (docs) .
Build ./llava-cli (docs).

Note: llava-phi-3-mini uses the Phi-3-instruct chat template.

# fp16
./llava-cli -m ./llava-phi-3-mini-f16.gguf --mmproj ./llava-phi-3-mini-mmproj-f16.gguf --image YOUR_IMAGE.jpg -c 4096

# int4
./llava-cli -m ./llava-phi-3-mini-int4.gguf --mmproj ./llava-phi-3-mini-mmproj-f16.gguf --image YOUR_IMAGE.jpg -c 4096

Reproduce

Please refer to docs.

Citation

@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished = {\url{https://github.com/InternLM/xtuner}},
    year={2023}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご