llava-phi-3-mini
llava-phi-3-mini is a fine-tuned LLaVA model that combines strong language and visual understanding capabilities, offering high-quality image-to-text generation.
Quick Start
Chat by pipeline
from transformers import pipeline
from PIL import Image
import requests

model_id = "xtuner/llava-phi-3-mini-hf"
pipe = pipeline("image-to-text", model=model_id, device=0)

# Fetch a demo diagram image from the transformers documentation assets.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Phi-3 chat format; <image> marks where the image tokens are inserted.
prompt = "<|user|>\n<image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud<|end|>\n<|assistant|>\n"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
>>> [{'generated_text': '\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud (1) lava'}]
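If GPU memory is tight, the pipeline can also load the weights in half precision. `torch_dtype` is a standard `pipeline` argument; applying it to this checkpoint is our assumption rather than something the snippet above shows, and it reuses `model_id` from that snippet.

import torch

# Same pipeline as above, with float16 weights to roughly halve GPU memory.
pipe = pipeline("image-to-text", model=model_id, device=0, torch_dtype=torch.float16)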
Chat by pure transformers
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "xtuner/llava-phi-3-mini-hf"
prompt = "<|user|>\n<image>\nWhat are these?<|end|>\n<|assistant|>\n"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Load the model in half precision on GPU 0.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors='pt').to(0, torch.float16)

# Greedy decoding; the first two tokens of the sequence are dropped before decoding.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
>>> What are these? These are two cats sleeping on a pink couch.
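For other questions, a small helper (ours, not part of the upstream card) keeps the Phi-3 chat format consistent. It reuses model, processor, and raw_image from the snippet above.

def build_prompt(question: str) -> str:
    # Phi-3 chat format; <image> marks where the processor injects image tokens.
    return f"<|user|>\n<image>\n{question}<|end|>\n<|assistant|>\n"

inputs = processor(build_prompt("How many cats are there?"), raw_image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))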
Reproduce
Please refer to the docs.
Features
Documentation
Model
llava-phi-3-mini is a LLaVA model fine-tuned from microsoft/Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336 with ShareGPT4V-PT and InternVL-SFT by XTuner.
Note: This model is in HuggingFace LLaVA format.
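Because the checkpoint is in HuggingFace LLaVA format, the standard transformers quantization path should also apply. Below is a minimal sketch, assuming bitsandbytes and accelerate are installed; it has not been verified against this checkpoint.

import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "xtuner/llava-phi-3-mini-hf"

# Quantize weights to 4-bit NF4 and run compute in float16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)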
Resources:
Details
| Property | Details |
|---|---|
| Model Type | LLaVA model fine-tuned from microsoft/Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336 |
| Training Data | ShareGPT4V-PT and InternVL-SFT |
| Model | Visual Encoder | Projector | Resolution | Pretraining Strategy | Fine-tuning Strategy | Pretrain Dataset | Fine-tune Dataset | Pretrain Epoch | Fine-tune Epoch |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-7B | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, Frozen ViT | LLaVA-PT (558K) | LLaVA-Mix (665K) | 1 | 1 |
| LLaVA-Llama-3-8B | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, LoRA ViT | LLaVA-PT (558K) | LLaVA-Mix (665K) | 1 | 1 |
| LLaVA-Llama-3-8B-v1.1 | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, LoRA ViT | ShareGPT4V-PT (1246K) | InternVL-SFT (1268K) | 1 | 1 |
| LLaVA-Phi-3-mini | CLIP-L | MLP | 336 | Frozen LLM, Frozen ViT | Full LLM, Full ViT | ShareGPT4V-PT (1246K) | InternVL-SFT (1268K) | 1 | 2 |
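As a conceptual illustration of the table above (not XTuner's actual training code): every recipe keeps a CLIP-L encoder at 336 px feeding an MLP projector into the LLM. The sketch below assumes CLIP-ViT-L's 1024-dim patch features and Phi-3-mini's 3072-dim embeddings.

import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    # Two-layer MLP mapping ViT patch features into the LLM embedding space.
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# At 336 px with 14 px patches, CLIP-ViT-L/336 yields 24 * 24 = 576 patch tokens.
patch_features = torch.randn(1, 576, 1024)     # (batch, patches, vision_dim)
image_embeds = MLPProjector()(patch_features)  # (1, 576, 3072): spliced into the
                                               # LLM inputs where <image> appears

Per the table, pretraining keeps both the LLM and ViT frozen and trains only this projector; fine-tuning then unfreezes the LLM (and, for LLaVA-Phi-3-mini, the full ViT as well).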
Results
| Model | MMBench Test (EN) | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.5-7B | 66.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 |
| LLaVA-Llama-3-8B | 68.9 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 37.1 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 |
| LLaVA-Phi-3-mini | 69.2 | 41.4 | 70.0 | 69.3 | 73.7 | 49.8 | 87.3 | 61.5 | 57.8 | 1477/313 | 43.7 |
License
Citation
@misc{2023xtuner,
    title={XTuner: A Toolkit for Efficiently Fine-tuning LLM},
    author={XTuner Contributors},
    howpublished={\url{https://github.com/InternLM/xtuner}},
    year={2023}
}