L

Llava Phi 3 Mini Hf

Developed by xtuner
LLaVA model fine-tuned based on Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, supporting image-to-text tasks
Downloads 2,322
Release Time : 4/25/2024

Model Overview

LLaVA-Phi-3-mini is a vision-language model capable of understanding image content and generating relevant text descriptions, suitable for multimodal interaction scenarios.

Model Features

Efficient Fine-tuning
Efficient fine-tuning using XTuner tools, combining the strengths of Phi-3-mini and CLIP-ViT
Multimodal Capability
Capable of processing both visual and linguistic information to achieve image-to-text conversion
High Performance
Excellent performance in multiple benchmarks such as MMBench, MMMU, etc.

Model Capabilities

Image Understanding
Text Generation
Multimodal Interaction
Visual Question Answering

Use Cases

Education
Scientific Diagram Analysis
Analyze scientific diagrams and explain their content
For example, accurately identifying the lava section in a volcano structure diagram
Content Understanding
Image Caption Generation
Generate detailed text descriptions for images
For example, accurately describing a scene where two cats are sleeping on a sofa
Featured Recommended AI Models
AIbase
Empowering the Future, Your AI Solution Knowledge Base
Š 2025AIbase