
TinyLLaVA-Phi-2-SigLIP-3.1B

Developed by: tinyllava
TinyLLaVA-Phi-2-SigLIP-3.1B is a small-scale large multimodal model with 3.1B parameters. It combines the Phi-2 language model with the SigLIP vision encoder and outperforms some 7B models.
Downloads: 4,295
Release date: 5/15/2024

Model Overview

This is an image-text-to-text multimodal model: it accepts joint image and text inputs and generates corresponding text outputs.
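As a rough illustration, the snippet below sketches how inference with this checkpoint might look via Hugging Face transformers. It assumes the model repository ships remote code exposing a chat() helper, as the TinyLLaVA Factory examples do; the exact method names and signatures should be verified against the official model card.

```python
# Minimal inference sketch. Assumes the tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B
# repo ships remote code with a chat() helper (as in TinyLLaVA Factory usage
# examples); check the official model card before relying on these names.
from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = "tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B"

# trust_remote_code=True is required because the model class is defined
# in the repository itself rather than in the transformers library.
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
model.cuda()

tokenizer = AutoTokenizer.from_pretrained(hf_path, use_fast=False)

# Joint image + text input -> text output.
prompt = "What objects are in this image?"
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
output_text, generation_time = model.chat(
    prompt=prompt, image=image_url, tokenizer=tokenizer
)
print(output_text)
```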

Model Features

Efficient Performance
With only 3.1B parameters, the model outperforms some 7B models such as LLaVA-1.5 and Qwen-VL.
Multimodal Capability
Capable of processing both image and text inputs to generate coherent text outputs.
Modular Design
Built on the TinyLLaVA Factory codebase, which supports flexible replacement and extension of model components (see the sketch after this list).
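To make the modular design concrete, here is a hypothetical sketch of how such component composition could be expressed. The class and field names below are illustrative assumptions, not the TinyLLaVA Factory's actual API; they only show that the language model, vision tower, and connector are independent, swappable parts.

```python
# Hypothetical sketch of TinyLLaVA Factory-style modular composition.
# TinyLlavaRecipe and its fields are illustrative assumptions, not the
# factory's real API; consult the TinyLLaVA Factory repository for the
# actual configuration mechanism.
from dataclasses import dataclass

@dataclass
class TinyLlavaRecipe:
    llm: str            # small language-model backbone
    vision_tower: str   # vision encoder
    connector: str      # projector mapping vision features into LLM space

# The released 3.1B checkpoint pairs Phi-2 with a SigLIP encoder.
phi2_siglip = TinyLlavaRecipe(
    llm="microsoft/phi-2",
    vision_tower="google/siglip-so400m-patch14-384",
    connector="mlp2x_gelu",
)

# Swapping a component is a configuration change, not a code rewrite,
# e.g. a CLIP vision tower instead of SigLIP:
phi2_clip = TinyLlavaRecipe(
    llm="microsoft/phi-2",
    vision_tower="openai/clip-vit-large-patch14-336",
    connector="mlp2x_gelu",
)
```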

Model Capabilities

Image Understanding
Text Generation
Multimodal Reasoning
Visual Question Answering

Use Cases

Visual Question Answering
Image Content Q&A: answers questions about an input image. Achieves 80.1 accuracy on the VQAv2 benchmark.
Multimodal Dialogue
Image-guided Dialogue: conducts natural-language dialogue grounded in image content. Scores 37.5 on the MM-VET benchmark.