
LLaVA-Gemma-2b

Developed by Intel
LLaVA-Gemma-2b is a compact multimodal model trained with the LLaVA-v1.5 framework, pairing the 2-billion-parameter Gemma-2b-it language backbone with a CLIP visual encoder.
Downloads 1,503
Release date: 3/14/2024

Model Overview

This model is fine-tuned for multimodal benchmark evaluations and can serve as a multimodal chatbot, supporting interactions with both images and text.

Model Features

Compact and Efficient
Uses the 2-billion-parameter Gemma-2b-it as the language backbone, cutting compute and memory requirements while remaining competitive on multimodal benchmarks.
Multimodal Understanding
Pairs the language backbone with a CLIP visual encoder so the model can process both image and text inputs, enabling cross-modal understanding.
Fast Training
Training can be completed in just 4 hours on 8 Intel Gaudi 2 AI accelerators.

Model Capabilities

Image caption generation
Visual question answering
Multimodal dialogue
Text summarization
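
As a rough illustration of how a single chat turn for such a model might be assembled: the template below is an assumption based on Gemma-it's published chat format and the LLaVA convention of an `<image>` placeholder where the CLIP image features are spliced in; it is not confirmed by this card, so verify against the official model documentation.

```python
def build_prompt(question: str) -> str:
    """Assemble a single-turn multimodal prompt in Gemma-it chat style.

    Assumes (not confirmed by this card) that the LLaVA-style ``<image>``
    placeholder marks where the encoded image is inserted, and that the
    Gemma instruction-tuned turn delimiters apply unchanged.
    """
    return (
        "<start_of_turn>user\n"
        f"<image>\n{question}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = build_prompt("What is shown in this picture?")
print(prompt)
```

In practice this string would be passed, together with the pixel inputs, to the model's processor; the prompt ends with an open `model` turn so generation continues from there.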

Use Cases

Multimodal Chatbot
Image Content Q&A
Users upload an image and ask questions about it; the model generates descriptions and answers.
Achieves 70.7 accuracy on the VQAv2 benchmark.
Academic Research
Multimodal Model Research
Provides researchers with a compact model research platform to explore the balance between computational efficiency and multimodal understanding.