LLaVA-Gemma-7b

Developed by Intel
LLaVA-Gemma-7b is a large multimodal model trained with the LLaVA-v1.5 framework. It uses google/gemma-7b-it as the language backbone combined with a CLIP vision encoder, and is suited to multimodal understanding and generation tasks.
Released: 3/26/2024

Model Overview

This model is a large multimodal model (LMM) that accepts image and text inputs and generates text outputs, making it suitable for multimodal chatbots and for evaluation on multimodal benchmarks.
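
For concreteness, here is a minimal inference sketch, not the authors' official usage code. It assumes the checkpoint is published on Hugging Face under the id Intel/llava-gemma-7b and loads through transformers' LlavaForConditionalGeneration; the image URL is a placeholder.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: this Hugging Face id hosts the model described above.
checkpoint = "Intel/llava-gemma-7b"

# `device_map="auto"` requires the `accelerate` package.
model = LlavaForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(checkpoint)

# Build a Gemma-style chat prompt; the <image> token marks where the
# CLIP image features are spliced into the text sequence.
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nDescribe this image."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Placeholder image; any RGB image works here.
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```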

Model Features

Multimodal Understanding
Capable of processing both image and text inputs to understand the relationship between them
Efficient Training
Requires only 4 hours of training on 8 Intel Gaudi 2 AI accelerators
Compact Model
Based on the 7B-parameter Gemma model, reducing computational resource requirements while maintaining performance

Model Capabilities

Image Understanding
Text Generation
Multimodal Dialogue
Visual Question Answering

Use Cases

Multimodal Chatbot
Image Caption Generation
Generates descriptive text from an input image; the model achieves 68.7 accuracy on the VQAv2 benchmark.
Multimodal Dialogue
Engages in natural dialogue combining images and text; the model scores 18.2 on the MM-Vet benchmark.
Academic Research
Multimodal Model Research
Used to explore the trade-off between computational efficiency and multimodal understanding in small-scale models; two variants, LLaVA-Gemma-2b and LLaVA-Gemma-7b, are provided for comparative analysis (see the sketch after this list).
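
For the comparative analysis mentioned above, the following is a rough sketch. It assumes both variants are published under the Hugging Face ids Intel/llava-gemma-2b and Intel/llava-gemma-7b (an assumption, not confirmed by this page) and reuses the loading pattern from the earlier example.

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: both variants are available under these Hugging Face ids.
CHECKPOINTS = ["Intel/llava-gemma-2b", "Intel/llava-gemma-7b"]

def load_variant(checkpoint: str):
    """Load one LLaVA-Gemma variant; swap the id to compare 2b vs. 7b."""
    model = LlavaForConditionalGeneration.from_pretrained(checkpoint)
    processor = AutoProcessor.from_pretrained(checkpoint)
    return model, processor

for ckpt in CHECKPOINTS:
    model, processor = load_variant(ckpt)
    # Report total parameter count as a first point of comparison;
    # benchmark prompts would then be run through each model in turn.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{ckpt}: {n_params / 1e9:.1f}B parameters")
```

Running the same prompts through both variants in this loop is one straightforward way to study the efficiency-versus-capability trade-off the card describes.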