Gemma 3 for OpenArc
My project OpenArc, an inference engine for OpenVINO, now supports the Gemma 3 model and provides inference services over OpenAI-compatible endpoints for both text-to-text and text-with-vision tasks.
Quick Start
Model Compatibility
Gemma 3 is served over OpenArc's OpenAI-compatible endpoints for both text-to-text and text-with-vision (image-text-to-text) tasks. The release adding this support is scheduled for today or tomorrow.
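Once OpenArc is serving the model, any OpenAI-compatible client can talk to it. Below is a minimal sketch using the `openai` Python package; the base URL, API key, and model name are placeholders and depend on how your OpenArc server is configured.

```python
# Sketch: calling an OpenAI-compatible endpoint served by OpenArc.
# base_url, api_key, and the model name are assumptions -- adjust them
# to match your own OpenArc server configuration.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local image as a data URL for the vision request.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gemma-3-4b-it-int8_asym-ov",  # hypothetical model id registered with OpenArc
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```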
Community
We have a growing Discord community of users interested in using Intel hardware for AI/ML.

Installation
Convert to OpenVINO IR Format
This model was converted to the OpenVINO IR format using the following Optimum-CLI command:
```bash
optimum-cli export openvino -m "input-model" --task image-text-to-text --weight-format int8 "converted-model"
```
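If you prefer to convert from Python instead of the CLI, Optimum-Intel can export and compress weights at load time. The snippet below is a sketch of that path; the input/output paths are placeholders, and whether the asymmetric int8 settings match the CLI's defaults is an assumption worth verifying.

```python
# Sketch: export Gemma 3 to OpenVINO IR with int8 weight compression from Python.
# Paths are placeholders; quantization settings are assumptions, not the CLI's exact recipe.
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForVisualCausalLM
from transformers import AutoProcessor

model_id = "google/gemma-3-4b-it"
quant_config = OVWeightQuantizationConfig(bits=8, sym=False)  # int8, asymmetric

model = OVModelForVisualCausalLM.from_pretrained(
    model_id,
    export=True,  # convert to OpenVINO IR on load
    quantization_config=quant_config,
)
model.save_pretrained("converted-model")

# Save the processor alongside the IR files so the model directory is self-contained.
processor = AutoProcessor.from_pretrained(model_id)
processor.save_pretrained("converted-model")
```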
Install Dependencies
To run the test code, you need to:
- Install device-specific drivers (a quick device check sketch follows below)
- Build Optimum-Intel for OpenVINO from source
- Find your spiciest images to get that AGI refusal smell
```bash
# Install Optimum-Intel with the OpenVINO extras directly from source
pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
```
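After installing the drivers and Python dependencies, it can help to confirm which devices the OpenVINO runtime actually sees. This is a minimal check using the standard OpenVINO Python API; the device names printed depend on your hardware and drivers.

```python
# Quick sanity check: list the devices the OpenVINO runtime can enumerate.
# Expect entries such as "CPU", "GPU", or "NPU" depending on installed drivers.
import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)
for device in core.available_devices:
    # FULL_DEVICE_NAME is a standard OpenVINO property with a readable device string.
    print(device, "->", core.get_property(device, "FULL_DEVICE_NAME"))
```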
Usage Examples
Basic Usage
```python
import time

from PIL import Image
from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM

model_id = "Echo9Zulu/gemma-3-4b-it-int8_asym-ov"

# LATENCY hint tunes OpenVINO for single-request responsiveness.
ov_config = {"PERFORMANCE_HINT": "LATENCY"}

print("Loading model... this should get faster after the first generation due to caching behavior.")
print("")

# Load the pre-converted OpenVINO IR model (export=False) on the CPU device.
start_load_time = time.time()
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=False, device="CPU", ov_config=ov_config)
processor = AutoProcessor.from_pretrained(model_id)
end_load_time = time.time()

# Point this at a local image file.
image_path = r""
image = Image.open(image_path)
image = image.convert("RGB")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the prompt from the chat template, then tokenize text and image together.
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")

input_token_count = len(inputs.input_ids[0])
print(f"Sum of image and text tokens: {input_token_count}")

start_time = time.time()
output_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens so only newly generated tokens are decoded.
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

num_tokens_generated = len(generated_ids[0])
load_time = end_load_time - start_load_time
generation_time = time.time() - start_time
tokens_per_second = num_tokens_generated / generation_time
average_token_latency = generation_time / num_tokens_generated

print("\nPerformance Report:")
print("-" * 50)
print(f"Input Tokens      : {input_token_count:>9}")
print(f"Generated Tokens  : {num_tokens_generated:>9}")
print(f"Model Load Time   : {load_time:>9.2f} sec")
print(f"Generation Time   : {generation_time:>9.2f} sec")
print(f"Throughput        : {tokens_per_second:>9.2f} t/s")
print(f"Avg Latency/Token : {average_token_latency:>9.3f} sec")
print(output_text)
```
What the Test Code Does
The test code demonstrates how to run inference from Python and highlights the parts of the code that matter for benchmarking performance. Text-only generation presents different challenges than generation with images: vision encoders often use different strategies for handling an image's properties, which can lead to higher memory usage, reduced throughput, or poor results.
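For comparison, the same model can be driven without an image, which exercises only the language-model path. This is a sketch that assumes the `model` and `processor` objects from the example above are already loaded.

```python
# Text-only generation with the same model/processor, skipping the vision encoder.
# Assumes `model` and `processor` are already loaded as in the example above.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the benefits of int8 weight compression."},
        ],
    }
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```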
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Gemma 3 for OpenArc |
| Base Model | google/gemma-3-4b-it |
| Tags | OpenArc, OpenVINO, Optimum-Intel, image-text-to-text |
| License | Apache-2.0 |
License
This project is licensed under the Apache-2.0 license.