Gemma 3 for OpenArc
My project OpenArc, an inference engine for OpenVINO, now supports the Gemma 3 model and provides inference services over OpenAI-compatible endpoints for both text-to-text and text-with-vision tasks.
Quick Start
Model Compatibility
Gemma 3 is served over OpenArc's OpenAI-compatible endpoints for both text-to-text and text-with-vision (image-text-to-text) tasks. The release adding this support is scheduled for today or tomorrow.
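Once OpenArc is serving the model, any OpenAI-compatible client can talk to it. Below is a minimal sketch using the `openai` Python package; the base URL, API key, and model name are placeholders and depend on how your OpenArc server is configured.

```python
# Sketch: calling an OpenAI-compatible endpoint served by OpenArc.
# base_url, api_key, and the model name are assumptions -- adjust them
# to match your own OpenArc server configuration.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local image as a data URL for the vision request.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gemma-3-4b-it-int8_asym-ov",  # hypothetical model id registered with OpenArc
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```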
Community
We have a growing Discord community of users interested in using Intel hardware for AI/ML.

Installation
Convert to OpenVINO IR Format
This model was converted to the OpenVINO IR format using the following Optimum-CLI command:
```bash
optimum-cli export openvino -m "input-model" --task image-text-to-text --weight-format int8 "converted-model"
```
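If you prefer to convert from Python instead of the CLI, Optimum-Intel can export and compress weights at load time. The snippet below is a sketch of that path; the input/output paths are placeholders, and whether the asymmetric int8 settings match the CLI's defaults is an assumption worth verifying.

```python
# Sketch: export Gemma 3 to OpenVINO IR with int8 weight compression from Python.
# Paths are placeholders; quantization settings are assumptions, not the CLI's exact recipe.
from optimum.intel import OVWeightQuantizationConfig
from optimum.intel.openvino import OVModelForVisualCausalLM
from transformers import AutoProcessor

model_id = "google/gemma-3-4b-it"
quant_config = OVWeightQuantizationConfig(bits=8, sym=False)  # int8, asymmetric

model = OVModelForVisualCausalLM.from_pretrained(
    model_id,
    export=True,  # convert to OpenVINO IR on load
    quantization_config=quant_config,
)
model.save_pretrained("converted-model")

# Save the processor alongside the IR files so the model directory is self-contained.
processor = AutoProcessor.from_pretrained(model_id)
processor.save_pretrained("converted-model")
```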
Install Dependencies
To run the test code, you need to:
- Install device-specific drivers (a quick device check sketch follows below)
- Build Optimum-Intel for OpenVINO from source
- Find your spiciest images to get that AGI refusal smell
```bash
# Install Optimum-Intel with the OpenVINO extras directly from source
pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
```
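After installing the drivers and Python dependencies, it can help to confirm which devices the OpenVINO runtime actually sees. This is a minimal check using the standard OpenVINO Python API; the device names printed depend on your hardware and drivers.

```python
# Quick sanity check: list the devices the OpenVINO runtime can enumerate.
# Expect entries such as "CPU", "GPU", or "NPU" depending on installed drivers.
import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)
for device in core.available_devices:
    # FULL_DEVICE_NAME is a standard OpenVINO property with a readable device string.
    print(device, "->", core.get_property(device, "FULL_DEVICE_NAME"))
```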
Usage Examples
Basic Usage
```python
import time

from PIL import Image
from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM

model_id = "Echo9Zulu/gemma-3-4b-it-int8_asym-ov"

# LATENCY hint tunes OpenVINO for single-request responsiveness.
ov_config = {"PERFORMANCE_HINT": "LATENCY"}

print("Loading model... this should get faster after the first generation due to caching behavior.")
print("")

# Load the pre-converted OpenVINO IR model (export=False) on the CPU device.
start_load_time = time.time()
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=False, device="CPU", ov_config=ov_config)
processor = AutoProcessor.from_pretrained(model_id)
end_load_time = time.time()

# Point this at a local image file.
image_path = r""
image = Image.open(image_path)
image = image.convert("RGB")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the prompt from the chat template, then tokenize text and image together.
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")

input_token_count = len(inputs.input_ids[0])
print(f"Sum of image and text tokens: {input_token_count}")

start_time = time.time()
output_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens so only newly generated tokens are decoded.
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

num_tokens_generated = len(generated_ids[0])
load_time = end_load_time - start_load_time
generation_time = time.time() - start_time
tokens_per_second = num_tokens_generated / generation_time
average_token_latency = generation_time / num_tokens_generated

print("\nPerformance Report:")
print("-" * 50)
print(f"Input Tokens      : {input_token_count:>9}")
print(f"Generated Tokens  : {num_tokens_generated:>9}")
print(f"Model Load Time   : {load_time:>9.2f} sec")
print(f"Generation Time   : {generation_time:>9.2f} sec")
print(f"Throughput        : {tokens_per_second:>9.2f} t/s")
print(f"Avg Latency/Token : {average_token_latency:>9.3f} sec")
print(output_text)
```
What the Test Code Does
The test code demonstrates how to run inference from Python and highlights the parts of the code that matter for benchmarking performance. Text-only generation presents different challenges than generation with images: vision encoders often use different strategies for handling an image's properties, which can lead to higher memory usage, reduced throughput, or poor results.
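For comparison, the same model can be driven without an image, which exercises only the language-model path. This is a sketch that assumes the `model` and `processor` objects from the example above are already loaded.

```python
# Text-only generation with the same model/processor, skipping the vision encoder.
# Assumes `model` and `processor` are already loaded as in the example above.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the benefits of int8 weight compression."},
        ],
    }
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=[text_prompt], return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```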
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Gemma 3 for OpenArc |
| Base Model | google/gemma-3-4b-it |
| Tags | OpenArc, OpenVINO, Optimum-Intel, image-text-to-text |
| License | Apache-2.0 |
License
This project is licensed under the Apache-2.0 license.