Gemma 3 4B open model - optimized with OpenVINO, supports text and vision-text inference


Gemma 3 4b It Int8 Asym Ov

Developed by Echo9Zulu

An OpenVINO-optimized build of the Gemma 3 4B-parameter model that supports text-to-text and vision-text inference

Task: Image-to-Text
License: Apache-2.0
Tags: #multimodal-text-generation #intel-hardware-optimization #low-latency-inference

Downloads: 152

Release date: 4/12/2025

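Model Overview

This model is an OpenVINO-optimized build of Google's Gemma 3 4B-parameter model. It was converted to INT8 with Optimum-Intel and supports image-text-to-text multimodal inference.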

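Model Features

OpenVINO optimization

Optimized with the Intel OpenVINO toolkit to improve inference performance on Intel hardware

Multimodal support

Accepts image and text inputs together for vision-text inference

INT8 quantization

Uses asymmetric INT8 weight quantization, which reduces model size while largely preserving accuracy

Low-latency optimization

The example configuration sets the LATENCY performance hint to optimize first-token latency, which suits real-time applications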
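
To make the asymmetric INT8 idea concrete, here is a minimal sketch in plain Python. It is an illustration only: it quantizes one flat list of weights with a single scale and zero point, whereas real toolchains typically work per channel or per group, and the function names `quantize_asym_int8` and `dequantize` are hypothetical, not part of OpenVINO or Optimum-Intel.

```python
def quantize_asym_int8(weights):
    """Map floats onto unsigned 8-bit codes (0..255) using a scale and a zero point."""
    w_min, w_max = min(weights), max(weights)
    # A constant tensor has zero range; fall back to a scale of 1.0 in that case.
    scale = ((w_max - w_min) / 255) or 1.0
    # The zero point is the integer code that represents the real value 0.0.
    zero_point = round(-w_min / scale)
    codes = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate float values from the 8-bit codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, 0.0, 0.5, 3.0]
codes, scale, zero_point = quantize_asym_int8(weights)
restored = dequantize(codes, scale, zero_point)
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Because the range is not forced to be symmetric around zero, all 256 codes are spent on the interval the weights actually occupy, which is the usual motivation for the asymmetric variant.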

Model Capabilities

Text generation

Image captioning

Multimodal inference

Dialogue systems

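Use Cases

Content generation

Image captioning

Generate a detailed description from an input image

Can produce text that accurately reflects the image content

Intelligent assistants

Visual question answering

Answer natural-language questions about the content of an image

Can interpret the image content and give relevant answers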

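🚀 Gemma 3 for OpenArc has landed!

My project OpenArc, an inference engine for OpenVINO, now supports this model and serves inference over OpenAI-compatible endpoints for both text-to-text and text-with-vision tasks! The release comes out today or tomorrow.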
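
As a rough illustration of what a text-with-vision request to an OpenAI-compatible endpoint usually looks like, the sketch below builds a chat-completions style payload with the image embedded as a base64 data URL. This follows the general OpenAI message format only. OpenArc's actual route, port, authentication, and model naming are not stated in this card, so the model name passed in and the helper `build_vision_request` are assumptions, and the code deliberately stops short of sending anything.

```python
import base64
import json


def build_vision_request(model, prompt, image_bytes, mime="image/png", max_tokens=256):
    """Build an OpenAI-style chat payload carrying one text part and one image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "model": model,  # Assumed name; use whatever the server reports for the loaded model
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real image file read with open(path, "rb").read()
payload = build_vision_request("gemma-3-4b-it-int8_asym-ov", "Describe this image.", b"placeholder image bytes")
body = json.dumps(payload)
print(len(body), "bytes of JSON ready to POST")
```

The resulting `body` can be sent with any HTTP client to the server's chat-completions route once the host and port are known.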

We have a growing Discord community of people interested in using Intel hardware for AI/ML.

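📦 Installation Guide

This model was converted to the OpenVINO IR format with the following Optimum-CLI command:

optimum-cli export openvino -m "input-model" --task image-text-to-text --weight-format int8 "converted-model"

Documentation for the Optimum-CLI export process can be found here.
You can use my HF space Echo9Zulu/Optimum-CLI-Tool_tool to build commands and run them locally.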
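
If you would rather script the export than type it by hand, a small wrapper along these lines may help. It only assembles the same arguments shown in the command above; the helper name `build_export_command` is hypothetical, and the commented `subprocess.run` call assumes `optimum-cli` is installed and on your PATH.

```python
import shlex


def build_export_command(input_model, output_dir, weight_format="int8", task="image-text-to-text"):
    """Assemble the optimum-cli export arguments as a list suitable for subprocess.run."""
    return [
        "optimum-cli", "export", "openvino",
        "-m", input_model,
        "--task", task,
        "--weight-format", weight_format,
        output_dir,
    ]


cmd = build_export_command("input-model", "converted-model")
print(shlex.join(cmd))

# To actually run the conversion (requires optimum-cli to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the model path contains spaces.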

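To run the test code, complete the following steps:

Install the drivers for your specific device
Build Optimum-Intel for OpenVINO from source
Prepare a few high-quality images

pip install "optimum-intel[openvino]@git+https://github.com/huggingface/optimum-intel"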

💻 Usage Examples

Basic Usage

import time
from PIL import Image
from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM


model_id = "Echo9Zulu/gemma-3-4b-it-int8_asym-ov" # Can be an HF id or a path

ov_config = {"PERFORMANCE_HINT": "LATENCY"} # Optimizes for first token latency and locks to single CPU socket

print("Loading model... this should get faster after the first generation due to caching behavior.")
print("")
start_load_time = time.time()
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=False, device="CPU", ov_config=ov_config) # For GPU use "GPU.0"
processor = AutoProcessor.from_pretrained(model_id) # AutoProcessor (rather than AutoTokenizer) routes to the model-specific input processor, e.g. how the model expects image tokens.
                                                    # It handles model-specific preprocessing and overlaps in functionality with AutoTokenizer.
end_load_time = time.time()

image_path = r"" # Set this to a local image path; this script expects a .png
image = Image.open(image_path)
image = image.convert("RGB") # Required by gemma3. In practice this would need to be handled at the engine level OR in model-specific pre-processing.

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image"
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")

input_token_count = len(inputs.input_ids[0])
print(f"Sum of image and text tokens: {input_token_count}")

start_time = time.time()
output_ids = model.generate(**inputs, max_new_tokens=1024)
generation_time = time.time() - start_time # Stop the clock before decoding so decoding is not counted as generation

# Drop the prompt tokens so that only newly generated tokens are decoded
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

num_tokens_generated = len(generated_ids[0])
load_time = end_load_time - start_load_time
tokens_per_second = num_tokens_generated / generation_time
average_token_latency = generation_time / max(num_tokens_generated, 1) # Avoid division by zero if nothing was generated

print("\nPerformance Report:")
print("-"*50)
print(f"Input Tokens        : {input_token_count:>9}")
print(f"Generated Tokens    : {num_tokens_generated:>9}")
print(f"Model Load Time     : {load_time:>9.2f} sec")
print(f"Generation Time     : {generation_time:>9.2f} sec")
print(f"Throughput          : {tokens_per_second:>9.2f} t/s")
print(f"Avg Latency/Token   : {average_token_latency:>9.3f} sec")

print(output_text)
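
The two bookkeeping steps in the script, trimming the prompt tokens off each output sequence and turning a token count and elapsed time into throughput figures, can be tried without loading the model. The sketch below uses plain Python lists in place of tensors; `strip_prompt` and `throughput_report` are hypothetical helper names, not part of Optimum-Intel or Transformers.

```python
def strip_prompt(input_ids, output_ids):
    """Remove the echoed prompt from each output sequence, keeping only new tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


def throughput_report(num_tokens, seconds):
    """Return tokens per second and seconds per token, guarding against empty runs."""
    if num_tokens <= 0 or seconds <= 0:
        return {"tokens_per_second": 0.0, "seconds_per_token": 0.0}
    return {
        "tokens_per_second": num_tokens / seconds,
        "seconds_per_token": seconds / num_tokens,
    }


# Toy batch: a 3-token prompt followed by 4 generated tokens
prompt_batch = [[101, 7, 8]]
output_batch = [[101, 7, 8, 40, 41, 42, 43]]

new_tokens = strip_prompt(prompt_batch, output_batch)
report = throughput_report(len(new_tokens[0]), seconds=2.0)
print(new_tokens, report)
```

This works because `generate` returns the prompt followed by the continuation for each sequence, so slicing at the prompt length leaves only the generated part.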