Gemma 3 4B Open-Source Model - Optimized with OpenVINO, Supports Text and Vision-Text Inference

Gemma 3 4b It Int8 Asym Ov

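Developed by Echo9Zulu

A 4B-parameter Gemma 3 model optimized with OpenVINO, supporting text-to-text and vision-text inference

Image-to-Text | License: Apache-2.0 | #Multimodal text generation #Intel hardware optimization #Low-latency inference

Downloads: 152

Released: 4/12/2025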

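Model Overview

This model is an OpenVINO-optimized build of the 4B-parameter version of Google Gemma 3, converted to INT8 format with Optimum-Intel. It supports multimodal image-text-to-text inference.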

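Model Features

OpenVINO Optimization

Optimized with the Intel OpenVINO toolkit for better inference performance on Intel hardware

Multimodal Support

Accepts image and text inputs together for vision-text inference

INT8 Quantization

Uses asymmetric INT8 weight quantization to reduce model size while preserving accuracy

Low-Latency Optimization

Tuned specifically for first-token latency, which suits real-time applications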
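
To make the asymmetric INT8 idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the code that Optimum-Intel or OpenVINO runs: the helper names `quantize_asym_int8` and `dequantize` are hypothetical, and it uses one scale and zero point for the whole list, whereas real toolkits typically work per channel or per group. The point it shows is that an asymmetric scheme stores a zero point alongside the scale, so a value range that is not centered on zero still uses all 256 codes.

```python
def quantize_asym_int8(values):
    """Map floats to unsigned 8-bit codes using a scale and a zero point."""
    lo, hi = min(values), max(values)
    # Guard against a constant input, where the range would be zero.
    scale = (hi - lo) / 255 or 1.0
    # The zero point is the integer code that represents the float 0.0.
    zero_point = round(-lo / scale)
    codes = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from the 8-bit codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights with a range that is not symmetric around zero.
weights = [-0.2, 0.0, 0.35, 1.0]
codes, scale, zero_point = quantize_asym_int8(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, zero_point, max_error)
```

For these sample weights the round-trip error stays below one quantization step (`scale`), and 0.0 is reproduced exactly because it maps onto the zero point.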

Model Capabilities

Text generation

Image captioning

Multimodal inference

Conversational systems

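Use Cases

Content Generation

Image captioning

Generates a detailed description from an input image

Produces text that accurately reflects the image content

Intelligent Assistants

Visual question answering

Answers natural-language questions about image content

Understands the image and gives relevant answers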

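🚀 Gemma 3 for OpenArc is here!

The OpenArc project, an inference engine for OpenVINO, now supports this model and serves inference through OpenAI-compatible endpoints for both text-to-text and text-with-vision tasks! The release will be out today or tomorrow.

We have a growing Discord community whose members are interested in using Intel technology for AI/ML.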
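
As a rough illustration of what a request to an OpenAI-compatible endpoint might look like for a text-with-vision task, the sketch below builds a chat-completion body in the widely used OpenAI message format, with the image inlined as a base64 data URL. The helper `build_vision_request` and the model name are hypothetical. Whether OpenArc accepts exactly this shape, and at which URL, is an assumption to check against the OpenArc documentation.

```python
import base64
import json


def build_vision_request(model, prompt, image_bytes, max_tokens=256):
    """Build an OpenAI-style chat completion body with one inline PNG image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{encoded}"},
                    },
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real PNG file read from disk.
payload = build_vision_request(
    model="gemma-3-4b-it-int8_asym-ov",  # assumed name; use whatever the server registers
    prompt="Describe this image.",
    image_bytes=b"\x89PNG placeholder",
)
body = json.dumps(payload)
print(body[:120])
```

The resulting `body` string would be sent as the JSON payload of a POST request to the server's chat completions route.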

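📦 Installation

This model was converted to the OpenVINO IR format with the following Optimum-CLI command:

optimum-cli export openvino -m "input-model" --task image-text-to-text --weight-format int8 "converted-model"

See the Optimum-Intel documentation for details on the Optimum-CLI export process.
You can use my HF space, Echo9Zulu/Optimum-CLI-Tool_tool, to build commands and then run them locally.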
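
If you would rather assemble the export command in a script, a small sketch follows. It only rebuilds the command line shown above from its parts. The function `build_export_command` is a hypothetical helper, not part of Optimum-Intel or of the HF space, and the model and output names are the same placeholders used in the command.

```python
import shlex


def build_export_command(input_model, output_dir, task="image-text-to-text", weight_format="int8"):
    """Return the optimum-cli export invocation as an argument list."""
    return [
        "optimum-cli", "export", "openvino",
        "-m", input_model,
        "--task", task,
        "--weight-format", weight_format,
        output_dir,
    ]


argv = build_export_command("input-model", "converted-model")
command_line = shlex.join(argv)
print(command_line)
# To execute it: subprocess.run(argv, check=True), with optimum-cli installed.
```

Passing the argument list to `subprocess.run` avoids shell quoting problems when a model path contains spaces.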

To run the test code, complete the following steps:

Install the device-specific drivers
Build Optimum-Intel for OpenVINO from source
Prepare a few high-quality images

pip install "optimum-intel[openvino] @ git+https://github.com/huggingface/optimum-intel"
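
After installing drivers and packages, it can help to confirm which devices OpenVINO actually sees before loading the model. The sketch below is one way to do that, assuming the `openvino` Python package is importable. `list_openvino_devices` is a hypothetical helper that returns an empty list when the package is missing, rather than raising.

```python
def list_openvino_devices():
    """Return the device names OpenVINO can see, or an empty list if it is not installed."""
    try:
        import openvino as ov
    except ImportError:
        return []
    return list(ov.Core().available_devices)


devices = list_openvino_devices()
print("OpenVINO devices:", devices or "none found (is openvino installed?)")
```

A name from this list, such as "CPU" or "GPU.0", is what the `device` argument in the usage example expects.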

💻 Usage Example

Basic Usage

import time
from PIL import Image
from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM


model_id = "Echo9Zulu/gemma-3-4b-it-int8_asym-ov" # Can be an HF id or a path

ov_config = {"PERFORMANCE_HINT": "LATENCY"} # Optimizes for first token latency and locks to single CPU socket

print("Loading model... this should get faster after the first generation due to caching behavior.")
print("")
start_load_time = time.time()
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=False, device="CPU", ov_config=ov_config) # For GPU use "GPU.0"
# AutoProcessor (rather than AutoTokenizer) routes to the model's input processor, e.g. how the model expects image tokens.
# Under the hood it takes care of model-specific preprocessing and overlaps in functionality with AutoTokenizer.
processor = AutoProcessor.from_pretrained(model_id)
end_load_time = time.time()

image_path = r"" # Set this to the path of your image; this script expects .png
image = Image.open(image_path)
image = image.convert("RGB") # Required by gemma3. In practice this would need to be handled at the engine level OR in model-specific pre-processing.

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image"
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")

input_token_count = len(inputs.input_ids[0])
print(f"Sum of image and text tokens: {input_token_count}")

start_time = time.time()
output_ids = model.generate(**inputs, max_new_tokens=1024)
generation_time = time.time() - start_time # Stop the clock before decoding so only generation is timed

# Drop the prompt tokens so that only newly generated tokens are decoded
generated_ids = [output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

num_tokens_generated = len(generated_ids[0])
load_time = end_load_time - start_load_time
tokens_per_second = num_tokens_generated / generation_time
average_token_latency = generation_time / max(num_tokens_generated, 1) # Avoid division by zero if nothing was generated

print("\nPerformance Report:")
print("-"*50)
print(f"Input Tokens        : {input_token_count:>9}")
print(f"Generated Tokens    : {num_tokens_generated:>9}")
print(f"Model Load Time     : {load_time:>9.2f} sec")
print(f"Generation Time     : {generation_time:>9.2f} sec")
print(f"Throughput          : {tokens_per_second:>9.2f} t/s")
print(f"Avg Latency/Token   : {average_token_latency:>9.3f} sec")

print(output_text)
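
The report above averages latency over all generated tokens, which hides the first-token latency that the LATENCY hint targets. As a sketch of how the two could be separated, the helper below times any iterator that yields tokens as they are produced. `measure_stream` and `fake_stream` are hypothetical names, and the fake stream simply sleeps instead of running a model. Wiring in a real streaming interface (for example a streamer object from transformers) is left as an assumption, since this model card does not show one.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token iterator and split timing into first-token and later-token phases."""
    start = clock()
    first_token_time = None
    count = 0
    for _ in token_iter:
        count += 1
        if first_token_time is None:
            first_token_time = clock() - start
    total = clock() - start
    # Average latency of the tokens that follow the first one.
    later = (total - first_token_time) / (count - 1) if count > 1 else 0.0
    return {"tokens": count, "ttft": first_token_time, "per_token_after_first": later, "total": total}


def fake_stream(n, delay=0.001):
    """Hypothetical stand-in for a real streamer; sleeps instead of running a model."""
    for i in range(n):
        time.sleep(delay)
        yield i


stats = measure_stream(fake_stream(5))
print(stats)
```

With a real model, `ttft` would include prompt and image processing, while `per_token_after_first` reflects steady-state generation speed.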