Gemma-3-4b-it-speech: a free, open-source multimodal model that takes audio, text, and image input and generates text

Gemma 3 4b It Speech

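Developed by junnei

Gemma-3-MM is a multimodal instruction model built on Gemma-3-4b-it. It adds speech processing, so it accepts text, image, and audio input and generates text output.

Audio-to-text

Transformers

#multimodal speech recognition #English-Korean speech translation #short audio processing

Downloads: 383

Released: 3/22/2025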

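Model overview

An open-source multimodal instruction model that extends Gemma-3 with speech processing and supports speech recognition and speech translation for English and Korean.

Model features

Multimodal processing

Accepts text, image, and audio input together and generates text output

Long context support

Supports a 128K-token context length (32K for the 1B model)

Speech adapter

Adds speech processing through a 596B-parameter LoRA adapter

Multilingual support

Supports speech recognition and translation for English and Korean

Model capabilities

Text generation

Speech recognition

Speech translation

Multimodal understanding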

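Use cases

Speech transcription

English speech transcription

Converts English speech to text

Scores 94.28 BLEU on the LibriSpeech-Clean test set

Korean speech transcription

Converts Korean speech to text

Scores 94.91 BLEU on the Zeroth test set

Speech translation

English-to-Korean translation

Translates English speech into Korean text

Scores 31.55 BLEU on the Covost2 test set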

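🚀 Gemma 3 MM model card

Gemma-3-MM is an open-source multimodal instruction model that extends the original Gemma-3 models with speech processing. It builds on the language and vision research behind the original Gemma-3 models and adds speech capability through a speech adapter. The model accepts text, image, and audio input, generates text output, and has a 128K-token context length (32K for the 1B model).

🚀 Quick start

Environment setup

First, upgrade your Transformers library. Audio input is now supported in the chat template.

$ pip install -U transformers

Running the model

Depending on your use case, you can run the model in either of the following two ways: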

Run the model with the chat template

from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"  # or "korean"

# trust_remote_code=True lets Transformers load the custom model code shipped in the repository
model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

# One user turn: an audio clip followed by a text instruction
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]

# Tokenize the conversation and move the tensors to the model's device
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
print(response)

# What is said in this audio?
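
The message structure used with the chat template has a fixed shape: one user turn whose content lists an audio item and then a text instruction. As a minimal sketch, a small helper can build it for different tasks. The name `build_audio_message` is hypothetical and not part of the model repository or of Transformers; the instruction strings are only examples.

```python
def build_audio_message(audio, instruction):
    """Return a one-turn chat message list pairing an audio clip with a text instruction.

    `audio` is whatever the processor accepts for the "audio" field
    (the example above passes a URL string).
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio},
                {"type": "text", "text": instruction},
            ],
        }
    ]


# Same clip, two different tasks
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
asr_messages = build_audio_message(url, "Transcribe this audio clip into text.")
ast_messages = build_audio_message(url, "Translate this audio clip into Korean.")
```

Either list can then be passed to `processor.apply_chat_template` in place of the hand-written `messages`.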

Run the model on raw data

# Reuses `model` and `processor` loaded in the chat-template example
from io import BytesIO
from urllib.request import urlopen
import soundfile
import torch

# Fetch the audio data from a URL and decode it into a sample array
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))
audio_token = '<start_of_audio>'

# The audio placeholder token goes in front of the text instruction
messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio clip into Korean.'}
]

# Render the prompt as a string; the processor tokenizes it together with the audio
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Drop the prompt tokens so only the newly generated text is decoded
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
print(response)

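Fine-tuning

A fine-tuning script is available: link

You must change the output directory and the upload directory, and adapt the script to your dataset.

python finetune_speech.py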

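✨ Key features

Multimodal processing: accepts text, image, and audio input and generates text output.
Speech processing: adds speech capability on top of the original Gemma-3 model.
Long context: 128K-token context length (32K for the 1B model).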

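📚 Documentation

Evaluation

Model evaluation metrics and results. An evaluation script is available for evaluating the model.

ASR results

Benchmark	Task	BLEU ↑	CER ↓	WER ↓	Results
Covost2	ASR (English)	86.09	4.12	7.83	Link
Fleurs	ASR (English)	89.61	2.28	5.23	Link
LibriSpeech-Clean	ASR (English)	94.28	0.98	2.91	Link
LibriSpeech-Other	ASR (English)	87.60	3.10	6.55	Link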
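
For readers unfamiliar with the CER and WER columns: both are edit-distance error rates, computed over characters and words respectively and reported here as percentages. The sketch below shows the standard definition in plain Python. It is not the evaluation script referenced above, and that script may tokenize or normalize text differently; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, ignoring spaces."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Lower is better for both metrics, which is why the table marks them with a down arrow.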

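AST results

Benchmark	Task	BLEU ↑	Results
Covost2	AST (zero-shot, English to Korean)	31.55	Link
Fleurs	AST (zero-shot, English to Korean)	11.05	Link

(Experimental) ASR: Korean branch

Scores are lower because no Korean normalizer was applied.

Benchmark	Task	BLEU ↑	CER ↓	WER ↓	Results
Zeroth	ASR (Korean)	94.91	1.31	2.50	Link
Fleurs	ASR (Korean)	62.83	9.08	23.0	Link
Covost2	ASR (Korean)	43.66	22.5	41.4	Link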
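
To illustrate why a missing normalizer lowers scores: without normalization, differences in punctuation, casing, or Unicode form between reference and hypothesis count as errors. The following is a generic text normalizer sketched under that assumption. It is not a Korean-specific normalizer and is not taken from the evaluation script; `normalize_text` is a hypothetical name.

```python
import unicodedata


def normalize_text(text):
    """Generic pre-scoring cleanup: Unicode NFKC, lowercase, strip punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Keep letters (including Hangul syllables), digits, and whitespace; replace the rest with spaces
    text = "".join(ch if (ch.isalnum() or ch.isspace()) else " " for ch in text)
    return " ".join(text.split())


reference = normalize_text("안녕하세요, 세계!")
hypothesis = normalize_text("안녕하세요 세계")
print(reference == hypothesis)
```

Applying the same function to both reference and hypothesis before computing CER or WER removes mismatches that do not reflect recognition errors.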

Model details

Developer: junnei
Model type: multimodal (text, vision, speech) language model
Supported languages: multilingual
License: Gemma
Base model: google/gemma-3-4b-it
Inspired by: Phi-4-multimodal-instruct

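Training details

Training method: a 596B-parameter speech LoRA adapter was added to the base Gemma-3-4b-it model.
Compute constraints: because of limited compute, the model was trained only on ASR (automatic speech recognition) and AST (automatic speech translation) tasks, on a limited dataset for a limited number of epochs, using a single A100 GPU.
Data constraints: the training data covers only English and Korean, with audio clips shorter than 30 seconds.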

Datasets

ASR / AST

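Limitations

Note that this model is for experimental purposes only. It is a proof of concept (PoC) and is not suitable for production use. Improving its performance and reliability requires further work in the following areas:

More compute: extended training needs more computational resources.
Task scope: the model currently supports only vision-language tasks and audio-language tasks (ASR/AST).
Audio duration: because of limited compute, the model mainly recognizes audio files shorter than 30 seconds, so accuracy may drop significantly on longer audio input.
Future plans: if possible, the model will be trained to support speech-vision tasks and more audio-language tasks.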
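
Given the 30-second limit noted above, one possible workaround is to split longer recordings into chunks of at most 30 seconds and transcribe each chunk separately. The sketch below is an assumption on my part, not something the model card recommends: it cuts at fixed positions and may split a word in two. `split_audio` is a hypothetical helper that works on any sliceable 1-D sample sequence, such as the array returned by `soundfile.read` for a mono file.

```python
def split_audio(samples, sample_rate, max_seconds=30.0):
    """Split a 1-D sample sequence into consecutive chunks of at most `max_seconds` each."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len < 1:
        raise ValueError("max_seconds is too short for this sample rate")
    return [samples[start:start + chunk_len]
            for start in range(0, len(samples), chunk_len)]


# 70 seconds of silence at 16 kHz splits into chunks of 30 s, 30 s, and 10 s
chunks = split_audio([0.0] * (16000 * 70), sample_rate=16000)
print([len(c) / 16000 for c in chunks])
```

Each chunk can then be passed as the audio input in place of the full recording, and the per-chunk transcripts joined afterwards.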

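🔧 Technical details

The model adds speech processing to the original Gemma-3 model through a speech adapter.
For training, a 596B-parameter speech LoRA adapter was added to the base Gemma-3-4b-it model.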
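
As background on what a LoRA adapter does, the general LoRA formulation is summarized below. This is the standard technique, not a description of this model's specific configuration: the card does not state the rank $r$, the scaling factor $\alpha$, or which weight matrices are adapted, so those symbols are placeholders.

```latex
% LoRA keeps the pretrained weight W_0 frozen and learns a low-rank update.
% W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
\begin{aligned}
h &= W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r}\, B A x \\
\text{trainable parameters per adapted matrix} &= r\,(d + k) \quad \text{instead of } d\,k
\end{aligned}
```

Only $A$ and $B$ receive gradient updates during training, which is what makes adapter training feasible on limited hardware.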

📄 License

This model is released under the Gemma license.

📖 Citation

@article{gemma3mm_2025,
    title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
    author={Seongjun Jang},
    year={2025}
}
