---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---
# Gemma 3 MM model card
Terms of Use: Terms
## Model Summary
Gemma-3-MM is a family of open multimodal instruction models that extends the capabilities of the original Gemma-3 models to include speech processing. These models leverage the language and vision research behind the original Gemma-3 models and add speech processing capabilities through a Speech Adapter.

The models can process text, image, and audio inputs, generate text outputs, and come with a 128K-token context length (32K for the 1B model).
## Evaluation

Model evaluation metrics and results.

Here is the script used to evaluate the model: Script
### AST

| Benchmark | Task | BLEU ↑ | Result |
|---|---|---|---|
| Covost2 | AST (0-shot, English-Korean) | 31.55 | Link |
| Fleurs | AST (0-shot, English-Korean) | 11.05 | Link |

Note: the Fleurs score is lower because the Korean normalizer is not applied.
### ASR

| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
|---|---|---|---|---|---|
| Zeroth | ASR (Korean) | 94.91 | 1.31 | 2.50 | Link |
| Fleurs | ASR (Korean) | 62.83 | 9.08 | 23.0 | Link |
| Covost2 | ASR (Korean) | 43.66 | 22.5 | 41.4 | Link |
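For a rough sense of how these metrics can be computed from model transcripts, here is a minimal sketch assuming the `jiwer` and `sacrebleu` packages; the linked evaluation script is the authoritative reference and may normalize text differently.

```python
# Minimal metric sketch (assumed tooling: jiwer for WER/CER, sacrebleu for BLEU).
# The actual evaluation script linked above may apply different text normalization.
import jiwer
import sacrebleu

references = ["안녕하세요 만나서 반갑습니다"]   # ground-truth transcripts / translations
hypotheses = ["안녕하세요 만나서 반갑 습니다"]  # model outputs

wer = jiwer.wer(references, hypotheses)   # word error rate (lower is better)
cer = jiwer.cer(references, hypotheses)   # character error rate (lower is better)
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score  # BLEU (higher is better)

print(f"WER: {wer * 100:.2f}  CER: {cer * 100:.2f}  BLEU: {bleu:.2f}")
```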
## Model Details

- **Developed by:** junnei
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** Gemma
- **Base model:** google/gemma-3-4b-it
- **Inspiration:** Phi-4-multimodal-instruct
## Training Details

- The model was trained by adding a 596B parameter Speech LoRA adapter to the base Gemma-3-4b-it model.
- Due to limited computational resources, the model was trained for only a limited number of epochs on limited datasets, covering ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU.
- The training data was limited to English and Korean audio clips of less than 30 seconds in duration (see the duration-filter sketch below).
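A minimal sketch of the kind of duration filter described above, assuming the 🤗 `datasets` library; the dataset config and column names are assumptions, and this is not the exact training pipeline.

```python
# Illustrative 30-second duration filter (assumptions: an "audio" column and the
# "en_ko" config name for the junnei/covost2 dataset; not the exact training code).
from datasets import Audio, load_dataset

MAX_SECONDS = 30.0

ds = load_dataset("junnei/covost2", "en_ko", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def is_short_enough(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] < MAX_SECONDS

ds = ds.filter(is_short_enough)
```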
### Datasets
ASR / AST
## Limitations

Note that this model is just a Proof of Concept (PoC) for experimental purposes and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:

- More computational resources are needed for extended training.
- For now, the model only works for Vision-Language tasks and Audio-Language tasks (ASR/AST).
- Due to the lack of computing resources, the model primarily recognizes audio files of less than 30 seconds in duration. As a result, accuracy may drop significantly for longer audio inputs; a chunking workaround is sketched after this list.
- If possible, we will train the model for Speech-Vision tasks and more Audio-Language tasks.
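Until longer inputs are supported, one rough workaround is to split a long recording into sub-30-second chunks and run each chunk through the snippets in the Usage section below. This is only a sketch; chunk boundaries may cut words, so expect some degradation.

```python
# Sketch: split a long recording into <30 s chunks. "long_recording.wav" is a
# hypothetical local file. Each chunk can then be transcribed separately with the
# Usage snippets below and the transcripts concatenated.
import soundfile as sf

CHUNK_SECONDS = 30

audio, sr = sf.read("long_recording.wav")
chunk_len = CHUNK_SECONDS * sr
chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
```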
## Usage

Below are some code snippets to get started quickly with running the model.

First, upgrade your Transformers library; `AudioInput` for `chat_template` is now supported:

```bash
$ pip install -U transformers
```

Then, copy the snippet from the section that is relevant to your use case.
### Running the model with chat_template
```python
from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"

model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Drop the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
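Since the model also accepts image inputs, the same chat template can, in principle, carry an image turn. This is only a sketch: it assumes this checkpoint's custom processor handles `"image"` entries the same way the base Gemma-3 processor does.

```python
# Sketch: vision input through the same chat template (assumption: the custom
# processor accepts "image" entries like the base Gemma-3 processor).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)
# Generation then proceeds exactly as in the audio example above.
```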
### Running the model with raw data
```python
from io import BytesIO
from urllib.request import urlopen

import soundfile
from PIL import Image  # only needed if you also pass image inputs

# Reuses `model` and `processor` loaded in the previous snippet.
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))

audio_token = '<start_of_audio>'
messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Drop the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
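The same raw-data path handles transcription instead of translation by changing the instruction after the audio token, reusing the prompt from the chat_template example above:

```python
# ASR instead of AST: same pipeline, different instruction after the audio token.
messages = [
    {'role': 'user', 'content': audio_token + 'Transcribe this audio clip into text.'},
]
```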
### Fine-tune the model

Here is the fine-tuning script: Link

You must change `output_dir` and `upload_dir`, and adapt the script to your datasets.

```bash
python finetune_speech.py
```
## Citation

```bibtex
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}
```