---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---
# Gemma 3 MM model card
Terms of Use: Terms
## Model Summary
Gemma-3-MM is a family of open multimodal instruction models that extends the capabilities of the original Gemma-3 models to include speech processing. These models leverage the language and vision research behind the original Gemma-3 models and add speech processing capabilities through a Speech Adapter.

The models can process text, image, and audio inputs, generate text outputs, and come with a 128K-token context length (32K for the 1B model).
## Evaluation

Model evaluation metrics and results.

Here is the script used to evaluate the model: Script
### AST

| Benchmark | Task | BLEU ↑ | Result |
|---|---|---|---|
| Covost2 | AST (0-shot, English-Korean) | 31.55 | Link |
| Fleurs | AST (0-shot, English-Korean) | 11.05 | Link |

Note: the Fleurs score is lower because the Korean normalizer is not applied.
### ASR

| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
|---|---|---|---|---|---|
| Zeroth | ASR (Korean) | 94.91 | 1.31 | 2.50 | Link |
| Fleurs | ASR (Korean) | 62.83 | 9.08 | 23.0 | Link |
| Covost2 | ASR (Korean) | 43.66 | 22.5 | 41.4 | Link |
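For a rough sense of how these metrics can be computed from model transcripts, here is a minimal sketch assuming the `jiwer` and `sacrebleu` packages; the linked evaluation script is the authoritative reference and may normalize text differently.

```python
# Minimal metric sketch (assumed tooling: jiwer for WER/CER, sacrebleu for BLEU).
# The actual evaluation script linked above may apply different text normalization.
import jiwer
import sacrebleu

references = ["안녕하세요 만나서 반갑습니다"]   # ground-truth transcripts / translations
hypotheses = ["안녕하세요 만나서 반갑 습니다"]  # model outputs

wer = jiwer.wer(references, hypotheses)   # word error rate (lower is better)
cer = jiwer.cer(references, hypotheses)   # character error rate (lower is better)
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score  # BLEU (higher is better)

print(f"WER: {wer * 100:.2f}  CER: {cer * 100:.2f}  BLEU: {bleu:.2f}")
```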
## Model Details

- **Developed by:** junnei
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** Gemma
- **Base model:** google/gemma-3-4b-it
- **Inspiration:** Phi-4-multimodal-instruct
## Training Details

- The model was trained by adding a 596B parameter Speech LoRA adapter to the base Gemma-3-4b-it model.
- Due to limited computational resources, the model was trained for only a limited number of epochs on limited datasets, covering ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU.
- The training data was limited to English and Korean audio clips of less than 30 seconds in duration (see the duration-filter sketch below).
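A minimal sketch of the kind of duration filter described above, assuming the 🤗 `datasets` library; the dataset config and column names are assumptions, and this is not the exact training pipeline.

```python
# Illustrative 30-second duration filter (assumptions: an "audio" column and the
# "en_ko" config name for the junnei/covost2 dataset; not the exact training code).
from datasets import Audio, load_dataset

MAX_SECONDS = 30.0

ds = load_dataset("junnei/covost2", "en_ko", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def is_short_enough(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] < MAX_SECONDS

ds = ds.filter(is_short_enough)
```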
### Datasets
ASR / AST
## Limitations

Note that this model is just a Proof of Concept (PoC) for experimental purposes and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:

- More computational resources are needed for extended training.
- For now, the model only works for Vision-Language tasks and Audio-Language tasks (ASR/AST).
- Due to the lack of computing resources, the model primarily recognizes audio files of less than 30 seconds in duration. As a result, accuracy may drop significantly for longer audio inputs; a chunking workaround is sketched after this list.
- If possible, we will train the model for Speech-Vision tasks and more Audio-Language tasks.
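Until longer inputs are supported, one rough workaround is to split a long recording into sub-30-second chunks and run each chunk through the snippets in the Usage section below. This is only a sketch; chunk boundaries may cut words, so expect some degradation.

```python
# Sketch: split a long recording into <30 s chunks. "long_recording.wav" is a
# hypothetical local file. Each chunk can then be transcribed separately with the
# Usage snippets below and the transcripts concatenated.
import soundfile as sf

CHUNK_SECONDS = 30

audio, sr = sf.read("long_recording.wav")
chunk_len = CHUNK_SECONDS * sr
chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
```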
## Usage

Below are some code snippets to get started quickly with running the model.

First, upgrade your Transformers library; `AudioInput` for `chat_template` is now supported:

```bash
$ pip install -U transformers
```

Then, copy the snippet from the section that is relevant to your use case.
### Running the model with chat_template
```python
from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"

model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Drop the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
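Since the model also accepts image inputs, the same chat template can, in principle, carry an image turn. This is only a sketch: it assumes this checkpoint's custom processor handles `"image"` entries the same way the base Gemma-3 processor does.

```python
# Sketch: vision input through the same chat template (assumption: the custom
# processor accepts "image" entries like the base Gemma-3 processor).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)
# Generation then proceeds exactly as in the audio example above.
```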
### Running the model with raw data
```python
from io import BytesIO
from urllib.request import urlopen

import soundfile
from PIL import Image  # only needed if you also pass image inputs

# Reuses `model` and `processor` loaded in the previous snippet.
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))

audio_token = '<start_of_audio>'
messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Drop the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
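The same raw-data path handles transcription instead of translation by changing the instruction after the audio token, reusing the prompt from the chat_template example above:

```python
# ASR instead of AST: same pipeline, different instruction after the audio token.
messages = [
    {'role': 'user', 'content': audio_token + 'Transcribe this audio clip into text.'},
]
```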
### Fine-tune the model

Here is the fine-tuning script: Link

You must change `output_dir` and `upload_dir`, and adapt the script to your datasets.

```bash
python finetune_speech.py
```
## Citation

```bibtex
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}
```