
Gemma 3 4b It Speech

Developed by junnei
Gemma-3-MM is a multimodal instruction model that extends Gemma-3-4b-it with speech processing. It accepts text, image, and audio inputs and generates text outputs.
Downloads 383
Release Time: 3/22/2025

Model Overview

An open-source multimodal instruction model that adds speech processing to Gemma-3 and supports speech recognition and speech translation for English and Korean.

Model Features

Multimodal processing capability
Can simultaneously process text, image, and audio inputs to generate text outputs
Long context support
Supports context lengths of up to 128K tokens (32K for 1B model)
Speech adapter
Adds speech processing through a 596M-parameter LoRA adapter
Multilingual support
Supports speech recognition and translation for English and Korean
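
As a rough illustration of how a LoRA adapter adds parameters to a base model, the sketch below counts the extra weights for a set of adapted layers. It relies only on the general LoRA construction (each adapted weight matrix gains two low-rank factors). The layer shapes and the rank are hypothetical values chosen for the example and are not taken from this model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair.

    A has shape (rank, d_in) and B has shape (d_out, rank),
    so the pair adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


def adapter_param_count(layer_shapes, rank: int) -> int:
    """Total parameters added across all adapted layers."""
    return sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layer_shapes)


# Hypothetical example: four square projection matrices adapted at rank 8.
# These numbers are assumptions for illustration, not this model's configuration.
example_shapes = [(2560, 2560)] * 4
example_total = adapter_param_count(example_shapes, rank=8)
print(f"LoRA parameters added: {example_total:,}")
```

Because the count grows linearly with the rank and with the number of adapted layers, the size of an adapter depends heavily on which layers are adapted and at what rank.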

Model Capabilities

Text generation
Speech recognition
Speech translation
Multimodal understanding

Use Cases

Speech transcription
English speech transcription
Convert English speech to text
Achieved a BLEU score of 94.28 on the LibriSpeech clean test set
Korean speech transcription
Convert Korean speech to text
Achieved a BLEU score of 94.91 on the Zeroth test set
Speech translation
English-Korean translation
Translate English speech to Korean text
Achieved a BLEU score of 31.55 on the CoVoST 2 test set
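
For readers unfamiliar with the metric, the following is a minimal sketch of a corpus-level BLEU computation on a 0 to 100 scale. It assumes one reference per hypothesis, whitespace tokenization, and no smoothing. The scores listed above were presumably produced with a standard evaluation toolkit whose tokenization and settings may differ, so this sketch is not expected to reproduce them exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100): clipped n-gram precision with a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Made-up sentences for illustration only.
score = corpus_bleu(
    ["the cat sat on the mat today"],
    ["the cat sat on the mat"],
)
print(f"BLEU: {score:.2f}")
```

A hypothesis that matches its reference exactly scores 100, and the score drops as n-gram overlap falls or as the hypothesis becomes shorter than the reference.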