🚀 Phi-4-mm-inst-asr-singlish
Phi-4-multimodal-instruct-asr-singlish (Phi-4-mm-inst-asr-singlish) tackles a significant limitation of broad Large Multimodal Models (LMMs) such as Microsoft's Phi-4: the under-representation of regional dialects. Singlish, with its code-switching and distinctive prosody, often confounds generic models.
At the same time, Phi-4's extensive pre-training has captured complex linguistic structure, which promises better generalization than smaller Automatic Speech Recognition (ASR) systems such as Whisper. This targeted adaptation of Phi-4-multimodal-instruct (Phi-4-mm-inst) is a step towards the broader goal of a unified model that can listen, understand, and respond naturally. It lays the foundation for voice-first agents that can reason, translate, and generate code seamlessly within a single context.
✨ Features
- Addresses the under-representation of Singlish in large-scale models.
- Combines near-state-of-the-art ASR with a full generative LLM.
- Learns task-specific stopping during fine-tuning.
📦 Installation
For first-time use, you might need to install the additional libraries below:
```python
# (Jupyter/Colab) install build dependencies for flash-attn.
!pip install backoff
!sudo apt-get install -y cmake ninja-build
!pip install wheel

from pkg_resources import get_distribution, DistributionNotFound

package_name = 'flash_attn'

# Build flash-attn from source only if it is not already installed.
try:
    dist = get_distribution(package_name)
    print(f"'{package_name}' version {dist.version} is already installed.")
except DistributionNotFound:
    !MAX_JOBS=8 pip install flash-attn --no-build-isolation
```
💻 Usage Examples
Basic Usage
```python
import torch
import soundfile
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "mjwong/Phi-4-mm-inst-asr-singlish"

kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

generation_config = GenerationConfig.from_pretrained(model_path, 'generation_config.json')

# Phi-4 chat template: the <|audio_1|> placeholder marks where the audio input is attached.
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

speech_prompt = "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
```
Advanced Usage
You can then transcribe audio of arbitrary length. As an illustration, the audio file ignite.wav can be downloaded from this link.
```python
# soundfile.read returns a (waveform, sampling_rate) tuple, which the processor expects.
audio = soundfile.read('./ignite.wav')

inputs = processor(text=prompt, audios=[audio], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    generation_config=generation_config,
    num_logits_to_keep=1,
)

# Drop the prompt tokens so only the newly generated transcription is decoded.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
```
📚 Documentation
Model Details
| Property | Details |
|---|---|
| Developed by | Ming Jie Wong |
| Base Model | microsoft/Phi-4-multimodal-instruct |
| Model Type | Decoder-only Transformer with vision/speech adapters |
| Metrics | Word Error Rate (WER) |
| Languages Supported | English (with a focus on Singlish) |
| License | MIT |
Description
This work uses supervised fine-tuning (SFT) of Phi-4-mm-inst for Singlish ASR, leveraging 66.9k paired audio–transcript examples. The dataset is derived solely from the Part 3 Same Room Environment Close-talk Mic recordings of [IMDA's NSC Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).
Rather than retraining all model parameters, only the `audio_embed` module (specifically its encoder and audio projection layers) is selectively unfrozen, while the remaining weights are kept fixed. During training, each audio clip is paired with its ground-truth transcript, and a dedicated end-of-transcription marker (`<|end|><|endoftext|>`) is appended. A standard cross-entropy loss is then optimized over the token sequences, teaching the model to transcribe audio features into text and to generate the marker at the end of the transcription. This data-driven approach focuses computational resources on adapting the model's audio processing to Singlish's unique phonetic, prosodic, and code-switching characteristics without altering its core language understanding.
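As a rough illustration of this selective-unfreezing setup, the sketch below freezes every parameter and then re-enables gradients only for parameters whose qualified name contains `audio_embed`. The exact module paths inside Phi-4-mm-inst may differ, so treat the name matching as an assumption rather than the actual training script.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Freeze everything first, then unfreeze only the audio embedding stack.
for param in model.parameters():
    param.requires_grad = False

trainable = 0
for name, param in model.named_parameters():
    # Assumption: the audio encoder / projection parameters live under "audio_embed".
    if "audio_embed" in name:
        param.requires_grad = True
        trainable += param.numel()

print(f"Trainable parameters: {trainable:,}")
```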
The original Part 3 of the National Speech Corpus consists of approximately 1,000 hours of conversational speech from around 1,000 local English speakers, recorded in pairs. These conversations cover everyday topics and include interactive game-based dialogues. Recordings were made in two environments:
- Same Room, where speakers shared a room and were recorded using a close-talk mic and a boundary mic.
- Separate Room, where each speaker was recorded individually using a standing mic and a telephone (IVR).
Audio segments for the internal dataset were extracted using the criteria below (a sketch of applying them follows the list):
- Minimum Word Count: 10 words
  - This threshold was chosen to ensure that each audio segment contains sufficient linguistic context for the model to better understand instructions in Singlish. Shorter segments may bias the model towards specific utterances or phrases, limiting its overall comprehension.
- Maximum Duration: 20 seconds
  - This threshold was chosen to provide enough context for accurate transcription while minimizing noise and computational complexity for longer audio segments.
- Sampling Rate: All audio segments are down-sampled to 16 kHz.
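A minimal sketch of those filters, assuming each candidate segment arrives as a transcript string, a float waveform, and its original sampling rate; `librosa` is used here purely for illustration and is not necessarily the tooling used to build the dataset.

```python
import librosa

MIN_WORDS = 10        # minimum word count per segment
MAX_DURATION_S = 20.0 # maximum segment duration in seconds
TARGET_SR = 16_000    # target sampling rate (16 kHz)

def keep_segment(transcript: str, waveform, sample_rate: int):
    """Return a 16 kHz waveform if the segment passes the filters, else None."""
    duration_s = len(waveform) / sample_rate
    if len(transcript.split()) < MIN_WORDS or duration_s > MAX_DURATION_S:
        return None
    if sample_rate != TARGET_SR:
        waveform = librosa.resample(waveform, orig_sr=sample_rate, target_sr=TARGET_SR)
    return waveform
```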
Full experiment details will be added soon.
Fine-Tuning Details
Fine-tuning was performed on a single A100-80GB GPU.
Training Hyperparameters
The following hyperparameters were used (see the sketch after this list for how they might map onto a Hugging Face `TrainingArguments` configuration):
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- Optimizer:
  - Name: ADAMW_TORCH
  - Betas: (0.9, 0.99)
  - Epsilon: 1e-07
  - Optimizer Arguments: No additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
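A hedged sketch of how these values could be expressed with the Hugging Face `Trainer` API; the actual fine-tuning script, data collator, and output paths are not published here, so the names below are illustrative only.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi4-mm-inst-asr-singlish",  # illustrative path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    bf16=True,  # matches the bfloat16 usage shown in the examples above
)
```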
Benchmark Performance
We evaluated Phi-4-mm-inst-asr-singlish on the following datasets (a sketch of the WER computation follows the list):
- SASRBench-v1: A benchmark dataset for evaluating ASR performance on Singlish.
- AMI: A widely used dataset for meeting transcription and diarization tasks. This work specifically uses the IHM (Individual Headset Microphone) recordings.
- GigaSpeech: A large-scale open-source dataset with diverse English audio, covering read, conversational, and spontaneous speech.
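Word Error Rate (WER) is the only metric reported. The sketch below shows one common way to compute it with the `jiwer` package over model outputs and reference transcripts; the benchmark harness actually used is not documented here, so treat this as an assumption about methodology rather than the evaluation code.

```python
import jiwer

# Hypothetical reference transcripts and model transcriptions.
references = ["Then how ah", "Wah the queue damn long leh"]
hypotheses = ["then how ah", "wah the queue very long leh"]

# Simple normalization before scoring; real benchmarks may normalize differently.
references = [r.lower() for r in references]
hypotheses = [h.lower() for h in hypotheses]

# Corpus-level word error rate over all utterance pairs.
wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.2%}")
```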
Model Performance
| Dataset | Model | Rel. RTFx | WER |
|---|---|---|---|
| SASRBench-v1 | microsoft/Phi-4-multimodal-instruct | 1.00 | 33.00% |
| SASRBench-v1 | mjwong/Phi-4-mm-inst-asr-singlish | 1.03 | 13.16% |
| SASRBench-v1 | mjwong/whisper-large-v3-singlish | 2.60 | 16.41% |
| SASRBench-v1 | mjwong/whisper-large-v3-turbo-singlish | 6.13 | 13.35% |
| SASRBench-v1 | mjwong/whisper-large-v3-singlish + DRAFT | 5.72 | 14.84% |
| AMI | microsoft/Phi-4-multimodal-instruct | 1.00 | 14.74% |
| AMI | mjwong/Phi-4-mm-inst-asr-singlish | 1.11 | 20.23% |
| AMI | mjwong/whisper-large-v3-singlish | 1.14 | 23.72% |
| AMI | mjwong/whisper-large-v3-turbo-singlish | 1.75 | 16.99% |
| AMI | mjwong/whisper-large-v3-singlish + DRAFT | 2.59 | 22.06% |
| GigaSpeech | microsoft/Phi-4-multimodal-instruct | 1.00 | 24.65% |
| GigaSpeech | mjwong/Phi-4-mm-inst-asr-singlish | 1.20 | 10.34% |
| GigaSpeech | mjwong/whisper-large-v3-singlish | 2.03 | 13.15% |
| GigaSpeech | mjwong/whisper-large-v3-turbo-singlish | 3.97 | 11.54% |
| GigaSpeech | mjwong/whisper-large-v3-singlish + DRAFT | 4.81 | 12.81% |

Rel. RTFx reports inference speed relative to the base microsoft/Phi-4-multimodal-instruct on the same dataset (higher is faster).
Experimental Observations
Base vs. Fine-Tuned Behavior
- Base Model: Phi-4's generalist design allows instruction-based transcription but lacks a robust stopping criterion. When prompted to generate a fixed number of tokens, it often continues past the audio's end, repeating or fabricating tokens until the `max_new_tokens` limit or an implicit end-of-sequence signal is reached (one way to impose an explicit stop condition on the base model is sketched after this list).
- Fine-Tuned Model: By associating the end-of-transcription markers during training, the model learned task-specific stopping. Even with a high `max_new_tokens` setting, it reliably generates `<|end|><|endoftext|>` immediately after completing the actual transcription, avoiding extraneous output.
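For the base model, one way to approximate the fine-tuned stopping behavior is to pass the marker tokens explicitly as end-of-sequence ids. This is a hedged sketch assuming the `model`, `processor`, and `inputs` objects from the Usage Examples above, and assuming the processor exposes its tokenizer as `processor.tokenizer`; it is not part of the released inference code.

```python
# Resolve the token ids of the transcription-ending markers.
stop_token_ids = processor.tokenizer.convert_tokens_to_ids(["<|end|>", "<|endoftext|>"])

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    eos_token_id=stop_token_ids,  # stop as soon as either marker is produced
    num_logits_to_keep=1,
)
```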
Behavior on Long Audio Clips
The output length is bounded by `max_new_tokens`, regardless of input duration. For clips requiring fewer tokens than the limit, the fine-tuned model stops cleanly at the marker. For longer clips, it produces a truncated but well-formed transcription up to the token limit, without failing or crashing.
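If a complete transcription of a very long recording is needed, one workaround is to split the audio into fixed-length windows and transcribe each window separately. This is a hedged sketch assuming the `model`, `processor`, `prompt`, and `generation_config` objects from the Usage Examples above; chunk boundaries may split words, so a production pipeline would likely add overlap or voice-activity-based segmentation.

```python
import soundfile

CHUNK_SECONDS = 20  # mirrors the maximum segment duration used in training

waveform, sample_rate = soundfile.read('./ignite.wav')
chunk_size = CHUNK_SECONDS * sample_rate

pieces = []
for start in range(0, len(waveform), chunk_size):
    chunk = waveform[start:start + chunk_size]
    inputs = processor(text=prompt, audios=[(chunk, sample_rate)], return_tensors='pt').to('cuda:0')
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=1200,
        generation_config=generation_config,
        num_logits_to_keep=1,
    )
    # Keep only the newly generated tokens for this chunk.
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    pieces.append(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])

print(' '.join(pieces))
```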
Conclusion
Fine-tuning Phi-4-mm-inst reduces its Singlish WER from 33.00% to 13.16%, closing the gap to our best-performing fine-tuned [whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish) and slightly beating it. While the advantage over Whisper is small, Phi-4's real value lies in combining near-state-of-the-art ASR with a full generative LLM in one package. For Singlish speakers, this means a single model that can hear, understand, and respond natively, paving the way for voice-first agents that can reason, translate, or generate code without leaving the same context.
🔧 Technical Details
Model Adaptation
The selective unfreezing of the `audio_embed` module allows for targeted adaptation of the model to Singlish's unique characteristics without overfitting or altering the core language understanding capabilities of Phi-4.
Training Strategy
The use of paired audio–transcript examples and the optimization of cross-entropy loss over token sequences with end-of-transcription markers enables the model to learn task-specific stopping and accurate transcription.
📄 License
This model is licensed under the MIT license.
⚠️ Important Note
While this model has been fine-tuned to better recognize Singlish, users may experience inaccuracies, biases, or unexpected outputs, particularly in challenging audio conditions or with speakers using non-standard variations. Use of this model is at your own risk; the developers and distributors are not liable for any consequences arising from its use. Please validate results before deploying in any sensitive or production environment.
Contact
For more information, please reach out to mingjwong@hotmail.com.







