# 🚀 Phi-4-multimodal-instruct-commonvoice-zh-tw
This model is a fine-tuned version of microsoft/Phi-4-multimodal-instruct on the Common Voice 19.0 Taiwanese Mandarin dataset. It provides automated speech recognition (ASR) for Taiwanese Mandarin, reaching 31.18% WER and 6.67% CER on the test split (see Training results below).
## ✨ Features
- Multimodal Adaptation: Based on Microsoft's Phi-4 multimodal model, fine-tuned for speech recognition tasks.
- Language Specificity: Optimized for Taiwanese Mandarin, including its distinctive speech patterns and vocabulary.
- Versatile Applications: Suitable for various speech-to-text scenarios, including transcription and subtitling.
## 📦 Installation

The model runs on the standard Hugging Face stack. Install the dependencies used in the example below, e.g. `pip install torch transformers librosa` (the exact versions used during training are listed under Framework versions).
## 💻 Usage Examples

### Basic Usage
```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForCausalLM

AUDIO_PATH = "test.wav"
MODEL = "JacobLinCool/Phi-4-multimodal-instruct-commonvoice-zh-tw"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
USE_FA = True  # requires the flash-attn package and a supported GPU; set False otherwise

processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16 if USE_FA else torch.float32,
    _attn_implementation="flash_attention_2" if USE_FA else "sdpa",
    trust_remote_code=True,
).to(DEVICE)

# Load the audio at the 16 kHz sampling rate the model expects.
audio, sr = librosa.load(AUDIO_PATH, sr=16000)

user_message = {
    "role": "user",
    "content": "<|audio_1|> Transcribe the audio clip into text.",
}
prompt = processor.tokenizer.apply_chat_template(
    [user_message], tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt")
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=64,
        do_sample=False,  # greedy decoding for deterministic transcription
    )

# Decode only the newly generated tokens, skipping the prompt.
transcription = processor.decode(
    generated_ids[0, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(transcription)
```
### Advanced Usage
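Longer recordings can be split into fixed-length chunks and transcribed piece by piece. This is a minimal sketch, assuming the `model` and `processor` objects from the Basic Usage example are already loaded; the 30-second chunk length is an assumption, not a documented limit of the model:

```python
# Chunked transcription for long audio files (a sketch; assumes `model` and
# `processor` from the Basic Usage example are already in scope).
import librosa
import torch

CHUNK_SECONDS = 30  # assumed chunk length; tune for your audio

def transcribe_long(path: str) -> str:
    audio, sr = librosa.load(path, sr=16000)
    chunk_len = CHUNK_SECONDS * sr
    user_message = {
        "role": "user",
        "content": "<|audio_1|> Transcribe the audio clip into text.",
    }
    prompt = processor.tokenizer.apply_chat_template(
        [user_message], tokenize=False, add_generation_prompt=True
    )
    pieces = []
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start : start + chunk_len]
        inputs = processor(text=prompt, audios=[(chunk, sr)], return_tensors="pt")
        inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs,
                eos_token_id=processor.tokenizer.eos_token_id,
                max_new_tokens=256,  # allow longer outputs per chunk
                do_sample=False,
            )
        pieces.append(
            processor.decode(
                generated_ids[0, inputs["input_ids"].shape[1]:],
                skip_special_tokens=True,
            )
        )
    return "".join(pieces)

print(transcribe_long("long_test.wav"))
```

Cutting at fixed offsets can split words at chunk boundaries; for production use, silence-based segmentation would be preferable.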
## 📚 Documentation
### Model description
Phi-4-multimodal-instruct-commonvoice-zh-tw is a multimodal language model fine-tuned for Automated Speech Recognition (ASR) of Taiwanese Mandarin (zh-TW). The base model is Microsoft's Phi-4-multimodal-instruct, which was further trained on speech transcription tasks. The model accepts audio input and produces Traditional Chinese text transcriptions. It has been specifically optimized to recognize Taiwanese Mandarin speech patterns and vocabulary.
### Intended uses & limitations
This model is intended for:
- Transcribing spoken Taiwanese Mandarin to text
- Automated subtitling/captioning for zh-TW content
- Speech-to-text applications requiring Taiwanese Mandarin support
Limitations:
- Performance may vary with background noise, speaking speed, or accents
- The model performs best with clear audio input
- Specialized terminology or domain-specific vocabulary may have lower accuracy
### Training and evaluation data
The model was fine-tuned on the Common Voice 19.0 Taiwanese Mandarin dataset. Common Voice is a crowdsourced speech dataset containing contributions from volunteers who record themselves reading sentences in various languages. Evaluation was performed on the test split of the same dataset, consisting of 5,013 samples.
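For reference, here is a minimal sketch of loading the same test split with the `datasets` library. The dataset id `mozilla-foundation/common_voice_19_0` and the `zh-TW` config name are assumptions based on Mozilla's naming on the Hugging Face Hub, and the dataset is gated, so its terms must be accepted first:

```python
# A sketch, not the card's official evaluation script: load the Common Voice
# 19.0 zh-TW test split (dataset id and config name are assumptions).
from datasets import Audio, load_dataset

cv_test = load_dataset(
    "mozilla-foundation/common_voice_19_0",  # assumed Hub id; gated dataset
    "zh-TW",
    split="test",
    trust_remote_code=True,
)
# Resample to the 16 kHz rate used in the usage examples above.
cv_test = cv_test.cast_column("audio", Audio(sampling_rate=16000))
print(len(cv_test))            # expected: 5013 per the evaluation above
print(cv_test[0]["sentence"])  # reference transcription
```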
### Training procedure
The model was trained using LoRA adapters focused on the speech recognition components of the base model, allowing efficient fine-tuning while preserving the general capabilities of the underlying Phi-4 model.
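To make the idea concrete, here is a minimal sketch of a LoRA setup with the `peft` library. The rank, scaling, and target modules below are illustrative assumptions; the card does not document the exact adapter configuration used for this run:

```python
# Illustrative LoRA configuration (all values are assumptions, not the
# settings actually used to train this checkpoint).
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,            # assumed adapter rank
    lora_alpha=32,   # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
# Wrapping a base model would then look like:
# model = get_peft_model(base_model, lora_config)
```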
### Prompt format
This model follows the prompt template from the original paper. For speech recognition tasks, the audio input is provided inline with a simple instruction:
```
<|user|>
<|audio_1|> Transcribe the audio clip into text.
<|assistant|>
[Transcription output in Traditional Chinese]
<|end|>
```
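As a sanity check, the rendered prompt from the Basic Usage example can be printed; the model's bundled chat template is the source of truth for the exact token layout, so treat the block above as illustrative:

```python
# Reuses `processor` and `user_message` from the Basic Usage example.
rendered = processor.tokenizer.apply_chat_template(
    [user_message], tokenize=False, add_generation_prompt=True
)
print(rendered)  # shows the special tokens the template actually emits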
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 4e-05
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 32
- total_train_batch_size: 128 (train_batch_size × gradient_accumulation_steps = 4 × 32, assuming a single device)
- optimizer: AdamW (torch) with betas=(0.9, 0.95), epsilon=1e-07, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 50
- num_epochs: 2
### Training results
The model achieved the following performance metrics on the test set:
- Word Error Rate (WER): 31.18%
- Character Error Rate (CER): 6.67%
- Number of evaluation samples: 5,013
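For context, here is a minimal sketch of computing these metrics with the Hugging Face `evaluate` package; this is an assumption about tooling, not the card's original evaluation script. For Mandarin, WER is typically computed over space-separated tokens, while CER compares raw character sequences, which is why CER is usually the more informative number for Chinese ASR:

```python
# A sketch of computing WER/CER with the `evaluate` package (an assumption;
# the original evaluation script is not included in this card).
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["今天天氣很好"]   # ground-truth transcript (toy example)
predictions = ["今天天氣真好"]  # model output (toy example)

# CER compares raw character sequences; for Mandarin WER, texts are often
# split into per-character "words" since there are no spaces.
cer = cer_metric.compute(references=references, predictions=predictions)
wer = wer_metric.compute(
    references=[" ".join(r) for r in references],
    predictions=[" ".join(p) for p in predictions],
)
print(f"CER: {cer:.4f}, WER: {wer:.4f}")
```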
### Framework versions
- Transformers 4.49.0
- Pytorch 2.4.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.1
## 🔧 Technical Details
The model uses LoRA adapters for fine-tuning on the speech recognition components of the base model. This approach allows for efficient parameter updates while maintaining the overall capabilities of the Phi-4 model. The fine-tuning process focuses on adapting the model to the specific characteristics of Taiwanese Mandarin speech, including its unique vocabulary and speech patterns.
## 📄 License
The model is released under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned multimodal language model for ASR of Taiwanese Mandarin |
| Training Data | Common Voice 19.0 Taiwanese Mandarin dataset |
## ⚠️ Important Note
Performance may vary with background noise, speaking speed, or accents. The model performs best with clear audio input, and specialized terminology or domain-specific vocabulary may have lower accuracy.
## 💡 Usage Tip
For optimal results, ensure that the audio input is clear and free from excessive background noise. Also, be aware that the model's performance might be affected by different speaking styles and accents.