# 🚀 Phi-4-multimodal-instruct-commonvoice-zh-tw
This model is a fine-tuned version of microsoft/Phi-4-multimodal-instruct on the Common Voice 19.0 Taiwanese Mandarin dataset. It provides automated speech recognition (ASR) for Taiwanese Mandarin, reaching 31.18% WER and 6.67% CER on the test split (see Training results below).
## ✨ Features
- Multimodal Adaptation: Based on Microsoft's Phi-4 multimodal model, fine-tuned for speech recognition tasks.
- Language Specificity: Optimized for Taiwanese Mandarin, including its distinctive speech patterns and vocabulary.
- Versatile Applications: Suitable for various speech-to-text scenarios, including transcription and subtitling.
## 📦 Installation

The model runs on the standard Hugging Face stack. Install the dependencies used in the example below, e.g. `pip install torch transformers librosa` (the exact versions used during training are listed under Framework versions).
## 💻 Usage Examples

### Basic Usage
```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForCausalLM

AUDIO_PATH = "test.wav"
MODEL = "JacobLinCool/Phi-4-multimodal-instruct-commonvoice-zh-tw"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
USE_FA = True  # requires the flash-attn package and a supported GPU; set False otherwise

processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16 if USE_FA else torch.float32,
    _attn_implementation="flash_attention_2" if USE_FA else "sdpa",
    trust_remote_code=True,
).to(DEVICE)

# Load the audio at the 16 kHz sampling rate the model expects.
audio, sr = librosa.load(AUDIO_PATH, sr=16000)

user_message = {
    "role": "user",
    "content": "<|audio_1|> Transcribe the audio clip into text.",
}
prompt = processor.tokenizer.apply_chat_template(
    [user_message], tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt")
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=64,
        do_sample=False,  # greedy decoding for deterministic transcription
    )

# Decode only the newly generated tokens, skipping the prompt.
transcription = processor.decode(
    generated_ids[0, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(transcription)
```
### Advanced Usage
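Longer recordings can be split into fixed-length chunks and transcribed piece by piece. This is a minimal sketch, assuming the `model` and `processor` objects from the Basic Usage example are already loaded; the 30-second chunk length is an assumption, not a documented limit of the model:

```python
# Chunked transcription for long audio files (a sketch; assumes `model` and
# `processor` from the Basic Usage example are already in scope).
import librosa
import torch

CHUNK_SECONDS = 30  # assumed chunk length; tune for your audio

def transcribe_long(path: str) -> str:
    audio, sr = librosa.load(path, sr=16000)
    chunk_len = CHUNK_SECONDS * sr
    user_message = {
        "role": "user",
        "content": "<|audio_1|> Transcribe the audio clip into text.",
    }
    prompt = processor.tokenizer.apply_chat_template(
        [user_message], tokenize=False, add_generation_prompt=True
    )
    pieces = []
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start : start + chunk_len]
        inputs = processor(text=prompt, audios=[(chunk, sr)], return_tensors="pt")
        inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs,
                eos_token_id=processor.tokenizer.eos_token_id,
                max_new_tokens=256,  # allow longer outputs per chunk
                do_sample=False,
            )
        pieces.append(
            processor.decode(
                generated_ids[0, inputs["input_ids"].shape[1]:],
                skip_special_tokens=True,
            )
        )
    return "".join(pieces)

print(transcribe_long("long_test.wav"))
```

Cutting at fixed offsets can split words at chunk boundaries; for production use, silence-based segmentation would be preferable.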
## 📚 Documentation
### Model description
Phi-4-multimodal-instruct-commonvoice-zh-tw is a multimodal language model fine-tuned for Automated Speech Recognition (ASR) of Taiwanese Mandarin (zh-TW). The base model is Microsoft's Phi-4-multimodal-instruct, which was further trained on speech transcription tasks. The model accepts audio input and produces Traditional Chinese text transcriptions. It has been specifically optimized to recognize Taiwanese Mandarin speech patterns and vocabulary.
### Intended uses & limitations
This model is intended for:
- Transcribing spoken Taiwanese Mandarin to text
- Automated subtitling/captioning for zh-TW content
- Speech-to-text applications requiring Taiwanese Mandarin support
Limitations:
- Performance may vary with background noise, speaking speed, or accents
- The model performs best with clear audio input
- Specialized terminology or domain-specific vocabulary may have lower accuracy
### Training and evaluation data
The model was fine-tuned on the Common Voice 19.0 Taiwanese Mandarin dataset. Common Voice is a crowdsourced speech dataset containing contributions from volunteers who record themselves reading sentences in various languages. Evaluation was performed on the test split of the same dataset, consisting of 5,013 samples.
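For reference, here is a minimal sketch of loading the same test split with the `datasets` library. The dataset id `mozilla-foundation/common_voice_19_0` and the `zh-TW` config name are assumptions based on Mozilla's naming on the Hugging Face Hub, and the dataset is gated, so its terms must be accepted first:

```python
# A sketch, not the card's official evaluation script: load the Common Voice
# 19.0 zh-TW test split (dataset id and config name are assumptions).
from datasets import Audio, load_dataset

cv_test = load_dataset(
    "mozilla-foundation/common_voice_19_0",  # assumed Hub id; gated dataset
    "zh-TW",
    split="test",
    trust_remote_code=True,
)
# Resample to the 16 kHz rate used in the usage examples above.
cv_test = cv_test.cast_column("audio", Audio(sampling_rate=16000))
print(len(cv_test))            # expected: 5013 per the evaluation above
print(cv_test[0]["sentence"])  # reference transcription
```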
### Training procedure
The model was trained using LoRA adapters focused on the speech recognition components of the base model, allowing efficient fine-tuning while preserving the general capabilities of the underlying Phi-4 model.
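To make the idea concrete, here is a minimal sketch of a LoRA setup with the `peft` library. The rank, scaling, and target modules below are illustrative assumptions; the card does not document the exact adapter configuration used for this run:

```python
# Illustrative LoRA configuration (all values are assumptions, not the
# settings actually used to train this checkpoint).
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,            # assumed adapter rank
    lora_alpha=32,   # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
# Wrapping a base model would then look like:
# model = get_peft_model(base_model, lora_config)
```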
### Prompt format
This model follows the prompt template from the original paper. For speech recognition tasks, the audio input is provided inline with a simple instruction:
```
<|user|>
<|audio_1|> Transcribe the audio clip into text.
<|assistant|>
[Transcription output in Traditional Chinese]
<|end|>
```
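As a sanity check, the rendered prompt from the Basic Usage example can be printed; the model's bundled chat template is the source of truth for the exact token layout, so treat the block above as illustrative:

```python
# Reuses `processor` and `user_message` from the Basic Usage example.
rendered = processor.tokenizer.apply_chat_template(
    [user_message], tokenize=False, add_generation_prompt=True
)
print(rendered)  # shows the special tokens the template actually emits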
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 4e-05
- train_batch_size: 4
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 32
- total_train_batch_size: 128 (train_batch_size × gradient_accumulation_steps = 4 × 32, assuming a single device)
- optimizer: AdamW (torch) with betas=(0.9, 0.95), epsilon=1e-07, and no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 50
- num_epochs: 2
### Training results
The model achieved the following performance metrics on the test set:
- Word Error Rate (WER): 31.18%
- Character Error Rate (CER): 6.67%
- Number of evaluation samples: 5,013
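For context, here is a minimal sketch of computing these metrics with the Hugging Face `evaluate` package; this is an assumption about tooling, not the card's original evaluation script. For Mandarin, WER is typically computed over space-separated tokens, while CER compares raw character sequences, which is why CER is usually the more informative number for Chinese ASR:

```python
# A sketch of computing WER/CER with the `evaluate` package (an assumption;
# the original evaluation script is not included in this card).
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["今天天氣很好"]   # ground-truth transcript (toy example)
predictions = ["今天天氣真好"]  # model output (toy example)

# CER compares raw character sequences; for Mandarin WER, texts are often
# split into per-character "words" since there are no spaces.
cer = cer_metric.compute(references=references, predictions=predictions)
wer = wer_metric.compute(
    references=[" ".join(r) for r in references],
    predictions=[" ".join(p) for p in predictions],
)
print(f"CER: {cer:.4f}, WER: {wer:.4f}")
```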
### Framework versions
- Transformers 4.49.0
- Pytorch 2.4.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.1
## 🔧 Technical Details
The model uses LoRA adapters for fine-tuning on the speech recognition components of the base model. This approach allows for efficient parameter updates while maintaining the overall capabilities of the Phi-4 model. The fine-tuning process focuses on adapting the model to the specific characteristics of Taiwanese Mandarin speech, including its unique vocabulary and speech patterns.
## 📄 License
The model is released under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned multimodal language model for ASR of Taiwanese Mandarin |
| Training Data | Common Voice 19.0 Taiwanese Mandarin dataset |
## ⚠️ Important Note
Performance may vary with background noise, speaking speed, or accents. The model performs best with clear audio input, and specialized terminology or domain-specific vocabulary may have lower accuracy.
## 💡 Usage Tip
For optimal results, ensure that the audio input is clear and free from excessive background noise. Also, be aware that the model's performance might be affected by different speaking styles and accents.