Phi-4-multimodal-instruct-ko-asr Open Source Model - Korean Speech Recognition and Translation, Excellent Performance on Datasets

Phi 4 Multimodal Instruct Ko Asr

Developed by junnei

A Korean automatic speech recognition (ASR) and speech translation (AST) model fine-tuned based on microsoft/Phi-4-multimodal-instruct, demonstrating excellent performance on the zeroth-korean and fleurs datasets.

Text-to-Audio

Transformers

Korean#Korean speech recognition #Low character error rate #Multimodal instruction fine-tuning

Downloads 354

Release Time : 3/5/2025

Model Overview

This model focuses on Korean speech recognition and translation tasks, improving recognition accuracy and translation quality in Korean environments through fine-tuning.

Model Features

High-performance Korean recognition

Achieves a character error rate (CER) of 1.316 and a word error rate (WER) of 2.951 on the zeroth-korean test set.

Multi-task support

Supports both automatic speech recognition (ASR) and speech translation (AST) tasks simultaneously.

Optimized training

Utilized H100 GPU for 960 steps of targeted training, significantly enhancing Korean language processing capabilities.

Model Capabilities

Korean speech recognition

Korean-English speech translation

English-Korean speech translation

Use Cases

Speech transcription

Korean meeting minutes

Real-time transcription of Korean meeting recordings into text

Achieves a character error rate of only 1.316% on the zeroth test set.

Speech translation

Korean-English real-time translation

Real-time translation of Korean speech into English text

Achieves a BLEU score of 67.659 on the fleurs Korean test set.

🚀 Phi-4-multimodal-instruct-ko-asr

This model is fine-tuned for Korean automatic speech recognition and translation, achieving high performance on multiple datasets.

🚀 Quick Start

This model is fine-tuned from microsoft/Phi-4-multimodal-instruct on Bingsu/zeroth-korean, google/flerus in 5 epochs.

This model is trained 960 steps on datasets for Korean Audio Speech Recognition on H100.

After that, we continue training with CoVoST2 Dataset / CoVoST2-Ko for AST.

AST Finetuned model is Here : Phi-4-multimodal-instruct-ko-speech

📚 Documentation

Model Information

Property	Details
Library Name	transformers
Datasets	Bingsu/zeroth-korean, google/fleurs
Languages	Korean
Metrics	CER, WER, BLEU
Base Model	microsoft/Phi-4-multimodal-instruct
Pipeline Tag	automatic-speech-recognition

Model Index

Name: Phi-4-multimodal-instruct-ko-asr
Results:
- Task: Automatic Speech Recognition
- Dataset: Bingsu/zeroth_korean (zeroth-korean-test)
- Metrics:
  - BLEU: 94.837
  - CER: 1.316
  - WER: 2.951
- Task: Automatic Speech Recognition
- Dataset: google/flerus (flerus-ko-test)
- Metrics:
  - BLEU: 67.659
  - CER: 7.951
  - WER: 18.313

🔧 Technical Details

This model is fine-tuned from microsoft/Phi-4-multimodal-instruct on Korean datasets for automatic speech recognition and translation. It is trained for 5 epochs on Bingsu/zeroth-korean and google/flerus, and then further trained on CoVoST2 Dataset / CoVoST2-Ko for AST.

📄 Evaluation

Evaluation was done on the following datasets:

ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) on zeroth-test set (457 samples).
AST (Automatic Speech Translation): Evaluated with BLEU score on fleurs ko <-> en speech translation result (270 samples).

Script is retrieved from here.

Compared to Phi-4-mm-inst-zeroth-kor and Phi-4-multimodal-finetune-ko-speech, ASR is significantly improved.

Model	zeroth-CER	zeroth-WER	fleurs-ko_en-BLEU	fleurs-ko_en-cot-BLEU	fleurs-en_ko-BLEU	fleurs-en_ko-cot-BLEU
original	198.32	-	5.63	2.42	6.86	4.17
daekeun-ml/Phi-4-multimodal-finetune-ko-speech	1.61	3.54	7.67	8.38	12.31	9.69
seastar105/Phi-4-mm-inst-zeroth-kor	7.02	-	7.07	9.19	13.08	9.35
ASR finetune(this model)	1.31	2.95	7.46	6.24	12.15	8.91
+ 1 epoch finetune with Covost-Ko	3.88	-	8.07	10.09	18.82	15.41
AST finetuned model	1.77	2.99	8.01	9.09	17.09	11.82

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご