Kotoba-Whisper-Bilingual (v1.0)
Kotoba-Whisper-Bilingual is a collection of distilled Whisper models designed for Japanese and English automatic speech recognition and speech-to-text translation between the two languages.
faster-whisper weights, whisper.cpp weights
Features
Kotoba-Whisper-Bilingual is a collection of distilled Whisper models trained for:
- Japanese ASR
- English ASR
- Speech-to-text translation (Japanese -> English)
- Speech-to-text translation (English -> Japanese)
It was developed through a collaboration between Asahi Ushio and Kotoba Technologies. Following the original work of distil-whisper (Robust Knowledge Distillation via Large-Scale Pseudo Labelling), we employ OpenAI's Whisper large-v3 as the teacher model for Japanese and English ASR. For speech-to-text translation, we translate the transcriptions into English and Japanese with an external LLM to obtain the training dataset.
We use ReazonSpeech for Japanese ASR and Japanese speech to English text translation, and Multilingual LibriSpeech for English ASR and English speech to Japanese text translation.
Kotoba-Whisper-Bilingual's loss objective consists of cross-entropy on both the ASR and translation tasks, while the KL divergence loss is applied only to the ASR task. The student model consists of the full encoder of the teacher large-v3 model and a two-layer decoder initialized from the first and last layers of the large-v3 model.
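Spelled out, the objective described above takes roughly the following form; the relative weighting λ is an assumption for illustration, as this card does not specify the coefficients:

```latex
% Sketch of the distillation objective described above.
% CE = cross-entropy against the (pseudo-)labels, KL = student/teacher KL divergence.
% \lambda is a hypothetical weighting factor, not specified in this card.
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}^{\mathrm{ASR}}
          \;+\; \mathcal{L}_{\mathrm{CE}}^{\mathrm{translation}}
          \;+\; \lambda \, \mathcal{L}_{\mathrm{KL}}^{\mathrm{ASR}}
```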
As Kotoba-Whisper-Bilingual uses the same architecture as distil-whisper/distil-large-v3, it inherits the benefit of improved latency compared to openai/whisper-large-v3 (6.3x faster than large-v3, as reported in the distil-whisper/distil-large-v3 model card).
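To make the four tasks listed above concrete, here is a minimal inference sketch using the Hugging Face transformers pipeline. It assumes the model is published under the repo id kotoba-tech/kotoba-whisper-bilingual-v1.0 and that tasks are selected through Whisper's standard language/task generation arguments; check the repository for the exact generate_kwargs recommended for each direction.

```python
# Minimal sketch, assuming the repo id below and standard Whisper
# language/task generation arguments; adjust generate_kwargs per task.
import torch
from transformers import pipeline

model_id = "kotoba-tech/kotoba-whisper-bilingual-v1.0"  # assumed repo id
use_cuda = torch.cuda.is_available()

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch.float16 if use_cuda else torch.float32,
    device="cuda:0" if use_cuda else "cpu",
)

# Japanese ASR: transcribe Japanese audio into Japanese text.
result = pipe("sample_ja.wav", generate_kwargs={"language": "ja", "task": "transcribe"})
print(result["text"])

# The other tasks (English ASR, Japanese->English and English->Japanese translation)
# are selected by changing the language/task pair in generate_kwargs; see the model card.
```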
Documentation
Evaluation
We compare kotoba-whisper-bilingual with OpenAI Whisper models, kotoba-whisper models, and cascaded models for translation. It is worth noting that kotoba-whisper-bilingual is the only model that can perform Japanese and English ASR as well as speech-to-text translation between Japanese and English: OpenAI Whisper is not trained for English-to-Japanese speech-to-text translation, and the other models are task-specific (e.g., kotoba-whisper covers only Japanese ASR and distil-whisper covers only English ASR).
Evaluation metrics:
- Speech-to-text translation (Japanese -> English): WER (smaller is better)
- Speech-to-text translation (English -> Japanese): CER (smaller is better)
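As an illustration of how these metrics can be computed (not necessarily the exact evaluation script used here), the Hugging Face evaluate library provides standard WER and CER implementations:

```python
# Illustrative metric computation with the `evaluate` library; the actual
# evaluation setup for this card may differ.
import evaluate

wer_metric = evaluate.load("wer")  # word error rate, used for English outputs
cer_metric = evaluate.load("cer")  # character error rate, used for Japanese outputs

predictions = ["this is a sample translation"]
references = ["this is a sample translation"]

print(wer_metric.compute(predictions=predictions, references=references))
print(cer_metric.compute(predictions=predictions, references=references))
```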
Technical Details
The model uses the following configuration and datasets:

| Property | Details |
|:---|:---|
| Model Type | Distilled Whisper models |
| Training Data | japanese-asr/en_asr.mls, japanese-asr/ja_asr.reazon_speech_all |
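For reference, the corpora listed in the table above are hosted on the Hugging Face Hub and can be pulled with the datasets library. The snippet below is a sketch assuming those repo ids are public datasets; split and configuration names are assumptions, so check each dataset card.

```python
# Sketch: load the corpora listed above with the Hugging Face `datasets` library.
# Repo ids are taken from the table; the "test" split is an assumption, check
# each dataset card for the actual split/configuration names.
from datasets import load_dataset

mls = load_dataset("japanese-asr/en_asr.mls", split="test")                   # English speech
reazon = load_dataset("japanese-asr/ja_asr.reazon_speech_all", split="test")  # Japanese speech

print(mls)
print(reazon)
```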
License
This project is licensed under the Apache-2.0 license.