Whisper-large-et
This is a Whisper-large-v2 model finetuned on Estonian data, designed for general-purpose speech recognition.
🚀 Quick Start
Recommended: use faster-whisper.
For example:
- Convert the HF model to CT2 format:

```bash
ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16
```

- Decode:

```bash
whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3
```
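The converted model can also be used directly from Python through the faster-whisper API. A minimal sketch, assuming the whisper-large-et.ct2 directory produced above and a CUDA-capable GPU (some_file.mp3 is just a placeholder):

```python
from faster_whisper import WhisperModel

# Load the CTranslate2 model produced by the conversion step above;
# use device="cpu" with compute_type="int8" if no GPU is available.
model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# Same settings as the CLI example: Estonian, beam size 5.
segments, info = model.transcribe("some_file.mp3", language="et", beam_size=5)

print("Detected language:", info.language)
print(" ".join(segment.text.strip() for segment in segments))
```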
✨ Features
- This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
- It is finetuned on around 1200 hours of diverse Estonian data.
- The 2023-10-03 version of the model is trained on long segments and is well suited for "end-to-end" transcription of long speech recordings.
📦 Installation
No separate installation is needed beyond the tools used in the "Quick Start" section: ct2-transformers-converter ships with the ctranslate2 Python package (the conversion also requires transformers), and the whisper-ctranslate2 CLI is available from PyPI.
💻 Usage Examples
Basic Usage
```bash
ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16
whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3
```
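Since the checkpoint is published in Hugging Face transformers format, it can presumably also be used without CTranslate2 conversion. A sketch using the transformers ASR pipeline (the chunking and generation settings below are illustrative assumptions, not taken from the model card):

```python
from transformers import pipeline

# Load the HF checkpoint directly; chunk_length_s enables long-form
# transcription by splitting the audio into 30-second windows.
asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-large-et",
    chunk_length_s=30,
)

result = asr(
    "some_file.mp3",  # placeholder file name
    generate_kwargs={"language": "et", "task": "transcribe"},
)
print(result["text"])
```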
📚 Documentation
Model description
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
Intended uses & limitations
This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.
Limitations and bias
Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:
- Speech containing technical and other domain-specific terms
- Children's speech
- Non-native speech
- Speech recorded under very noisy conditions or with a microphone far from the speaker
- Very spontaneous and overlapping speech
Training data
Acoustic training data:
| Type | Amount (h) |
|---|---|
| Broadcast speech | 991 |
| Spontaneous speech | 53 |
| Elderly speech corpus | 53 |
| Talks, lectures | 49 |
| Parliament speeches | 31 |
| Total | 1161 |
Training procedure
Finetuned using Espnet, and then converted to the transformers format using this script.
The finetuning procedure is similar to that of this model.
Finetuning was done for 3 epochs, with model averaging at the end of training.
Update: the 2023-10-03 version of the model is trained on long segments (like the original Whisper model) and is therefore especially well suited to be used, e.g., with faster-whisper to transcribe long speech recordings "end-to-end" (i.e., without any prior segmentation).
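A minimal sketch of such end-to-end transcription of a long recording with faster-whisper, writing out timestamped segments (the VAD filter and file names are illustrative assumptions):

```python
from faster_whisper import WhisperModel

model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# No prior segmentation: faster-whisper slides over the whole recording
# and yields segments lazily, so long files are processed incrementally.
segments, info = model.transcribe(
    "long_recording.mp3",  # placeholder file name
    language="et",
    beam_size=5,
    vad_filter=True,       # optionally skip long stretches of silence
)

with open("long_recording.txt", "w", encoding="utf-8") as out:
    for seg in segments:
        out.write(f"{seg.start:8.2f} {seg.end:8.2f} {seg.text.strip()}\n")
```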
Evaluation results
WER
WER results below are obtained using greedy decoding (i.e., beam size 1).
| Dataset | WER (%) |
|---|---|
| Common Voice 8.0 | 11.3 |
| Common Voice 11.0 | 12.0 |
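For context, a sketch of how a WER figure like the above could be reproduced with greedy decoding (beam size 1, as stated above), assuming the converted model and the jiwer library; the evaluation pairs and setup here are hypothetical, as the model card does not describe them:

```python
from faster_whisper import WhisperModel
from jiwer import wer

model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# Hypothetical evaluation pairs: (audio path, reference transcript).
eval_set = [
    ("clip_0001.mp3", "see on näidislause"),
    # ...
]

references, hypotheses = [], []
for audio_path, reference in eval_set:
    segments, _ = model.transcribe(audio_path, language="et", beam_size=1)  # greedy
    hypotheses.append(" ".join(seg.text.strip() for seg in segments))
    references.append(reference)

print(f"WER: {wer(references, hypotheses):.3f}")
```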
📄 License
This model is licensed under CC BY 4.0.