🚀 General-purpose Latvian ASR model
A fine-tuned model for Latvian automatic speech recognition, based on whisper-large-v3.
This is a fine-tuned [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for Latvian, trained by AiLab.lv using two general-purpose speech datasets: the Latvian part of Common Voice 19.0 and the latest version of the Latvian broadcast dataset [LATE-Media](https://korpuss.lv/id/LATE-mediji).
This version of the model supersedes the previous [whisper-large-v3-lv-late-cv17](https://huggingface.co/AiLab-IMCS-UL/whisper-large-v3-lv-late-cv17) model.
We also provide 4-bit, 5-bit and 8-bit quantized versions of the model in the GGML format for use with whisper.cpp, as well as an 8-bit quantized version for use with CTranslate2.
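Since the model is a fine-tuned whisper-large-v3 checkpoint, it can presumably be loaded like any other Whisper model through the Transformers `pipeline` API. The following is a minimal sketch under that assumption, not the authors' reference usage. The function names are illustrative, and `model_id` is left as a parameter because this section does not state the repository ID of the new model.

```python
def whisper_generate_kwargs(language="latvian", task="transcribe"):
    """Decoding options that pin Whisper to Latvian transcription."""
    return {"language": language, "task": task}


def transcribe(audio_path, model_id):
    """Transcribe one audio file with a Whisper checkpoint.

    model_id is the Hub repository ID (or a local path) of this model.
    It is a placeholder argument here, not a value taken from the model card.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # process long recordings in 30-second windows
    )
    result = asr(audio_path, generate_kwargs=whisper_generate_kwargs())
    return result["text"]
```

Calling `transcribe("recording.wav", model_id=...)` downloads the checkpoint on first use; decoding an audio file from a path also requires ffmpeg to be available on the system.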
✨ Features
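- Fine-tuned for Latvian on two general-purpose speech datasets (Common Voice 19.0 and LATE-Media).
- Supersedes the previous whisper-large-v3-lv-late-cv17 model.
- Provides 4-bit, 5-bit and 8-bit GGML quantizations for whisper.cpp and an 8-bit quantization for CTranslate2.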
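As an illustration of the CTranslate2 route, the sketch below uses the third-party `faster-whisper` package, which runs Whisper models converted to the CTranslate2 format. This is an assumption about tooling: the model card names only CTranslate2 itself. `model_dir` is a hypothetical local path to the downloaded 8-bit model, and the function names are illustrative.

```python
def join_segments(segments):
    """Concatenate the text of decoded segments into one transcript string."""
    return " ".join(seg.text.strip() for seg in segments)


def transcribe_int8(audio_path, model_dir):
    """Transcribe Latvian audio with an 8-bit CTranslate2 Whisper model.

    model_dir is a hypothetical local directory holding the converted model.
    """
    # Imported lazily so join_segments is usable without faster-whisper.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_dir, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus run metadata.
    segments, _info = model.transcribe(audio_path, language="lv")
    return join_segments(segments)
```

Decoding only happens while the segment generator is consumed, so joining the segments inside the function keeps the model alive for exactly as long as it is needed.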
📚 Documentation
Training
Fine-tuning was done using the Hugging Face Transformers library with a modified [seq2seq script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition#sequence-to-sequence).
| Property | Details |
|---|---|
| Training data | Latvian Common Voice 19.0 train set (the [VW split](https://analyzer.cv-toolbox.web.tr)) and LATE-Media 2.0 train set |
| Total training hours | 282.4 |

| Training data | Hours |
|---|---|
| Latvian Common Voice 19.0 train set (the [VW split](https://analyzer.cv-toolbox.web.tr)) | 212.6 |
| LATE-Media 2.0 train set | 69.8 |
| Total | 282.4 |
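As a quick consistency check on the figures above, the per-dataset hours can be summed and turned into shares of the training mix. This is only an arithmetic sketch; the dictionary keys are shorthand labels, not official dataset identifiers.

```python
# Training hours per dataset, as reported in the table above.
hours = {
    "Common Voice 19.0 train (VW split)": 212.6,
    "LATE-Media 2.0 train": 69.8,
}

# Round to one decimal to avoid floating-point noise in the sum.
total = round(sum(hours.values()), 1)

# Percentage of the training mix contributed by each dataset.
shares = {name: round(100 * h / total, 1) for name, h in hours.items()}

print(total)   # 282.4
print(shares)  # Common Voice: 75.3, LATE-Media: 24.7
```

Roughly three quarters of the training audio therefore comes from Common Voice and one quarter from broadcast speech.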
Evaluation
| Property | Details |
|---|---|
| Testing data | Latvian Common Voice 19.0 test set (VW) and LATE-Media 1.0 test set |
| Evaluation metrics | Word Error Rate (WER) and Character Error Rate (CER) |

| Testing data | WER (%) | CER (%) |
|---|---|---|
| Latvian Common Voice 19.0 test set (VW), formatted | 4.8 | 1.6 |
| Latvian Common Voice 19.0 test set (VW), normalized | 3.2 | 1.0 |
| LATE-Media 1.0 test set, formatted | 19.2 | 7.6 |
| LATE-Media 1.0 test set, normalized | 12.8 | 5.3 |
The Latvian CV 19.0 test set is available [here](https://analyzer.cv-toolbox.web.tr).
The LATE-Media 1.0 test set is available here.
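To make the two metrics and the formatted/normalized distinction concrete, here is a minimal pure-Python sketch of WER and CER based on Levenshtein distance. It is not the authors' evaluation script. In particular, the `normalize` function is an assumption: it lowercases and strips punctuation, which may differ from the normalization actually used for the reported scores.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word Error Rate in percent, relative to the reference length."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character Error Rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)


def normalize(text):
    """Assumed normalization: lowercase, drop punctuation, collapse spaces."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


reference = "Labdien, kā jums klājas?"
hypothesis = "labdien kā jums klājas"

print(wer(reference, hypothesis))                        # 50.0 (formatted)
print(wer(normalize(reference), normalize(hypothesis)))  # 0.0 (normalized)
```

The example shows why the formatted scores in the table are higher than the normalized ones: casing and punctuation mismatches count as errors only when the text is compared as written.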
Citation
Please cite this paper if you use this model in your research:
@inproceedings{dargis-etal-2024-balsutalka-lv,
author = {Dargis, Roberts and Znotins, Arturs and Auzina, Ilze and Saulite, Baiba and Reinsone, Sanita and Dejus, Raivis and Klavinska, Antra and Gruzitis, Normunds},
title = {{BalsuTalka.lv - Boosting the Common Voice Corpus for Low-Resource Languages}},
booktitle = {Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)},
publisher = {ELRA and ICCL},
year = {2024},
pages = {2080--2085},
url = {https://aclanthology.org/2024.lrec-main.187}
}
Acknowledgements
This work was supported by the EU Recovery and Resilience Facility project Language Technology Initiative (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project [LATE](https://www.digitalhumanities.lv/projekti/vpp-late/) (VPP-LETONIKA-2021/1-0006).
📄 License
This project is licensed under the Apache 2.0 license.