XLS-R-300M-ET: an open-source Estonian speech recognition model fine-tuned on roughly 800 hours of data

XLS-R-300M-ET

Developed by TalTechNLP

An Estonian automatic speech recognition model fine-tuned from facebook/wav2vec2-xls-r-300m on approximately 800 hours of diverse data

Speech Recognition

Transformers

Other #Estonian speech recognition #Broadcast speech optimization #Low CER

Downloads 58

Release Time: 3/2/2022

Model Overview

This is a general-purpose Estonian ASR model, primarily used for speech recognition in scenarios such as broadcast dialogues, interviews, and lectures

Model Features

Diverse training data

Trained with approximately 800 hours of diverse Estonian data, including broadcast speech, spontaneous speech, elderly speech, and various other types

Excellent performance

Achieves a WER of 12.5% and a CER of 2.7% on the Common Voice 6.1 test set, and a WER of 13.4% and a CER of 3.0% on the Common Voice 8.0 test set

Estonian-focused optimization

Fine-tuned specifically on Estonian speech, starting from the multilingual XLS-R-300M checkpoint

Model Capabilities

Estonian speech recognition

Broadcast speech transcription

Lecture content transcription

Use Cases

Media content processing

Broadcast program transcription

Transcribing broadcast dialogues, interviews, and other content into text

WER 6.1% on jutusaated.testset and 7.9% on jutusaated.devset

Educational applications

Lecture content recording

Automatically transcribing lectures and speeches into text

🚀 XLS-R-300m-ET

This is an XLS-R-300M model fine-tuned on around 800 hours of diverse Estonian data, aiming to provide high-quality automatic speech recognition for Estonian.

🚀 Quick Start

TODO
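
No official usage snippet is provided here. As a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub under the ID `TalTechNLP/xls-r-300m-et` (inferred from the developer and model name, not stated on this page) and that it loads with the generic Transformers ASR pipeline, transcription could look like this:

```python
# Assumed Hub ID, inferred from the developer and model name on this page.
MODEL_ID = "TalTechNLP/xls-r-300m-et"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with the generic Transformers ASR pipeline."""
    # Imported lazily so the heavy dependency is only needed when transcribing.
    from transformers import pipeline

    # The pipeline decodes the file (this needs ffmpeg) and resamples it to the
    # sampling rate expected by the model's feature extractor.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]
```

Calling `transcribe("recording.wav")`, where `recording.wav` is a hypothetical local file, would download the weights on first use and return the recognized text as a plain string.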

✨ Features

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
It consists only of the CTC-based end-to-end model; no language model is currently provided.
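
Without a language model, the output of a CTC model is typically decoded greedily: pick the best token per frame, collapse consecutive repeats, then drop the blank token. The sketch below illustrates that general technique on a toy example. The vocabulary, the blank index, and the helper names are hypothetical and are not taken from this model's actual configuration.

```python
from itertools import groupby
from typing import Mapping, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    id_to_char: Mapping[int, str],
    blank_id: int = 0,
) -> str:
    """Greedy (best-path) CTC decoding without a language model."""
    # 1. Pick the highest-scoring token for every frame.
    best_ids = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # 2. Collapse runs of the same token into a single token.
    collapsed = [token for token, _ in groupby(best_ids)]
    # 3. Drop blanks and map the remaining ids to characters.
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank.
VOCAB = {0: "<blank>", 1: "t", 2: "e", 3: "r"}


def one_hot(index: int, size: int = len(VOCAB)) -> list:
    """Build a fake per-frame score vector that peaks at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


frames = [one_hot(i) for i in [1, 1, 0, 2, 2, 3, 0, 2, 0]]
print(ctc_greedy_decode(frames, VOCAB))  # prints "tere"
```

Because each frame is decided independently, this decoder cannot use word-level context, which is one reason domain-specific terms can be harder to recognize without an external language model.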

📚 Documentation

Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.

Limitations and bias

Since this model was trained mostly on broadcast speech and texts from the web, it might have problems correctly decoding the following:

Speech containing technical and other domain-specific terms
Children's speech
Non-native speech
Speech recorded under very noisy conditions or with a microphone far from the speaker
Very spontaneous and overlapping speech

Training data

Acoustic training data	Amount
Broadcast speech	591 h
Spontaneous speech	53 h
Elderly speech corpus	53 h
Talks, lectures	49 h
Parliament speeches	31 h
Total	761 h

Training procedure

Finetuned using Fairseq.

Evaluation results

WER

Dataset	WER (%)
jutusaated.devset	7.9
jutusaated.testset	6.1
Common Voice 6.1	12.5
Common Voice 8.0	13.4
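
For reference, WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The following is a minimal sketch of that standard definition. It is not the authors' scoring script, the text normalization they applied is not described here, and the example sentences are made up.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent (spaces are counted as characters here)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one word is missing its final letter.
reference = "tere tulemast tallinna"
hypothesis = "tere tulemas tallinna"
print(round(wer(reference, hypothesis), 1))  # prints 33.3
print(round(cer(reference, hypothesis), 1))  # prints 4.5
```

A single wrong character invalidates a whole word for WER but only one character for CER, which is why the reported CER values are much lower than the WER values on the same test sets.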

Model Index

Property	Details
Model Name	xls-r-300m-et
Task Name	Automatic Speech Recognition
Task Type	automatic-speech-recognition
Dataset Name 1	Common Voice
Dataset Type 1	common_voice
Dataset Args 1	et
Metrics 1 - Test WER	12.520395591222402
Metrics 1 - Test CER	2.7091152438624897
Dataset Name 2	Common Voice 8
Dataset Type 2	mozilla-foundation/common_voice_8_0
Dataset Args 2	et
Metrics 2 - Test WER	13.38447882323104
Metrics 2 - Test CER	2.9816686199500255

📄 License

This model is licensed under CC-BY-4.0.
