Wav2vec2 Large Xlsr Estonian
This is an Estonian automatic speech recognition (ASR) model fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, trained using the Common Voice dataset.
Downloads 26
Release Time : 3/2/2022
Model Overview
This model is specifically designed for Estonian speech recognition tasks, capable of converting Estonian audio into text.
Model Features
High-precision speech recognition
Achieves a WER (Word Error Rate) of 33.93% on the Common Voice Estonian test set
Based on XLSR pre-trained model
Uses facebook/wav2vec2-large-xlsr-53 as the base model for fine-tuning
16kHz audio support
The model processes audio input with a 16kHz sampling rate
Model Capabilities
Estonian audio-to-text conversion
Automatic speech recognition
Use Cases
Speech transcription
Speech-to-text service
Converts Estonian speech content into editable text
Word Error Rate 33.93%
Voice assistants
Estonian voice command recognition
Used for building voice assistant systems that support Estonian
🚀 Wav2Vec2-Large-XLSR-53-Estonian
This project fine-tunes the facebook/wav2vec2-large-xlsr-53 model in Estonian using Common Voice. Ensure your speech input is sampled at 16kHz when using this model.
🚀 Quick Start
The model can be used directly (without a language model) as follows:
💻 Usage Examples
Basic Usage
# requirement packages
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
!pip install jiwer
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset
import numpy as np
import re
import string
import IPython.display as ipd
chars_to_ignore = [
",", "?", ".", "!", "-", ";", ":", '""', "%", "'", '"', "�",
"#", "!", "?", "«", "»", "(", ")", "؛", ",", "?", ".", "!", "-", ";", ":", '"',
"“", "%", "‘", "�", "–", "…", "_", "”", '“', '„'
]
chars_to_mapping = {
"\u200c": " ", "\u200d": " ", "\u200e": " ", "\u200f": " ", "\ufeff": " ",
}
def multiple_replace(text, chars_to_mapping):
pattern = "|".join(map(re.escape, chars_to_mapping.keys()))
return re.sub(pattern, lambda m: chars_to_mapping[m.group()], str(text))
def remove_special_characters(text, chars_to_ignore_regex):
text = re.sub(chars_to_ignore_regex, '', text).lower() + " "
return text
def normalizer(batch, chars_to_ignore, chars_to_mapping):
chars_to_ignore_regex = f"""[{"".join(chars_to_ignore)}]"""
text = batch["sentence"].lower().strip()
text = text.replace("\u0307", " ").strip()
text = multiple_replace(text, chars_to_mapping)
text = remove_special_characters(text, chars_to_ignore_regex)
batch["sentence"] = text
return batch
def speech_file_to_array_fn(batch):
speech_array, sampling_rate = torchaudio.load(batch["path"])
speech_array = speech_array.squeeze().numpy()
speech_array = librosa.resample(np.asarray(speech_array), sampling_rate, 16_000)
batch["speech"] = speech_array
return batch
def predict(batch):
features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
input_values = features.input_values.to(device)
attention_mask = features.attention_mask.to(device)
with torch.no_grad():
logits = model(input_values, attention_mask=attention_mask).logits
pred_ids = torch.argmax(logits, dim=-1)
batch["predicted"] = processor.batch_decode(pred_ids)[0]
return batch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-estonian")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-estonian").to(device)
dataset = load_dataset("common_voice", "et", split="test[:1%]")
dataset = dataset.map(
normalizer,
fn_kwargs={"chars_to_ignore": chars_to_ignore, "chars_to_mapping": chars_to_mapping},
remove_columns=list(set(dataset.column_names) - set(['sentence', 'path']))
)
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)
max_items = np.random.randint(0, len(result), 10).tolist()
for i in max_items:
reference, predicted = result["sentence"][i], result["predicted"][i]
print("reference:", reference)
print("predicted:", predicted)
print('---')
Output:
reference: õhulossid lagunevad ning ees ootab maapind
predicted: õhulassid lagunevad ning ees ootab maapind
---
reference: milliseks kiievisse pääsemise nimel võistlev muusik soome muusikamaastiku hetkeseisu hindab ning kas ta ka ennast sellel tulevikus tegutsemas näeb kuuled videost
predicted: milliseks gievisse pääsemise nimel võitlev muusiks soome muusikama aastiku hetke seisu hindab ning kas ta ennast selle tulevikus tegutsemast näeb kuulad videost
---
reference: näiteks kui pool seina on tehtud tekib tunne et tahaks tegelikult natuke teistsugust ja hakkame otsast peale
predicted: näiteks kui pool seine on tehtud tekib tunnetahaks tegelikult matuka teistsugust jahappanna otsast peane
---
reference: neuroesteetilised katsed näitavad et just nägude vaatlemine aktiveerib inimese aju esteetilist keskust
predicted: neuroaisteetiliselt katsed näitaval et just nägude vaatlemine aptiveerid inimese aju est eedilist keskust
---
reference: paljud inimesed kindlasti kadestavad teid kuid ei julge samamoodi vabalt võtta
predicted: paljud inimesed kindlasti kadestavadteid kuid ei julge sama moodi vabalt võtta
---
reference: parem on otsida pileteid inkognito veebi kaudu
predicted: parem on otsida pileteid ning kognitu veebikaudu
---
reference: ja vot siin ma jäin vaikseks
predicted: ja vat siisma ja invaikseks
---
reference: mida sa iseendale juubeli puhul soovid
predicted: mida saise endale jubeli puhul soovid
---
reference: kuumuse ja kõrge temperatuuri tõttu kuivas tühjadel karjamaadel rohi mis muutus kergesti süttivaks
predicted: kuumuse ja kõrge temperatuuri tõttu kuivast ühjadal karjamaadel rohi mis muutus kergesti süttivaks
---
reference: ilmselt on inimesi kelle jaoks on see hea lahendus
predicted: ilmselt on inimesi kelle jaoks on see hea lahendus
---
Evaluation
The model can be evaluated as follows on the Estonian test data of Common Voice.
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset, load_metric
import numpy as np
import re
import string
chars_to_ignore = [
",", "?", ".", "!", "-", ";", ":", '""', "%", "'", '"', "�",
"#", "!", "?", "«", "»", "(", ")", "؛", ",", "?", ".", "!", "-", ";", ":", '"',
"“", "%", "‘", "�", "–", "…", "_", "”", '“', '„'
]
chars_to_mapping = {
"\u200c": " ", "\u200d": " ", "\u200e": " ", "\u200f": " ", "\ufeff": " ",
}
def multiple_replace(text, chars_to_mapping):
pattern = "|".join(map(re.escape, chars_to_mapping.keys()))
return re.sub(pattern, lambda m: chars_to_mapping[m.group()], str(text))
def remove_special_characters(text, chars_to_ignore_regex):
text = re.sub(chars_to_ignore_regex, '', text).lower() + " "
return text
def normalizer(batch, chars_to_ignore, chars_to_mapping):
chars_to_ignore_regex = f"""[{"".join(chars_to_ignore)}]"""
text = batch["sentence"].lower().strip()
text = text.replace("\u0307", " ").strip()
text = multiple_replace(text, chars_to_mapping)
text = remove_special_characters(text, chars_to_ignore_regex)
batch["sentence"] = text
return batch
def speech_file_to_array_fn(batch):
speech_array, sampling_rate = torchaudio.load(batch["path"])
speech_array = speech_array.squeeze().numpy()
speech_array = librosa.resample(np.asarray(speech_array), sampling_rate, 16_000)
batch["speech"] = speech_array
return batch
def predict(batch):
features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
input_values = features.input_values.to(device)
attention_mask = features.attention_mask.to(device)
with torch.no_grad():
logits = model(input_values, attention_mask=attention_mask).logits
pred_ids = torch.argmax(logits, dim=-1)
batch["predicted"] = processor.batch_decode(pred_ids)[0]
return batch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-estonian")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-estonian").to(device)
dataset = load_dataset("common_voice", "et", split="test")
dataset = dataset.map(
normalizer,
fn_kwargs={"chars_to_ignore": chars_to_ignore, "chars_to_mapping": chars_to_mapping},
remove_columns=list(set(dataset.column_names) - set(['sentence', 'path']))
)
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)
wer = load_metric("wer")
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["predicted"], references=result["sentence"])))
Test Result:
- WER: 33.93%
📚 Documentation
The Common Voice train
, validation
datasets were used for training.
You can see the training states here
The script used for training can be found here
📄 License
This project is licensed under the Apache-2.0 license.
Voice Activity Detection
MIT
Voice activity detection model based on pyannote.audio 2.1, used to identify speech activity segments in audio
Speech Recognition
V
pyannote
7.7M
181
Wav2vec2 Large Xlsr 53 Portuguese
Apache-2.0
This is a fine-tuned XLSR-53 large model for Portuguese speech recognition tasks, trained on the Common Voice 6.1 dataset, supporting Portuguese speech-to-text conversion.
Speech Recognition Other
W
jonatasgrosman
4.9M
32
Whisper Large V3
Apache-2.0
Whisper is an advanced automatic speech recognition (ASR) and speech translation model proposed by OpenAI, trained on over 5 million hours of labeled data, with strong cross-dataset and cross-domain generalization capabilities.
Speech Recognition Supports Multiple Languages
W
openai
4.6M
4,321
Whisper Large V3 Turbo
MIT
Whisper is a state-of-the-art automatic speech recognition (ASR) and speech translation model developed by OpenAI, trained on over 5 million hours of labeled data, demonstrating strong generalization capabilities in zero-shot settings.
Speech Recognition
Transformers Supports Multiple Languages

W
openai
4.0M
2,317
Wav2vec2 Large Xlsr 53 Russian
Apache-2.0
A Russian speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampled audio input
Speech Recognition Other
W
jonatasgrosman
3.9M
54
Wav2vec2 Large Xlsr 53 Chinese Zh Cn
Apache-2.0
A Chinese speech recognition model fine-tuned based on facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampling rate audio input.
Speech Recognition Chinese
W
jonatasgrosman
3.8M
110
Wav2vec2 Large Xlsr 53 Dutch
Apache-2.0
A Dutch speech recognition model fine-tuned based on facebook/wav2vec2-large-xlsr-53, trained on the Common Voice and CSS10 datasets, supporting 16kHz audio input.
Speech Recognition Other
W
jonatasgrosman
3.0M
12
Wav2vec2 Large Xlsr 53 Japanese
Apache-2.0
Japanese speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampling rate audio input
Speech Recognition Japanese
W
jonatasgrosman
2.9M
33
Mms 300m 1130 Forced Aligner
A text-to-audio forced alignment tool based on Hugging Face pre-trained models, supporting multiple languages with high memory efficiency
Speech Recognition
Transformers Supports Multiple Languages

M
MahmoudAshraf
2.5M
50
Wav2vec2 Large Xlsr 53 Arabic
Apache-2.0
Arabic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, trained on Common Voice and Arabic speech corpus
Speech Recognition Arabic
W
jonatasgrosman
2.3M
37
Featured Recommended AI Models