🚀 Fine-tuned XLSR-53 large model for speech recognition in Finnish
This is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Finnish speech recognition.
🚀 Quick Start
This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Finnish using the train and validation splits of Common Voice 6.1 and CSS10. When using this model, make sure that your speech input is sampled at 16 kHz.
This model was fine-tuned thanks to the GPU credits generously provided by OVHcloud :)
The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
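The 16 kHz requirement mentioned above can be checked before transcription. The following is a minimal sketch, assuming the input is an uncompressed PCM WAV file; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model or its libraries.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sampling rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

A file that needs resampling can be loaded at the target rate with `librosa.load(path, sr=16_000)`, which is what the scripts in this document do.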
✨ Features
- Language Support: Specifically fine-tuned for Finnish speech recognition.
- Data Sources: Trained on data from Common Voice 6.1 and CSS10.
- GPU Credit: Thanks to OVHcloud for providing GPU credits for fine-tuning.
📦 Installation
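No specific installation steps are provided. The usage examples below depend on the Python packages they import: huggingsound for the basic example, and torch, librosa, datasets, and transformers for the custom scripts.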
💻 Usage Examples
Basic Usage
Using the HuggingSound library:
from huggingsound import SpeechRecognitionModel
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-finnish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
Advanced Usage
Writing your own inference script:
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
LANG_ID = "fi"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-finnish"
SAMPLES = 5
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
Here is a comparison table of reference and prediction results:
Reference | Prediction |
---|---|
MYSTEERIMIES OLI OPPINUT MORAALINSA TARUISTA, ELOKUVISTA JA PELEISTÄ. | MYSTEERIMIES OLI OPPINUT MORALINSA TARUISTA ELOKUVISTA JA PELEISTÄ |
ÄÄNESTIN MIETINNÖN PUOLESTA! | ÄÄNESTIN MIETINNÖN PUOLESTA |
VAIN TUNTIA AIKAISEMMIN OLIMME MIEHENI KANSSA TUNTENEET SUURINTA ILOA. | PAIN TUNTIA AIKAISEMMIN OLIN MIEHENI KANSSA TUNTENEET SUURINTA ILAA |
ENSIMMÄISELLE MIEHELLE SAI KOLME LASTA. | ENSIMMÄISELLE MIEHELLE SAI KOLME LASTA |
ÄÄNESTIN MIETINNÖN PUOLESTA, SILLÄ POHJIMMILTAAN SIINÄ VASTUSTETAAN TÄTÄ SUUNTAUSTA. | ÄÄNESTIN MIETINNÖN PUOLESTA SILLÄ POHJIMMILTAAN SIINÄ VASTOTTETAAN TÄTÄ SUUNTAUSTA |
TÄHDENLENTOJENKO VARALTA MINÄ SEN OLISIN TÄNNE KUSKANNUT? | TÄHDEN LENTOJENKO VARALTA MINÄ SEN OLISIN TÄNNE KUSKANNUT |
SIITÄ SE TULEE. | SIITA SE TULEE |
NIIN, KUULUU KIROUS, JA KAUHEA KARJAISU. | NIIN KUULUU KIROUS JA KAUHEA KARJAISU |
ARKIT KUN OVAT NÄES ELEMENTTIRAKENTEISIA. | ARKIT KUN OVAT MÄISS' ELÄMÄTTEROKENTEISIÄ |
JÄIN ALUKSEN SISÄÄN, MUTTA KUULIN OVEN LÄPI, ETTÄ ULKOPUOLELLA ALKOI TAPAHTUA. | JAKALOKSEHÄN SISÄL MUTTA KUULIN OVENLAPI ETTÄ ULKA KUOLLALLA ALKOI TAPAHTUA |
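The `torch.argmax` followed by `processor.batch_decode` step in the script above amounts to greedy CTC decoding: pick the most likely token per audio frame, collapse consecutive repeats, drop blank tokens, and turn word delimiters into spaces. The sketch below illustrates that idea on a hand-written frame sequence. The token names `<pad>` and `|` and the function `greedy_ctc_decode` are assumptions for illustration, not the library's actual implementation.

```python
from itertools import groupby

# Hypothetical token conventions, for illustration only.
BLANK = "<pad>"        # CTC blank token
WORD_DELIMITER = "|"   # token standing for a space between words


def greedy_ctc_decode(frame_tokens):
    """Collapse repeated frame-level tokens, drop blanks, map delimiters to spaces."""
    collapsed = [token for token, _ in groupby(frame_tokens)]
    kept = [token for token in collapsed if token != BLANK]
    text = "".join(" " if token == WORD_DELIMITER else token for token in kept)
    return text.strip()


# One token per audio frame, as if produced by argmax over the logits.
frames = ["S", "S", "<pad>", "E", "|", "|", "T", "U", "U", "<pad>",
          "L", "L", "E", "<pad>", "E", "E"]
print(greedy_ctc_decode(frames))  # prints "SE TULEE"
```

The blank between the two final `E` frames is what lets the decoder keep a genuine double letter instead of collapsing it.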
📚 Documentation
Evaluation
The model can be evaluated on the Finnish test data of Common Voice as follows:
import torch
import re
import warnings
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
LANG_ID = "fi"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-finnish"
DEVICE = "cuda"
CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
"؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
"{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
"、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
"『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]
test_dataset = load_dataset("common_voice", LANG_ID, split="test")
wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py
chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)
# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference on each batch and decoding the predicted token ids into strings
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]
print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
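The evaluation script strips punctuation from the reference sentences and uppercases them before scoring, so that punctuation the model never emits is not counted as an error. The sketch below shows that normalization step in isolation. `SAMPLE_CHARS_TO_IGNORE` is a shortened, illustrative subset of the script's `CHARS_TO_IGNORE` list, and `normalize_reference` is a hypothetical helper name.

```python
import re

# Shortened, illustrative subset of CHARS_TO_IGNORE from the evaluation script.
SAMPLE_CHARS_TO_IGNORE = [",", "?", ".", "!", ";", ":", '"']

# Build a character class that matches any single ignored character.
sample_regex = f"[{re.escape(''.join(SAMPLE_CHARS_TO_IGNORE))}]"


def normalize_reference(sentence):
    """Remove ignored punctuation and uppercase the sentence."""
    return re.sub(sample_regex, "", sentence).upper()


print(normalize_reference("Niin, kuuluu kirous, ja kauhea karjaisu."))
# prints "NIIN KUULUU KIROUS JA KAUHEA KARJAISU"
```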
Test Result: The table below reports the Word Error Rate (WER) and the Character Error Rate (CER) of this model. The evaluation script described above was also run on other models on 2021-04-21. Note that the figures below may differ from those reported elsewhere, possibly because of differences in the other evaluation scripts used.
Model | WER | CER |
---|---|---|
aapot/wav2vec2-large-xlsr-53-finnish | 32.51% | 5.34% |
Tommi/wav2vec2-large-xlsr-53-finnish | 35.22% | 5.81% |
vasilis/wav2vec2-large-xlsr-53-finnish | 38.24% | 6.49% |
jonatasgrosman/wav2vec2-large-xlsr-53-finnish | 41.60% | 8.23% |
birgermoell/wav2vec2-large-xlsr-finnish | 53.51% | 9.18% |
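For readers unfamiliar with the two metrics in the table above, both are edit distances normalized by reference length: WER counts word-level substitutions, insertions, and deletions, and CER does the same over characters. The following is a minimal pure-Python sketch of that definition. It is not the `wer.py` or `cer.py` used for the reported numbers, and details such as counting spaces as characters are assumptions of this sketch.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(references, predictions, unit):
    """Total edit distance divided by total reference length over all pairs."""
    split = str.split if unit == "word" else list
    errors = sum(edit_distance(split(r), split(p))
                 for r, p in zip(references, predictions))
    total = sum(len(split(r)) for r in references)
    return errors / total


# Example pair taken from the reference/prediction table in this document.
refs = ["SIITÄ SE TULEE"]
preds = ["SIITA SE TULEE"]
print(f"WER: {error_rate(refs, preds, 'word'):.2%}")
print(f"CER: {error_rate(refs, preds, 'char'):.2%}")
```

One wrong character out of three words makes a whole word count as an error, which is why WER is typically much higher than CER for the same model.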
📄 License
This model is licensed under the Apache-2.0 license.
📚 Citation
If you want to cite this model, you can use the following BibTeX entry:
@misc{grosman2021xlsr53-large-finnish,
title={Fine-tuned {XLSR}-53 large model for speech recognition in {F}innish},
author={Grosman, Jonatas},
howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-finnish}},
year={2021}
}
📋 Metadata
Property | Details |
---|---|
Language | Finnish |
Datasets | common_voice |
Metrics | wer, cer |
Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
Model Name | XLSR Wav2Vec2 Finnish by Jonatas Grosman |
License | apache-2.0 |

