wav2vec2-large-xlsr-53-finnish open-source Finnish speech recognition model

Wav2vec2 Large Xlsr 53 Finnish

Developed by jonatasgrosman

Finnish speech recognition model fine-tuned based on XLSR-53 large model, supports 16kHz audio input

Speech Recognition OtherOpen Source License:Apache-2.0 #Finnish speech recognition #XLSR-53 fine-tuning #Multi-source data training

Downloads 73.11k

Release Time : 3/2/2022

Model Overview

Automatic speech recognition model optimized for Finnish, trained on Common Voice and CSS10 datasets

Model Features

Finnish Optimization

Specially fine-tuned for Finnish speech characteristics

XLSR-53 Architecture

Based on powerful cross-lingual speech representation model architecture

Multi-dataset Training

Joint training using Common Voice 6.1 and CSS10 datasets

Model Capabilities

Finnish speech-to-text

16kHz audio processing

Use Cases

Speech Transcription

Finnish Speech Transcription

Convert Finnish speech to text

Test WER 41.60%, CER 8.23%

🚀 Fine-tuned XLSR-53 large model for speech recognition in Finnish

This is a fine - tuned model based on facebook/wav2vec2-large-xlsr-53 for Finnish speech recognition, offering a practical solution for related tasks.

🚀 Quick Start

The fine - tuned facebook/wav2vec2-large-xlsr-53 model is trained on Finnish using the train and validation splits of Common Voice 6.1 and CSS10. When using this model, ensure that your speech input is sampled at 16kHz.

This model has been fine - tuned thanks to the GPU credits generously given by the OVHcloud :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

✨ Features

Language Support: Specifically fine - tuned for Finnish speech recognition.
Data Sources: Utilizes data from Common Voice 6.1 and CSS10 for training.
GPU Credit: Thanks to OVHcloud for providing GPU credits for fine - tuning.

📦 Installation

No specific installation steps are provided in the original document.

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-finnish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fi"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-finnish"
SAMPLES = 5

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Here is a comparison table of reference and prediction results:

Reference	Prediction
MYSTEERIMIES OLI OPPINUT MORAALINSA TARUISTA, ELOKUVISTA JA PELEISTÄ.	MYSTEERIMIES OLI OPPINUT MORALINSA TARUISTA ELOKUVISTA JA PELEISTÄ
ÄÄNESTIN MIETINNÖN PUOLESTA!	ÄÄNESTIN MIETINNÖN PUOLESTA
VAIN TUNTIA AIKAISEMMIN OLIMME MIEHENI KANSSA TUNTENEET SUURINTA ILOA.	PAIN TUNTIA AIKAISEMMIN OLIN MIEHENI KANSSA TUNTENEET SUURINTA ILAA
ENSIMMÄISELLE MIEHELLE SAI KOLME LASTA.	ENSIMMÄISELLE MIEHELLE SAI KOLME LASTA
ÄÄNESTIN MIETINNÖN PUOLESTA, SILLÄ POHJIMMILTAAN SIINÄ VASTUSTETAAN TÄTÄ SUUNTAUSTA.	ÄÄNESTIN MIETINNÖN PUOLESTA SILLÄ POHJIMMILTAAN SIINÄ VASTOTTETAAN TÄTÄ SUUNTAUSTA
TÄHDENLENTOJENKO VARALTA MINÄ SEN OLISIN TÄNNE KUSKANNUT?	TÄHDEN LENTOJENKO VARALTA MINÄ SEN OLISIN TÄNNE KUSKANNUT
SIITÄ SE TULEE.	SIITA SE TULEE
NIIN, KUULUU KIROUS, JA KAUHEA KARJAISU.	NIIN KUULUU KIROUS JA KAUHEA KARJAISU
ARKIT KUN OVAT NÄES ELEMENTTIRAKENTEISIA.	ARKIT KUN OVAT MÄISS' ELÄMÄTTEROKENTEISIÄ
JÄIN ALUKSEN SISÄÄN, MUTTA KUULIN OVEN LÄPI, ETTÄ ULKOPUOLELLA ALKOI TAPAHTUA.	JAKALOKSEHÄN SISÄL MUTTA KUULIN OVENLAPI ETTÄ ULKA KUOLLALLA ALKOI TAPAHTUA

📚 Documentation

Evaluation

The model can be evaluated on the Finnish test data of Common Voice as follows:

import torch
import re
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fi"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-finnish"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")

Test Result: In the table below, the Word Error Rate (WER) and the Character Error Rate (CER) of the model are reported. The evaluation script described above was also run on other models on 2021 - 04 - 21. Note that the table below may show different results from those already reported, which may be caused by some specificity of the other evaluation scripts used.

Model	WER	CER
aapot/wav2vec2-large-xlsr-53-finnish	32.51%	5.34%
Tommi/wav2vec2-large-xlsr-53-finnish	35.22%	5.81%
vasilis/wav2vec2-large-xlsr-53-finnish	38.24%	6.49%
jonatasgrosman/wav2vec2-large-xlsr-53-finnish	41.60%	8.23%
birgermoell/wav2vec2-large-xlsr-finnish	53.51%	9.18%

📄 License

This model is licensed under the apache - 2.0 license.

📚 Citation

If you want to cite this model, you can use the following BibTeX entry:

@misc{grosman2021xlsr53-large-finnish,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {F}innish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-finnish}},
  year={2021}
}

📋 Metadata

Property	Details
Language	Finnish
Datasets	common_voice
Metrics	wer, cer
Tags	audio, automatic - speech - recognition, speech, xlsr - fine - tuning - week
Model Name	XLSR Wav2Vec2 Finnish by Jonatas Grosman
License	apache - 2.0

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご