Wav2vec2-large-xlsr-53-finnish: Open-Source Model for Finnish Automatic Speech Recognition on 16kHz Audio

Wav2vec2 Large Xlsr 53 Finnish

Developed by vasilis

A Finnish automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampled audio input

Speech Recognition

Transformers

Open Source License: Apache-2.0
#Finnish speech recognition #Low character error rate #Multi-dataset fine-tuning

Downloads 27

Release Time: 3/2/2022

Model Overview

This model is a Finnish automatic speech recognition (ASR) model based on the Wav2Vec2 architecture, fine-tuned using the Common Voice and CSS10 Finnish datasets, and can be directly used for speech-to-text tasks

Model Features

Multi-dataset fine-tuning

Trained simultaneously using the Common Voice and CSS10 Finnish datasets to improve model adaptability

No language model required

Can be used directly without additional language model support

16kHz sampling rate support

Expects audio input sampled at 16kHz; audio recorded at other rates should be resampled before inference

Model Capabilities

Finnish speech recognition

Speech-to-text

Automatic speech transcription

Use Cases

Speech transcription

Finnish speech-to-text

Convert Finnish speech content into text format

Test WER 38.34%, CER 6.55%

Voice assistant

Finnish voice command recognition

Used for voice command recognition in Finnish voice assistants or smart home systems

🚀 Wav2Vec2-Large-XLSR-53-finnish

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Finnish, trained on the Common Voice dataset and the CSS10 Finnish: Single Speaker Speech Dataset. Ensure that your speech input is sampled at 16kHz when using this model.

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.

✨ Features

Fine-Tuned Model: Based on facebook/wav2vec2-large-xlsr-53, fine-tuned on Finnish datasets.
Multiple Datasets: Uses Common Voice and the CSS10 Finnish: Single Speaker Speech Dataset for training.
Metrics: Evaluated using WER (Word Error Rate) and CER (Character Error Rate).

📦 Installation

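The original model card lists no installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so these packages need to be installed first; the WER metric in the evaluation script typically also depends on jiwer.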

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "fi", split="test[:2%]")  # "fi" selects the Finnish subset

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-finnish")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-finnish")

# Common Voice clips are sampled at 48kHz; the model expects 16kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
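The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below reproduces that logic in plain Python on a toy vocabulary. The vocabulary, the blank id of 0, and the name `greedy_ctc_decode` are illustrative assumptions, not the model's actual tokenizer configuration.

```python
from itertools import groupby
from typing import Dict, List

BLANK_ID = 0  # assumed id of the CTC blank token in this toy vocabulary


def greedy_ctc_decode(frame_ids: List[int], id_to_char: Dict[int, str]) -> str:
    """Collapse per-frame argmax ids into text using the CTC rules."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks, which only separate genuinely repeated characters.
    kept = [token_id for token_id in collapsed if token_id != BLANK_ID]
    # 3. Map ids to characters; "|" stands for a word boundary here.
    return "".join(id_to_char[i] for i in kept).replace("|", " ").strip()


toy_vocab = {1: "|", 2: "h", 3: "e", 4: "i", 5: "k", 6: "a"}
# The blank between the two 6s preserves the double "a".
frames = [2, 2, 0, 3, 4, 4, 1, 5, 6, 0, 6, 6]
print(greedy_ctc_decode(frames, toy_vocab))  # hei kaa
```

Because no language model rescoring is involved, each frame is decided independently, which is consistent with the card's note that the model runs without an external language model.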

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "fi", split="test")  # full Finnish test split
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-finnish")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-finnish")
model.to("cuda")

chars_to_ignore_regex = r"[\,\?\.\!\-\;\:\"\“\%\‘\”\�\']"  # punctuation stripped from the reference transcripts before scoring
replacements = {"…": "", "–": ''}

resampler = {
    48_000: torchaudio.transforms.Resample(48_000, 16_000),
    44100: torchaudio.transforms.Resample(44100, 16_000),
    32000: torchaudio.transforms.Resample(32000, 16_000)
}


# Preprocessing the datasets.
# Normalize the reference transcript and read the audio files as 16kHz arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    for key, value in replacements.items():
        batch["sentence"] = batch["sentence"].replace(key, value)
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler[sampling_rate](speech_array).squeeze().numpy()
    return batch


test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# WER over the whole test set, as a percentage.
print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
# CER reuses the WER metric: each character is turned into a space-separated "word".
print("CER: {:2f}".format(100 * wer.compute(predictions=[" ".join(list(entry)) for entry in result["pred_strings"]], references=[" ".join(list(entry)) for entry in result["sentence"]])))
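The transcript cleanup inside `speech_file_to_array_fn` can be checked in isolation. The sketch below pulls the same steps (strip punctuation, lowercase, apply the replacement table) into a standalone function. `normalize_sentence` is a hypothetical name, and the character class is a reduced version of `chars_to_ignore_regex` that leaves out the undecodable character from the original.

```python
import re

# Reduced punctuation class, adapted from chars_to_ignore_regex above.
CHARS_TO_IGNORE = re.compile(r"[,?.!\-;:\"“%‘”']")
REPLACEMENTS = {"…": "", "–": ""}


def normalize_sentence(sentence: str) -> str:
    """Strip punctuation, lowercase, then apply literal replacements."""
    cleaned = CHARS_TO_IGNORE.sub("", sentence).lower()
    for old, new in REPLACEMENTS.items():
        cleaned = cleaned.replace(old, new)
    return cleaned


print(normalize_sentence("Hyvää huomenta, Suomi!"))  # hyvää huomenta suomi
```

Applying the same normalization to any new reference text keeps scores comparable with the reported results, since the model's output is lowercase text without punctuation.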

📚 Documentation

Model Information

Property	Details
Model Type	V XLSR Wav2Vec2 Large 53 - finnish
Training Data	Common Voice, CSS10 finnish: Single Speaker Speech Dataset
Metrics	WER (Word Error Rate), CER (Character Error Rate)

Test Results

The model achieved a test WER of 38.335242% and a test CER of 6.552408% on the Common Voice fi dataset.
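Both metrics are edit distances divided by the reference length, counted over words for WER and over characters for CER. The sketch below is a plain-Python illustration with hypothetical function names. It is not the `wer` metric used in the evaluation script, whose space-joining approach appears to ignore whitespace, so its figures may differ slightly from this character-level count.

```python
from typing import Callable, List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # deletion
                row[j - 1] + 1,    # insertion
                prev_diag + cost,  # substitution or match
            )
    return row[len(hyp)]


def error_rate(refs: List[str], hyps: List[str], tokenize: Callable) -> float:
    """Total edits divided by total reference tokens, as a percentage."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return 100.0 * errors / total


def word_error_rate(refs: List[str], hyps: List[str]) -> float:
    return error_rate(refs, hyps, str.split)


def char_error_rate(refs: List[str], hyps: List[str]) -> float:
    return error_rate(refs, hyps, list)


refs = ["hyvää huomenta suomi"]
hyps = ["hyvä huomenta suomi"]  # one missing character
print(round(word_error_rate(refs, hyps), 2))  # 33.33
print(round(char_error_rate(refs, hyps), 2))  # 5.0
```

The toy example shows why the two figures can diverge: a single wrong character makes the whole word count as an error for WER but costs only one character for CER.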

🔧 Technical Details

The model was first trained on the Common Voice train split together with all of CSS10 Finnish, using the normalized transcripts. After 20,000 steps, it was fine-tuned for a further 2,000 steps on the Common Voice train and validation sets.

📄 License

This model is licensed under the Apache-2.0 license.
