Wav2vec2-large-XLRS-Estonian: an open-source model for Estonian speech-to-text

Wav2vec2 Large Xlrs Estonian

Developed by birgermoell

This is an automatic speech recognition (ASR) model fine-tuned on the Estonian Common Voice dataset, based on the facebook/wav2vec2-large-xlsr-53 model.

Speech Recognition | Other | Open Source License: Apache-2.0 | #Estonian speech recognition #XLSR fine-tuned model #Low-resource language processing

Downloads 18

Release Time : 3/2/2022

Model Overview

This model is specifically designed for Estonian speech recognition tasks, built on the Wav2Vec2 architecture and fine-tuned on the Common Voice dataset.

Model Features

XLSR Fine-tuning

Fine-tuned for Estonian from the large-scale multilingual pre-trained model XLSR-53

16kHz Sampling Rate Support

Expects speech input sampled at 16 kHz

No Language Model Required

Can be used directly without additional language models

Model Capabilities

Estonian speech recognition

Audio-to-text conversion

Use Cases

Speech Transcription

Estonian Speech-to-Text

Convert Estonian speech into text content

WER 36.95%

🚀 Wav2Vec2-Large-XLSR-53-Estonian

This model is a version of facebook/wav2vec2-large-xlsr-53 fine-tuned for Estonian on the Common Voice dataset. It can be used for automatic speech recognition. Make sure your speech input is sampled at 16 kHz when using this model.

📋 Model Information

Property	Details
Language	Estonian
Datasets	Common Voice
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0
Model Name	XLSR Wav2Vec2 Estonian by Birger Moell
Task	Speech Recognition (automatic-speech-recognition)
Dataset Name	Common Voice Estonian
Dataset Type	common_voice
Test WER	36.951816

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
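
One way to act on the 16 kHz requirement is to check a recording's sampling rate before passing it to the model. The sketch below is an illustration only: it assumes the audio is stored as WAV (Common Voice clips ship in a different format), uses Python's standard `wave` module, and the helper names `wav_sample_rate`, `needs_resampling`, and `write_silent_wav` are hypothetical rather than part of the model's tooling.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the model card


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate


def write_silent_wav(path, rate, seconds=0.1):
    """Hypothetical helper: write a short mono 16-bit silent WAV for the demo."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo files created in a temporary directory (illustrative data only).
demo_dir = tempfile.mkdtemp()
path_48k = os.path.join(demo_dir, "clip_48k.wav")
path_16k = os.path.join(demo_dir, "clip_16k.wav")
write_silent_wav(path_48k, 48_000)
write_silent_wav(path_16k, 16_000)

print(needs_resampling(path_48k))  # True
print(needs_resampling(path_16k))  # False
```

A file flagged by this check would then need to be resampled to 16 kHz before inference.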

✨ Features

Fine-tuned from facebook/wav2vec2-large-xlsr-53 for Estonian speech recognition.
Can be used directly without a language model.

📦 Installation

The original model card does not list installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so these packages need to be available.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "et", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("birgermoell/wav2vec2-large-xlrs-estonian")
model = Wav2Vec2ForCTC.from_pretrained("birgermoell/wav2vec2-large-xlrs-estonian")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read the audio files as arrays and resample them from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
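
The `torch.argmax` call followed by `processor.batch_decode` in the example above amounts to greedy CTC decoding, which is why no separate language model is needed. As a rough illustration of what that decoding step does, here is a minimal pure-Python sketch. It assumes the usual Wav2Vec2 CTC conventions (the padding token acts as the CTC blank and `|` marks word boundaries); the toy vocabulary and the function name `ctc_greedy_decode` are made up for this example and are not the model's real vocabulary or API.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join characters."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Keep a token only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    # Word delimiters become spaces in the final transcript.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Toy vocabulary (an assumption for illustration, not the model's real vocab).
vocab = {0: "<pad>", 1: "|", 2: "t", 3: "e", 4: "r"}

# Per-frame argmax ids: t t <pad> e e r <pad> e |
frame_ids = [2, 2, 0, 3, 3, 4, 0, 3, 1]
print(ctc_greedy_decode(frame_ids, vocab, blank_id=0))  # tere
```

The blank between two identical ids is what lets CTC emit a genuinely doubled letter: `[2, 0, 2]` decodes to `tt`, whereas `[2, 2]` decodes to a single `t`.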

Advanced Usage

The model can be evaluated on the Common Voice Estonian test set as follows.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "et", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("birgermoell/wav2vec2-large-xlrs-estonian")
model = Wav2Vec2ForCTC.from_pretrained("birgermoell/wav2vec2-large-xlrs-estonian")
model.to("cuda")

chars_to_ignore_regex = r'[,?.!\-;:"“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Normalize the reference sentences and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on each batch and decode the predicted token ids to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
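
The evaluation script above strips punctuation from the reference sentences and lowercases them before scoring, so that punctuation and casing differences are not counted as errors. A minimal sketch of that normalization step follows. The character class is a simplified pattern assumed to match the intent of `chars_to_ignore_regex`, and `normalize_sentence` is a hypothetical helper name.

```python
import re

# Simplified character class (assumed to match the intent of chars_to_ignore_regex).
CHARS_TO_IGNORE = re.compile(r'[,?.!\-;:"“]')


def normalize_sentence(sentence):
    """Remove the listed punctuation marks and lowercase the sentence."""
    return CHARS_TO_IGNORE.sub("", sentence).lower()


print(normalize_sentence("Tere, maailm!"))  # tere maailm
```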

📚 Documentation

Test Result: WER: 36.951816
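
For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, assumed to mirror what the `wer` metric computes over a whole test set; it is not the library's implementation, and the example sentences are invented.

```python
def word_edit_distance(reference_words, hypothesis_words):
    """Levenshtein distance between two word lists."""
    previous_row = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        total_edits += word_edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    return total_edits / total_words


# Invented example: one wrong word out of five reference words.
references = ["tere hommikust kõigile", "kuidas läheb"]
predictions = ["tere hommikust kõigil", "kuidas läheb"]
print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))  # WER: 20.00
```

Read this way, the reported test result of 36.951816 corresponds to roughly 37 word edits per 100 reference words.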

🔧 Technical Details

The Common Voice train and validation datasets were used for training. The training script is available at https://colab.research.google.com/drive/1VcWT92vBCwVn-5d-mkYxhgILPr11OHfR?usp=sharing

📄 License

This model is licensed under the Apache-2.0 license.
