wav2vec2-large-xlsr-53-latvian Open-source Speech Recognition Model

Wav2vec2 Large Xlsr 53 Latvian

Developed by anton-l

This automatic speech recognition (ASR) model is Facebook's Wav2Vec2-Large-XLSR-53 fine-tuned on the Latvian portion of the Common Voice dataset.

Speech Recognition | Other | Open Source | License: Apache-2.0 | #Latvian speech recognition | #XLSR fine-tuning | #Low-resource language support

Downloads 137

Release Time: 3/2/2022

Model Overview

This model is designed for Latvian speech recognition. It was fine-tuned on the Common Voice dataset and expects audio input sampled at 16 kHz.

Model Features

Latvian recognition accuracy

Achieves 26.89% WER (Word Error Rate) on the Common Voice test set

Based on XLSR pre-trained model

Fine-tuned from the XLSR (cross-lingual speech representation) pre-trained model

No language model required

Can be used directly, without a separate language model for decoding

Model Capabilities

Latvian speech recognition

16kHz audio processing

End-to-end speech-to-text

Use Cases

Speech transcription

Latvian speech-to-text

Convert Latvian speech content into text

Reported result: 26.89% WER on the Common Voice test set

Voice assistants

Latvian voice command recognition

Used for voice command recognition in Latvian voice assistants or control systems

🚀 Wav2Vec2-Large-XLSR-53-Latvian

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Latvian using the Common Voice dataset. It is designed for automatic speech recognition tasks.

🔍 Key Information

Property	Details
Model Type	Wav2Vec2-Large-XLSR-53-Latvian
Datasets	common_voice
Metrics	wer
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0

📊 Model Index

Name: Latvian XLSR Wav2Vec2 Large 53 by Anton Lozhkov
Results:
- Task:
  - Name: Speech Recognition
  - Type: automatic-speech-recognition
- Dataset:
  - Name: Common Voice lv
  - Type: common_voice
  - Args: lv
- Metrics:
  - Name: Test WER
  - Type: wer
  - Value: 26.89

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
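A quick way to check the sampling rate before running the model is sketched below. This is a minimal illustration that assumes the input is an uncompressed WAV file readable by Python's standard `wave` module; the helper name `needs_resampling` is hypothetical and not part of the model's tooling. Compressed formats such as MP3 would need an audio library instead.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def needs_resampling(wav_path, target_rate=TARGET_RATE):
    """Return True if the WAV file at wav_path is not sampled at target_rate.

    Hypothetical helper for illustration; it only reads the file header.
    """
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

If the helper returns True, resample the audio to 16 kHz before passing it to the processor.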

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "lv", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-latvian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-latvian")

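# This assumes the Common Voice clips are sampled at 48 kHz; they are resampled to the 16 kHz the model expects
resampler = torchaudio.transforms.Resample(48_000, 16_000)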

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
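The `argmax` plus `batch_decode` step above amounts to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, drop the blank token, and turn word delimiters into spaces. The sketch below illustrates that idea in plain Python. The function name, the toy vocabulary, and the frame ids are made up for illustration; the `<pad>` blank and `|` word delimiter follow the usual Wav2Vec2 CTC tokenizer convention and are assumed here rather than read from this model's vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Collapse per-frame argmax ids into text (illustrative sketch only)."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Map ids to tokens and drop the CTC blank.
    chars = [id_to_token[i] for i in collapsed if id_to_token[i] != blank]
    # 3. Replace the word delimiter with a space and normalise whitespace.
    return " ".join("".join(chars).replace(word_delimiter, " ").split())


# Hypothetical toy vocabulary and frame-level predictions.
toy_vocab = {0: "<pad>", 1: "|", 2: "j", 3: "ā", 4: "n", 5: "e"}
frames = [2, 2, 0, 3, 3, 0, 1, 1, 4, 0, 5, 5]
print(ctc_greedy_decode(frames, toy_vocab))  # prints: jā ne
```

Because each frame is decoded independently and no external language model rescoring is involved, this matches the "no language model required" usage described earlier.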

📚 Documentation

Evaluation

The model can be evaluated as follows on the Latvian test data of Common Voice.

import torch
import torchaudio
import urllib.request
import tarfile
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Download the raw data instead of using HF datasets to save disk space 
data_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/lv.tar.gz"
filestream = urllib.request.urlopen(data_url)
data_file = tarfile.open(fileobj=filestream, mode="r|gz")
data_file.extractall()

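# Note: load_metric is deprecated in recent `datasets` releases; evaluate.load("wer") from the `evaluate` library is the replacement
wer = load_metric("wer")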

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-latvian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-latvian")
model.to("cuda")

cv_test = pd.read_csv("cv-corpus-6.1-2020-12-11/lv/test.tsv", sep='\t')
clips_path = "cv-corpus-6.1-2020-12-11/lv/clips/"

def clean_sentence(sent):
    sent = sent.lower()
    # replace non-alpha characters with space
    sent = "".join(ch if ch.isalpha() else " " for ch in sent)
    # remove repeated spaces
    sent = " ".join(sent.split())
    return sent

targets = []
preds = []

for i, row in tqdm(cv_test.iterrows(), total=cv_test.shape[0]):
    row["sentence"] = clean_sentence(row["sentence"])
    speech_array, sampling_rate = torchaudio.load(clips_path + row["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    row["speech"] = resampler(speech_array).squeeze().numpy()

    inputs = processor(row["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)

    targets.append(row["sentence"])
    preds.append(processor.batch_decode(pred_ids)[0])

print("WER: {:.2f}".format(100 * wer.compute(predictions=preds, references=targets)))

Test Result: 26.89%
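For readers unfamiliar with the metric, word error rate is commonly defined as the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below shows that definition for a single sentence pair. It is an illustration only, not the implementation behind the `wer` metric used above, and the function name and example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length (sketch)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One substituted word out of three reference words.
score = word_error_rate("es runāju latviski", "es runāju latviešu")
print("WER: {:.2f}".format(100 * score))  # prints: WER: 33.33
```

Both sides should be normalised the same way before comparison, as `clean_sentence` does for the references in the evaluation script.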

Training

The Common Voice train and validation datasets were used for training.

📄 License

This project is licensed under the apache-2.0 license.
