wav2vec2-large-xlsr-53-lithuanian Open-source Model - Free Implementation of Lithuanian Automatic Speech Recognition

Wav2vec2 Large Xlsr 53 Lithuanian

Developed by anton-l

An automatic speech recognition model fine-tuned for Lithuanian using the Common Voice dataset, based on the facebook/wav2vec2-large-xlsr-53 model.

Speech Recognition · Other · Open Source License: Apache-2.0 · #Lithuanian speech recognition · #XLSR fine-tuning · #Low-resource language

Downloads 29

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Lithuanian, capable of converting Lithuanian speech into text.

Model Features

Lithuanian optimization

Specially fine-tuned for Lithuanian to improve recognition accuracy for this language.

Based on XLSR-53 architecture

Builds on facebook/wav2vec2-large-xlsr-53, a large multilingual pre-trained speech model, which supplies the speech feature extraction.

16kHz sampling rate support

Expects audio input sampled at 16kHz; audio recorded at any other rate must be resampled to 16kHz before inference.

Model Capabilities

Lithuanian speech recognition

Speech-to-text

Automatic speech transcription

Use Cases

Speech transcription

Lithuanian speech-to-text

Convert Lithuanian speech content into editable text format

Achieves a WER of 49.00% on the Common Voice test set

Voice assistants

Lithuanian voice command recognition

Used for developing voice assistants and control systems that support Lithuanian

🚀 Wav2Vec2-Large-XLSR-53-Lithuanian

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Lithuanian using the Common Voice dataset. It aims to provide high-quality speech recognition for Lithuanian.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Lithuanian with the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.

✨ Features

Language Adaptation: Fine-tuned specifically for Lithuanian, enhancing speech recognition performance in this language.
16kHz Input: Requires speech input to be sampled at 16kHz for optimal results.

📦 Installation

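The model card lists no installation steps. The usage examples import torch, torchaudio, datasets, and transformers, and the evaluation script additionally imports pandas and tqdm, so these packages must be available.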
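Since the card gives no install instructions, a quick way to see what is missing is to check whether the imported packages can be found. The following is a minimal sketch, assuming the package names match the import names used in the examples; the helper name `missing_packages` is hypothetical and not part of any library.

```python
from importlib.util import find_spec

# Top-level packages imported by the usage examples in this card.
# Assumption: the `wer` metric may need an extra backend package that the card does not list.
REQUIRED = ["torch", "torchaudio", "datasets", "transformers", "pandas", "tqdm"]


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages used by the examples are installed.")
```

Anything reported as missing can then be installed with the package manager of your choice.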

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "lt", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-lithuanian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-lithuanian")

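# The Common Voice clips are assumed to be recorded at 48kHz; resample them to the 16kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)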

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
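The last two steps of the script (a per-frame `argmax` followed by `batch_decode`) amount to greedy CTC decoding: consecutive repeated predictions are collapsed, blank tokens are dropped, and the word-delimiter token becomes a space. The sketch below illustrates that idea in plain Python. The vocabulary, the blank id, and the `|` delimiter are assumptions made for illustration; the real values come from the model's processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the processor.
VOCAB = {0: "<pad>", 1: "|", 2: "l", 3: "a", 4: "b", 5: "s"}
BLANK_ID = 0          # assumed: the pad token doubles as the CTC blank
WORD_DELIMITER = "|"  # assumed word-boundary token


def greedy_ctc_decode(frame_ids):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [VOCAB[token_id] for token_id in collapsed if token_id != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids spelling "labas", followed by a word boundary.
frames = [2, 2, 0, 3, 3, 3, 4, 0, 3, 5, 5, 1, 0]
print(greedy_ctc_decode(frames))
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters.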

Advanced Usage

import torch
import torchaudio
import urllib.request
import tarfile
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Download the raw data instead of using HF datasets to save disk space 
data_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/lt.tar.gz"
filestream = urllib.request.urlopen(data_url)
data_file = tarfile.open(fileobj=filestream, mode="r|gz")
data_file.extractall()

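# Note: load_metric is deprecated in recent `datasets` releases; evaluate.load("wer") is the newer equivalent.
wer = load_metric("wer")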

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-lithuanian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-lithuanian")
model.to("cuda")

cv_test = pd.read_csv("cv-corpus-6.1-2020-12-11/lt/test.tsv", sep='\t')
clips_path = "cv-corpus-6.1-2020-12-11/lt/clips/"

def clean_sentence(sent):
    sent = sent.lower()
    # normalize apostrophes
    sent = sent.replace("’", "'")
    # replace non-alpha characters with space
    sent = "".join(ch if ch.isalpha() or ch == "'" else " " for ch in sent)
    # remove repeated spaces
    sent = " ".join(sent.split())
    return sent

targets = []
preds = []

for i, row in tqdm(cv_test.iterrows(), total=cv_test.shape[0]):
    row["sentence"] = clean_sentence(row["sentence"])
    speech_array, sampling_rate = torchaudio.load(clips_path + row["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    row["speech"] = resampler(speech_array).squeeze().numpy()

    inputs = processor(row["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)

    targets.append(row["sentence"])
    preds.append(processor.batch_decode(pred_ids)[0])

print("WER: {:.2f}".format(100 * wer.compute(predictions=preds, references=targets)))

Test Result: 49.00% WER
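To make the reported number easier to interpret, here is a minimal sketch of how a word error rate is computed: references are normalized with the same `clean_sentence` logic as the evaluation script, then the word-level edit distance is summed and divided by the number of reference words. The function `word_error_rate` and the example sentences are illustrative assumptions; this is not the implementation behind `load_metric("wer")`.

```python
def clean_sentence(sent):
    """Same normalization as the evaluation script: lowercase, keep letters and apostrophes."""
    sent = sent.lower().replace("’", "'")
    sent = "".join(ch if ch.isalpha() or ch == "'" else " " for ch in sent)
    return " ".join(sent.split())


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, computed one row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution = prev[j - 1] + (ref_word != hyp_word)
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# Made-up example: one substituted word out of four reference words.
references = [clean_sentence("Labas rytas, Lietuva!"), clean_sentence("Ačiū.")]
predictions = ["labas ritas lietuva", "ačiū"]
print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))
```

Read this way, a test WER of 49.00% means that the number of word-level edits needed to turn the predictions into the references is about 49% of the reference word count.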

📚 Documentation

Model Information

Property	Details
Model Type	Fine-tuned Wav2Vec2-Large-XLSR-53 for Lithuanian
Training Data	Common Voice `train` and `validation` datasets
Metrics	Word Error Rate (WER)

Model Index

Name: Lithuanian XLSR Wav2Vec2 Large 53 by Anton Lozhkov
- Results:
  - Task:
    - Name: Speech Recognition
    - Type: automatic-speech-recognition
  - Dataset:
    - Name: Common Voice lt
    - Type: common_voice
    - Args: lt
  - Metrics:
    - Name: Test WER
    - Type: wer
    - Value: 49.00

🔧 Technical Details

The model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Lithuanian subset of the Common Voice dataset. Speech input must be sampled at 16kHz; audio at any other rate should be resampled first, otherwise recognition accuracy suffers.
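A simple guard against feeding audio at the wrong rate is to read the sampling rate from the file header before inference. The sketch below does this for WAV files with Python's standard `wave` module. The helper names are hypothetical, and the demo writes two short silent files only to exercise the check. Compressed formats such as the MP3 clips in Common Voice would need a different loader, for example `torchaudio.load` as in the examples above.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the card


def wav_sample_rate(path):
    """Read the sampling rate from a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate


def write_silence(path, rate, n_frames=160):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


with tempfile.TemporaryDirectory() as tmp_dir:
    clip_48k = os.path.join(tmp_dir, "clip_48k.wav")
    clip_16k = os.path.join(tmp_dir, "clip_16k.wav")
    write_silence(clip_48k, 48_000)
    write_silence(clip_16k, 16_000)
    results = {
        "clip_48k.wav": needs_resampling(clip_48k),
        "clip_16k.wav": needs_resampling(clip_16k),
    }

print(results)
```

Files flagged by `needs_resampling` would then go through a resampler, such as `torchaudio.transforms.Resample`, before being passed to the processor.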

📄 License

This model is licensed under the Apache-2.0 license.
