Wav2vec2 Large Xlsr 53 Lithuanian

Developed by DeividasM

A Lithuanian speech recognition model fine-tuned from Facebook's XLSR-53 large model, trained on the Common Voice dataset with a test WER of 56.55%.

Category: Speech Recognition
License: Apache-2.0 (open source)
Tags: Lithuanian speech recognition, XLSR fine-tuned model, low-resource language processing

Downloads 4,105

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) system optimized for Lithuanian, suitable for converting Lithuanian audio into text.

Model Features

XLSR Fine-tuning

Fine-tuned from Facebook's XLSR-53 large model to adapt to Lithuanian language characteristics

16kHz Sampling Rate Support

Expects speech input sampled at 16 kHz; audio recorded at other rates should be resampled first

No Language Model Required

Can be used directly without additional language model support

Model Capabilities

Lithuanian speech recognition

Audio-to-text conversion

Automatic speech transcription

Use Cases

Speech Transcription

Speech-to-Text Service

Automatically convert Lithuanian speech content into text

Test WER 56.55%

Voice Assistant

Support for Lithuanian voice interaction systems

🚀 Wav2Vec2-Large-XLSR-53-Lithuanian

This model is a version of facebook/wav2vec2-large-xlsr-53 fine-tuned on Lithuanian using the Common Voice dataset. It is designed for automatic speech recognition tasks.

Dataset and Tags

Datasets: common_voice
Tags: audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License: apache-2.0

Model Index

Name: XLSR Wav2Vec2 Lithuanian by Deividas Mataciunas
Results:
- Task:
  - Name: Speech Recognition
  - Type: automatic-speech-recognition
- Dataset:
  - Name: Common Voice lt
  - Type: common_voice
  - Args: lt
- Metrics:
  - Name: Test WER
  - Type: wer
  - Value: 56.55

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
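
One way to guard against a mismatched sampling rate is to inspect a file before transcribing it. The sketch below is a minimal illustration for uncompressed WAV files using Python's standard wave module. The helper names wav_sample_rate and is_model_ready are hypothetical and are not part of the model's API; compressed formats would need a different reader, such as the torchaudio loader used in the usage examples.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_model_ready(path, target_rate=TARGET_RATE):
    """True if the file is already at the target rate and needs no resampling."""
    return wav_sample_rate(path) == target_rate
```

If is_model_ready returns False, the audio should be resampled to 16 kHz before it is passed to the processor.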

✨ Features

Fine-tuned on Lithuanian using the Common Voice dataset.
Can be used for automatic speech recognition tasks without a language model.

📦 Installation

The usage examples below import torch, torchaudio, datasets, and transformers; install these packages (for example with pip) before running them.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "lt", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("DeividasM/wav2vec2-large-xlsr-53-lithuanian")
model = Wav2Vec2ForCTC.from_pretrained("DeividasM/wav2vec2-large-xlsr-53-lithuanian")
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
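
The argmax followed by batch_decode in the example above amounts to greedy CTC decoding, with no language model involved. As a rough illustration of the idea, the sketch below merges repeated token ids and drops the blank token on a toy vocabulary. The vocabulary, the blank id, the | word delimiter, and the function name greedy_ctc_decode are assumptions made for this example; the real mapping is defined by the model's processor.

```python
from itertools import groupby

# Toy vocabulary for illustration only; the real one ships with the processor.
VOCAB = {0: "<pad>", 1: "|", 2: "a", 3: "b", 4: "l", 5: "s"}
BLANK_ID = 0          # assumed id of the CTC blank token
WORD_DELIMITER = "|"  # assumed symbol marking word boundaries


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Turn per-frame argmax ids into text: merge repeats, then drop blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Frames: l l <pad> a a b <pad> a s | | l a b <pad> a s
frames = [4, 4, 0, 2, 2, 3, 0, 2, 5, 1, 1, 4, 2, 3, 0, 2, 5]
print(greedy_ctc_decode(frames))  # prints "labas labas"
```

Note that a blank between two identical ids keeps them as separate characters, which is how doubled letters survive the merge step.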

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "lt", split="test")
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("DeividasM/wav2vec2-large-xlsr-53-lithuanian")
model = Wav2Vec2ForCTC.from_pretrained("DeividasM/wav2vec2-large-xlsr-53-lithuanian")
model.to("cuda")
# Punctuation to strip from the reference sentences before scoring
chars_to_ignore_regex = r'[,?.!\-;:"“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions to text
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
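
The evaluation script strips punctuation from the reference sentences and lowercases them before scoring. The sketch below isolates that normalization step so it can be tried without loading the dataset. The character class is a cleaned-up raw-string rendering of the punctuation set in the script, and the function name normalize_sentence and the sample sentence are made up for this example.

```python
import re

# Punctuation stripped before scoring; a raw-string rendering of the
# character set used in the evaluation script (an assumption of this sketch).
CHARS_TO_IGNORE = r'[,?.!\-;:"“]'


def normalize_sentence(sentence):
    """Remove ignored punctuation and lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_sentence("Labas, kaip sekasi?"))  # prints "labas kaip sekasi"
```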

Test Result: 56.55 %
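
For readers unfamiliar with the metric: word error rate counts the word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the implementation behind the wer metric used above; the function name word_error_rate and the sample sentences are made up for the example.

```python
def word_error_rate(reference, prediction):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    pred_words = prediction.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far
    # and the first j predicted words.
    previous = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous[j - 1] + (ref_word != pred_word)
            deletion = previous[j] + 1      # reference word missing from prediction
            insertion = current[j - 1] + 1  # extra word in prediction
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One substituted word out of four reference words.
print(100 * word_error_rate("labas rytas visiems draugams",
                            "labas vakaras visiems draugams"))  # prints 25.0
```

Read this way, a test WER of 56.55% corresponds to roughly 57 word errors per 100 reference words. Because insertions count as errors, the value can exceed 100% when the prediction is much longer than the reference.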

📚 Documentation

The Common Voice train and validation splits were used for training.

📄 License

This model is licensed under the Apache-2.0 license.
