wav2vec2-large-xlsr-53-basque Open-source Speech Recognition Model - Free Deployment for Precise Basque Language Recognition

Wav2vec2 Large Xlsr 53 Basque

Developed by stefan-it

An automatic speech recognition model fine-tuned on Basque data from the Common Voice dataset, based on facebook/wav2vec2-large-xlsr-53

Speech Recognition

Transformers

Open Source License: Apache-2.0

Tags: #Basque speech recognition #XLSR fine-tuned model #Low WER transcription

Downloads 10.70k

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model optimized for Basque, based on the Wav2Vec2 architecture, suitable for converting Basque speech into text.

Model Features

High Accuracy Basque Recognition

Achieves 18.27% WER (Word Error Rate) on the Basque test set of Common Voice

No Language Model Required

Can be used directly without additional language model support

16kHz Sampling Rate Support

Optimized for 16kHz sampled speech input

Model Capabilities

Basque speech recognition

Speech-to-text

Automatic speech transcription

Use Cases

Speech Transcription

Basque Speech Transcription

Convert Basque speech content into text

18.27% Word Error Rate

Voice Assistants

Basque Voice Command Recognition

Speech recognition component for Basque voice assistants or voice control systems

🚀 Wav2Vec2-Large-XLSR-53-Basque

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Basque using the Common Voice dataset. It is designed for automatic speech recognition.

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16 kHz.
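As a minimal sketch of checking the 16 kHz requirement before inference, the helper below reads the sampling rate from a WAV header using only the Python standard library. The function name `needs_resampling` is hypothetical, and the sketch assumes uncompressed PCM WAV input; other formats such as MP3 would need a different reader.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects


def needs_resampling(wav_path, target_rate=TARGET_RATE):
    """Return (needs_resampling, actual_rate) for an uncompressed WAV file.

    Hypothetical helper for illustration; it only inspects the header
    and does not resample anything itself.
    """
    with wave.open(wav_path, "rb") as wav_file:
        actual_rate = wav_file.getframerate()
    return actual_rate != target_rate, actual_rate
```

If the first returned value is true, the audio should be resampled to 16 kHz before it is passed to the processor.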

✨ Features

Language: Basque
Datasets: Common Voice
Tags: audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License: Apache 2.0

📦 Installation

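The original model card gives no installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so these packages need to be available.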

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "eu", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("stefan-it/wav2vec2-large-xlsr-53-basque")
model = Wav2Vec2ForCTC.from_pretrained("stefan-it/wav2vec2-large-xlsr-53-basque")
# The resampler is hard-coded for 48 kHz source audio and converts it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: read each audio file into a 16 kHz array.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
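The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The vocabulary, the blank id of 0, and the function name `ctc_greedy_decode` are all hypothetical; the real processor uses the model's own vocabulary and also handles word boundaries.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions and remove blanks (greedy CTC).

    frame_ids: one predicted token id per audio frame (the argmax output).
    id_to_char: toy vocabulary mapping ids to characters (assumed here).
    blank_id: id of the CTC blank token (assumed to be 0 in this sketch).
    """
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary and frame sequence, invented for illustration only.
toy_vocab = {0: "<pad>", 1: "k", 2: "a", 3: "i", 4: "x", 5: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 4, 4, 0, 5]
print(ctc_greedy_decode(frames, toy_vocab))  # kaixo
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters.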

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "eu", split="test")
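# Note: load_metric is deprecated in newer datasets releases; the separate evaluate package offers the same "wer" metric.
wer = load_metric("wer")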
processor = Wav2Vec2Processor.from_pretrained("stefan-it/wav2vec2-large-xlsr-53-basque")
model = Wav2Vec2ForCTC.from_pretrained("stefan-it/wav2vec2-large-xlsr-53-basque")
model.to("cuda")
# Punctuation stripped from the reference sentences before scoring.
chars_to_ignore_regex = r'[,?.!\-;:"“%‘”�]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: normalize each reference sentence
# and read the audio file into a 16 kHz array.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and store the greedy-decoded predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
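The evaluation script lowercases each reference sentence and strips punctuation before scoring, so that punctuation and casing differences are not counted as word errors. A minimal standalone sketch of that normalization follows; the character set here is a simplified, assumed stand-in for `chars_to_ignore_regex`, and `normalize_sentence` is a hypothetical name.

```python
import re

# Simplified punctuation set, assumed for illustration.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%‘”]'


def normalize_sentence(sentence):
    """Remove the ignored punctuation, then lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_sentence("Kaixo, zer moduz?"))  # kaixo zer moduz
```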

Test Result

The model achieved a Word Error Rate (WER) of 18.272625% on the test dataset.
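For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, aggregated over all sentences; it is not the implementation behind the `wer` metric used above, and the function names and example sentences are hypothetical.

```python
def edit_distance(ref_words, pred_words):
    """Word-level Levenshtein distance between two token lists."""
    previous_row = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != pred_word)
            deletion = previous_row[j] + 1      # reference word missing from prediction
            insertion = current_row[j - 1] + 1  # extra word in prediction
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words.

    Assumes at least one non-empty reference sentence.
    """
    total_errors = 0
    total_words = 0
    for ref, pred in zip(references, predictions):
        ref_words, pred_words = ref.split(), pred.split()
        total_errors += edit_distance(ref_words, pred_words)
        total_words += len(ref_words)
    return total_errors / total_words


# Invented example: one substituted word out of five reference words.
refs = ["kaixo zer moduz", "egun on"]
preds = ["kaixo ze moduz", "egun on"]
print("WER: {:.2f}".format(100 * word_error_rate(refs, preds)))  # WER: 20.00
```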

📚 Documentation

Training

The Common Voice train and validation datasets were used for training. The training script will hopefully be available soon.

Acknowledgements

Many thanks to the OVH team for providing access to a V100 instance. Without their help, fine-tuning would not have been possible! I would also like to thank Manuel Romero (mrm8488) for helping with the fine-tuning script!

📄 License

This project is licensed under the Apache 2.0 license.

Property	Details
Model Type	Wav2Vec2-Large-XLSR-53-Basque
Training Data	Common Voice (train and validation datasets)
Test Dataset	Common Voice eu (test dataset)
Test WER	18.272625%
License	Apache-2.0
