Open-source Model wav2vec2-large-xlsr-53-esperanto - Precise Implementation of Esperanto Speech Recognition

Wav2vec2 Large Xlsr 53 Esperanto

Developed by cpierse

This is an Esperanto speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53 model, trained using the Common Voice dataset.

Speech Recognition · Other · Open Source License: Apache-2.0 · #Esperanto speech recognition #Low WER #Multilingual support

Downloads 8,681

Release Time : 3/2/2022

Model Overview

This model is specifically designed for automatic speech recognition (ASR) tasks in Esperanto, capable of converting Esperanto speech into text.

Model Features

High accuracy Esperanto recognition

Achieves 12.31% WER (Word Error Rate) on the Common Voice Esperanto test set

Based on XLSR-53 architecture

Utilizes a cross-lingual pre-trained large-scale model for fine-tuning, with powerful speech feature extraction capabilities

No language model required

Can be used directly without additional language model support

Model Capabilities

Esperanto speech recognition

Speech-to-text conversion

16kHz audio processing

Use Cases

Speech transcription

Esperanto speech transcription

Convert Esperanto speech content into text format

12.31% WER

Assistive tools

Esperanto learning aid

Help Esperanto learners verify pronunciation accuracy
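As a rough illustration of the learning-aid use case, the sketch below compares a prompt sentence with a transcription and lists the words that differ. It uses only Python's standard `difflib`; the function name `mismatched_words` and the sample sentences are invented for illustration, and the transcription is hard-coded rather than produced by the model. Because the model itself has a non-zero WER, a mismatch may reflect a recognition error rather than a pronunciation error.

```python
import difflib


def mismatched_words(target_sentence, transcription):
    """Return (expected, heard) word groups where the two texts differ."""
    target_words = target_sentence.lower().split()
    heard_words = transcription.lower().split()
    matcher = difflib.SequenceMatcher(a=target_words, b=heard_words, autojunk=False)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append((target_words[a_start:a_end], heard_words[b_start:b_end]))
    return differences


# The transcription would normally come from the model; here it is hard-coded.
print(mismatched_words("mi lernas esperanton", "mi lernas esperanto"))
# [(['esperanton'], ['esperanto'])]
```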

🚀 Wav2Vec2-Large-XLSR-53-eo

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Esperanto using the Common Voice dataset, aiming to provide high-quality automatic speech recognition for Esperanto.

🚀 Quick Start

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Esperanto using the Common Voice dataset.

When using this model, make sure that your speech input is sampled at 16 kHz.
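Before running inference, it can help to confirm a file's sample rate. The sketch below uses only Python's standard `wave` module, so it assumes uncompressed PCM WAV input; the helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of the model's API.

```python
import wave

TARGET_RATE = 16_000  # sample rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

Files that are not at 16 kHz can then be resampled, for example with torchaudio.transforms.Resample as in the usage examples.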

✨ Features

Dataset Utilization: Trained on the Esperanto subset of the Common Voice dataset.
Model Type: Fine-tuned from the large-scale XLSR model, suitable for Esperanto speech recognition.

📦 Installation

No specific installation steps are provided in the original README. However, you need the necessary Python libraries installed: torch, torchaudio, datasets, and transformers. The evaluation script under Advanced Usage also imports jiwer. You can install them using pip:

pip install torch torchaudio datasets transformers jiwer

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "eo", split="test[:2%]") 
processor = Wav2Vec2Processor.from_pretrained("cpierse/wav2vec2-large-xlsr-53-esperanto") 
model = Wav2Vec2ForCTC.from_pretrained("cpierse/wav2vec2-large-xlsr-53-esperanto") 

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read each audio file into an array and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
   speech_array, sampling_rate = torchaudio.load(batch["path"])
   batch["speech"] = resampler(speech_array).squeeze().numpy()
   return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
   logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
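The `torch.argmax` followed by `batch_decode` step above amounts to greedy CTC decoding, which is why no separate language model is needed: pick the most likely token for each frame, merge consecutive repeats, and drop the blank token. The following is a minimal pure-Python sketch of that idea. The toy vocabulary, the blank id, and the function name `greedy_ctc_decode` are assumptions made for illustration and do not reflect the model's actual tokenizer.

```python
def greedy_ctc_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding over a list of per-frame score lists."""
    # 1. Pick the highest-scoring token id for each frame (argmax).
    best_ids = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]

    # 2. Skip consecutive repeats and 3. skip the blank token.
    decoded = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)


# Toy vocabulary: id 0 is the CTC blank (hypothetical, for illustration only).
toy_vocab = ["<blank>", "s", "a", "l", "u", "t", "o"]

# Frame-level best ids: s s _ a l l _ u t o
toy_ids = [1, 1, 0, 2, 3, 3, 0, 4, 5, 6]
toy_scores = [[1.0 if i == t else 0.0 for i in range(len(toy_vocab))] for t in toy_ids]

print(greedy_ctc_decode(toy_scores, toy_vocab))  # saluto
```

A blank between two identical tokens keeps both, which is how CTC represents genuine double letters.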

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re
import jiwer

def chunked_wer(targets, predictions, chunk_size=None):
    # Compute WER chunk by chunk to limit memory use. Hit, substitution,
    # deletion, and insertion counts are summed across chunks, so the result
    # is the WER over the whole set rather than an average of chunk WERs.
    if chunk_size is None:
        return jiwer.wer(targets, predictions)
    start = 0
    end = chunk_size
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        chunk_metrics = jiwer.compute_measures(targets[start:end], predictions[start:end])
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    return float(S + D + I) / float(H + S + D)

# Load the full Esperanto ("eo") test split of Common Voice.
test_dataset = load_dataset("common_voice", "eo", split="test")

processor = Wav2Vec2Processor.from_pretrained("cpierse/wav2vec2-large-xlsr-53-esperanto")
model = Wav2Vec2ForCTC.from_pretrained("cpierse/wav2vec2-large-xlsr-53-esperanto")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�\„\«\(\»\)\’\']' 
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Strip punctuation, lowercase the reference text, and read each audio file as a 16 kHz array.
def speech_file_to_array_fn(batch):
   batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
   speech_array, sampling_rate = torchaudio.load(batch["path"])
   batch["speech"] = resampler(speech_array).squeeze().numpy()
   return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run inference in batches and decode the predicted token ids into text.
def evaluate(batch):
   inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

   with torch.no_grad():
      logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

   pred_ids = torch.argmax(logits, dim=-1)
   batch["pred_strings"] = processor.batch_decode(pred_ids)
   return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * chunked_wer(predictions=result["pred_strings"], targets=result["sentence"], chunk_size=2000)))
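The evaluation script normalizes each reference sentence before scoring, so punctuation and capitalization do not count as errors. The sketch below shows that step in isolation. The character class is a subset of the pattern used above, and the function name `normalize_sentence` and the sample sentence are invented for illustration.

```python
import re

# Punctuation stripped before scoring; a subset of the pattern used above.
PUNCTUATION_REGEX = r"[,?.!\-;:\"“%‘”„«(»)’']"


def normalize_sentence(sentence):
    """Remove punctuation and lowercase, so WER ignores both."""
    return re.sub(PUNCTUATION_REGEX, "", sentence).lower()


print(normalize_sentence("Saluton, mondo! Kiel vi fartas?"))
# saluton mondo kiel vi fartas
```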

📚 Documentation

Model Information

Property	Details
Model Type	Fine-tuned Wav2Vec2-Large-XLSR-53 for Esperanto
Training Data	Common Voice Esperanto `train` and `validation` datasets

Test Result

Test Result: 12.31% WER on the Common Voice Esperanto test set
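For reference, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words: WER = (S + D + I) / (H + S + D), the same ratio returned by chunked_wer in the evaluation script. The sketch below computes it for a single sentence pair in pure Python. It is an illustration of the metric, not the jiwer implementation, and the function name `word_error_rate` and the sample sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1      # reference word missing from hypothesis
            insertion = current_row[j - 1] + 1  # extra word in hypothesis
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One substitution in a four-word reference (invented example).
print(word_error_rate("la hundo kuras rapide", "la hundo kuris rapide"))  # 0.25
```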

📄 License

This model is licensed under the Apache-2.0 license.
