wav2vec2-large-xlsr-53-spanish Open-source Speech Recognition Model - Free Deployment for Precise Spanish Recognition

Wav2vec2 Large Xlsr 53 Spanish

Developed by facebook

A large-scale cross-lingual speech recognition model based on the Wav2Vec2 architecture, specifically optimized for Spanish, released by Facebook

Task: Speech Recognition

Language: Spanish

Open Source License: Apache-2.0

Tags: #Spanish speech recognition #Low error rate ASR #Large model transfer learning

Downloads 66.63k

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) model based on the Wav2Vec2 architecture. It starts from the cross-lingual XLSR-53 checkpoint, which was pretrained on speech in 53 languages, and is fine-tuned for Spanish speech recognition on the Common Voice Spanish dataset.

Model Features

Cross-lingual Pretraining

Built on the XLSR-53 checkpoint, which was pretrained on speech in 53 languages, so its cross-lingual representations transfer to Spanish

High Accuracy

Achieves a word error rate (WER) of 17.6% on the Common Voice Spanish test set

End-to-End Speech Recognition

Generates text output directly from raw audio input without complex feature engineering
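
To illustrate how a CTC model such as Wav2Vec2ForCTC turns frame-level predictions into text, here is a minimal sketch of greedy CTC decoding: take the most likely token for each frame, collapse consecutive repeats, then drop the blank token. The toy vocabulary, the `ctc_greedy_decode` helper, and the choice of `_` as the blank symbol are illustrative assumptions. The real model uses its own tokenizer vocabulary through `processor.batch_decode`.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank (assumption).
VOCAB = ["_", "h", "o", "l", "a", " "]
BLANK_ID = 0


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, then remove blank tokens."""
    # groupby merges runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(vocab[i] for i in collapsed if i != blank_id)


# Per-frame argmax ids for a short utterance: h h _ o l l _ a
frames = [1, 1, 0, 2, 3, 3, 0, 4]
print(ctc_greedy_decode(frames))  # hola
```

A blank between two identical ids keeps them as separate characters, which is how CTC can emit doubled letters such as "ll".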

Model Capabilities

Spanish speech-to-text

Continuous speech recognition

Audio feature extraction

Use Cases

Speech Transcription

Voice Memo Transcription

Automatically converts Spanish voice memos into text

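Word accuracy of roughly 82.4%, taken as 100% minus the 17.6% WER measured on the Common Voice Spanish test set; results on real voice memos may differ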

Customer Service Call Logging

Automatically records and transcribes Spanish customer service calls
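
For transcription use cases like these, a single recording can be run through the model directly. The sketch below is one possible way to do it, reusing the same `transformers` and `torchaudio` calls as the evaluation script on this page. The `transcribe` helper is hypothetical, and averaging channels to get mono audio is an assumption about the input.

```python
TARGET_SAMPLE_RATE = 16_000  # the evaluation script on this page feeds the model 16 kHz audio


def transcribe(path, model_name="facebook/wav2vec2-large-xlsr-53-spanish"):
    """Return the greedy CTC transcription of one audio file (illustrative helper)."""
    # Heavy imports are kept inside the function so defining it stays cheap.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)

    speech, sample_rate = torchaudio.load(path)  # shape: (channels, frames)
    speech = speech.mean(dim=0)                  # downmix to mono (assumption)
    if sample_rate != TARGET_SAMPLE_RATE:
        speech = torchaudio.functional.resample(speech, sample_rate, TARGET_SAMPLE_RATE)

    inputs = processor(
        speech.numpy(),
        sampling_rate=TARGET_SAMPLE_RATE,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)[0]
```

Calling `transcribe("memo.wav")` (a placeholder file name) would return the transcription as a lowercase string. Long call recordings would likely need to be split into shorter segments first, which this sketch does not handle.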

Assistive Technology

Voice-Controlled Interface

Provides voice control functionality for Spanish-speaking users

🚀 Speech Recognition Model on Common Voice ES

This project focuses on automatic speech recognition using the Common Voice Spanish dataset. It evaluates a pre-trained model on the test set of the dataset.

🚀 Quick Start

Prerequisites

Make sure the required libraries are installed: torch, torchaudio, datasets, and transformers. The WER metric additionally depends on jiwer.
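
A quick way to confirm the environment is ready is to check that each package can be found before running the evaluation. The snippet below is a small sketch using only the standard library. The `missing_packages` helper and the `REQUIRED` list are illustrative. `jiwer` is included on the assumption that the WER metric will be used, since that metric relies on it.

```python
import importlib.util

# Packages imported by the evaluation script, plus jiwer for the WER metric.
REQUIRED = ["torch", "torchaudio", "datasets", "transformers", "jiwer"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```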

Evaluation Steps

The following Python code demonstrates how to evaluate the model on the Common Voice ES test set:

import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_name = "facebook/wav2vec2-large-xlsr-53-spanish"
# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Punctuation stripped from the reference sentences before scoring.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)

# data_dir points at a local copy of the Common Voice 6.1 corpus.
ds = load_dataset("common_voice", "es", split="test", data_dir="./cv-corpus-6.1-2020-12-11")

# The script assumes 48 kHz clips and resamples them to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def map_to_array(batch):
    # Load one clip, resample it, and normalize its reference transcript.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = ds.map(map_to_array)

def map_to_pred(batch):
    # Pad the batch, run the model, and decode the most likely token per frame.
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch

# Keep only the "predicted" and "target" columns in the result.
result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))

# load_metric ships with older datasets releases; newer setups use evaluate.load("wer") instead.
wer = load_metric("wer")

# The metric returns a fraction (for example 0.176 for 17.6%).
print(wer.compute(predictions=result["predicted"], references=result["target"]))
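
To make the effect of the reference normalization in `map_to_array` concrete, the sketch below applies the same regular expression, lowercasing, and apostrophe replacement to sample sentences. The `normalize` helper and the sample sentences are illustrative and not part of the original script.

```python
import re

# Same character class as chars_to_ignore_regex in the evaluation script.
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"]'


def normalize(sentence):
    """Strip the ignored punctuation, lowercase, and unify apostrophes."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower().replace("’", "'")


print(normalize("Hola, ¿cómo estás?"))  # hola ¿cómo estás
print(normalize('Sí: "vale".'))         # sí vale
```

The inverted marks `¿` and `¡` are not in the ignored set, so they remain in the reference text and count toward the error rate if the model does not produce them.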

Result

The Word Error Rate (WER) on the test set is 17.6%.
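
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the prediction (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below computes it for a single sentence pair in plain Python. The `word_error_rate` helper and the example sentences are illustrative, and the metric used in the script aggregates errors over the whole test set rather than averaging per-sentence scores.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("come" vs "comí") and one deletion ("fresco"): 2 errors / 5 words.
print(word_error_rate("el gato come pescado fresco", "el gato comí pescado"))  # 0.4
```

Because insertions count as errors, WER can exceed 100%, which is one reason it does not map exactly onto an accuracy figure.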

📦 Information Table

Property	Details
Model Type	Wav2Vec2ForCTC
Training Data	Common Voice Spanish dataset
License	Apache-2.0
Tags	speech, audio, automatic-speech-recognition
