# wav2vec2-large-xlsr-catala Open-source Speech Recognition Model

Developed by softcatala

Catalan speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, trained on Common Voice and parliamentary speech datasets

Speech Recognition | Other | Open Source | License: Apache-2.0 | #Catalan speech recognition #Low Word Error Rate (WER) #Parliament speech adaptation

Downloads 64.30k

Release Time: 3/2/2022

Model Overview

This is a model for Catalan Automatic Speech Recognition (ASR), capable of converting Catalan speech into text.

Model Features

Multi-dataset training

Combined training on both Common Voice and parliamentary speech datasets, improving model generalization

Low Word Error Rate

Achieves a 6.92% Word Error Rate (WER) on the combined Common Voice and ParlamentParla test split

No language model required

Can be used directly without additional language model support

Model Capabilities

Catalan speech recognition

Speech-to-text

Use Cases

Speech transcription

Parliamentary recording transcription

Convert parliamentary meeting recordings into text records

Reaches 6.92% WER on the combined Common Voice and ParlamentParla test split

Audiobook transcription

Convert Catalan audiobooks into text

Achieves 13.23% WER on 'The Legend of Saint George' audiobook

Voice assistants

Catalan voice command recognition

Used for supporting Catalan-language voice assistants and smart devices

🚀 Wav2Vec2-Large-XLSR-Català

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for the Catalan language. It was trained on the Common Voice and ParlamentParla datasets and aims to provide high-quality automatic speech recognition for Catalan.

🚀 Quick Start

⚠️ Important Note

The train/dev/test split used here does not fully match the CommonVoice 6.1 dataset. A custom split combining the CommonVoice and ParlamentParla datasets was used (linked from the original model card). Evaluating on the CV test dataset will produce a biased WER, because 1144 audio files from that dataset were used in the training/evaluation of this model. The reported WER was calculated on a test.csv file that the model did not see during training/evaluation.
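One way to avoid the bias described above, when evaluating on a third-party test set, is to drop any clip that also appears in the training or validation data. The sketch below is illustrative only: `drop_seen_clips` is a hypothetical helper, and it assumes each row is a dictionary with a `path` field identifying the clip, which may not match the layout of the actual split files.

```python
def drop_seen_clips(eval_rows, seen_paths, path_key="path"):
    """Return only the evaluation rows whose clip was not used in training.

    eval_rows  -- iterable of dicts, one per clip (assumed layout)
    seen_paths -- clip identifiers used during training/validation
    path_key   -- name of the field that identifies a clip (assumption)
    """
    seen = set(seen_paths)  # set lookup keeps the filter linear in size
    return [row for row in eval_rows if row[path_key] not in seen]


# Toy data standing in for a real test set and training manifest.
eval_rows = [
    {"path": "clip_001.mp3", "sentence": "bon dia"},
    {"path": "clip_002.mp3", "sentence": "bona nit"},
    {"path": "clip_003.mp3", "sentence": "moltes gràcies"},
]
train_paths = ["clip_002.mp3", "clip_999.mp3"]

unseen_rows = drop_seen_clips(eval_rows, train_paths)
print([row["path"] for row in unseen_rows])
```

A WER computed on `unseen_rows` only reflects clips the model has not encountered, which is the property the custom test.csv provides for the reported numbers.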

The training and evaluation scripts are available in the GitHub repository ccoreilly/wav2vec2-catala.

💡 Usage Tip

When using this model, make sure that your speech input is sampled at 16 kHz.
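A quick guard before inference can catch files recorded at another rate. The following is a minimal sketch using only the Python standard library. It assumes uncompressed PCM WAV input, and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of the model's tooling.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, per the tip above


def wav_sample_rate(source):
    """Return the sampling rate (in Hz) stored in a PCM WAV header.

    `source` may be a file path or an open binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, expected_rate=EXPECTED_RATE):
    """Return True when the file's rate differs from the expected rate."""
    return wav_sample_rate(source) != expected_rate
```

Files for which `needs_resampling` returns `True` should be resampled first, for example with `torchaudio.transforms.Resample` as in the usage example in this document.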

✨ Features

Fine-tuned on the Catalan language with high-quality datasets.
Provides automatic speech recognition for Catalan.

💻 Usage Examples

Basic Usage

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Catalan Common Voice test split.
test_dataset = load_dataset("common_voice", "ca", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("ccoreilly/wav2vec2-large-xlsr-catala")
model = Wav2Vec2ForCTC.from_pretrained("ccoreilly/wav2vec2-large-xlsr-catala")

# The resampler assumes 48 kHz input and converts it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)


# Preprocess the dataset: read each audio file into a resampled array.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch


test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token at each frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
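The `torch.argmax` call followed by `processor.batch_decode` in the example above amounts to greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python on a toy vocabulary. The vocabulary, the `blank_id` of 0, and the `|` word delimiter are assumptions chosen for illustration, and `ctc_greedy_decode` is a hypothetical helper, not the processor's actual implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    # The delimiter token marks word boundaries in the character stream.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Toy vocabulary (assumption): id 0 is the blank, "|" separates words.
toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "o", 4: "l", 5: "a"}

# Per-frame argmax ids, as they might come out of the acoustic model.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 5, 5, 1, 1, 0]
print(ctc_greedy_decode(frame_ids, toy_vocab))
```

A blank between two identical ids keeps them as separate characters, which is how doubled letters survive the collapsing step.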

📚 Documentation

Results

Word error rate was evaluated on the following datasets unseen by the model:

| Dataset | WER |
|---|---|
| Test split CV+ParlamentParla | 6.92% |
| Google Crowdsourced Corpus | 12.99% |
| Audiobook “La llegenda de Sant Jordi” | 13.23% |
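For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the prediction, divided by the number of reference words. The following is a minimal single-sentence sketch. `word_error_rate` is a hypothetical helper, and it is not necessarily how the figures above were computed.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to turn the reference words processed
    # so far into the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# One dropped word out of four reference words.
print(word_error_rate("bon dia a tothom", "bon dia tothom"))
```

Corpus-level WER is usually obtained by summing the edits over all utterances and dividing by the total number of reference words, rather than by averaging per-sentence scores.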

📄 License

This model is licensed under the Apache-2.0 license.

| Property | Details |
|---|---|
| Model Type | Wav2Vec2-Large-XLSR-Català |
| Training Data | common_voice, parlament_parla |
| Metrics | wer |
| Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
| License | apache-2.0 |
