wav2vec2-large-xlsr-catala Open Source Model - Free Automatic Speech Recognition for Catalan

Wav2vec2 Large Xlsr Catala

Developed by ccoreilly

Catalan automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53.

Speech Recognition | Other | Open Source License: Apache-2.0 | #Catalan speech recognition #Low word error rate #Parliament speech optimization

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) model for Catalan, fine-tuned on the Common Voice and ParlamentParla datasets. It expects audio input sampled at 16 kHz.

Model Features

Multi-dataset fine-tuning

Trained with both Common Voice and ParlamentParla datasets to enhance model adaptability

Low word error rate

Achieves a word error rate (WER) of 6.92% on the combined Common Voice and ParlamentParla test split

No language model required

Can be used directly without additional language model support

Model Capabilities

Speech recognition

Catalan speech-to-text

16kHz audio processing

Use Cases

Speech transcription

Parliament speech transcription

Convert Catalan parliamentary speeches into text

Reaches a WER of 6.92% on the combined Common Voice and ParlamentParla test split

Audiobook transcription

Convert Catalan audiobook content into text

Achieves a WER of 13.23% on the audiobook “La llegenda de Sant Jordi” (The Legend of Saint George)

Voice assistants

Catalan voice command recognition

For Catalan voice assistant systems

🚀 Wav2Vec2-Large-XLSR-Catalan

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Catalan, using the Common Voice and ParlamentParla datasets, and aims to provide high-quality automatic speech recognition for the language.

🚀 Quick Start

The model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Catalan, using the Common Voice and ParlamentParla datasets.

⚠️ Important Note

The train/dev/test split used here does not fully match the CommonVoice 6.1 split. A custom split combining the CommonVoice and ParlamentParla datasets was used instead and can be found here. Evaluating on the CV test set will produce a biased WER, because 1144 of its audio files were used in training/evaluation of this model. The reported WER was calculated on this test.csv, which the model did not see during training/evaluation.

You can find the training and evaluation scripts in the GitHub repository ccoreilly/wav2vec2-catala.
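
If you still want to evaluate on the Common Voice test set, one way to reduce the bias described above is to drop every clip the model has already seen. The following is a minimal sketch, not the author's procedure: the `path` column matches the field used in the usage code, but the `drop_seen_clips` helper and the example file names are hypothetical, and you would need to build the list of seen clips yourself from the custom split.

```python
import csv
import io


def drop_seen_clips(test_rows, seen_paths):
    """Return only the rows whose clip was not used in training/evaluation."""
    seen = set(seen_paths)  # set membership keeps the filter linear in the row count
    return [row for row in test_rows if row["path"] not in seen]


# Hypothetical stand-in for a Common Voice test manifest.
cv_test_csv = io.StringIO(
    "path,sentence\n"
    "clip_0001.mp3,bon dia\n"
    "clip_0002.mp3,bona nit\n"
    "clip_0003.mp3,fins aviat\n"
)
test_rows = list(csv.DictReader(cv_test_csv))

# Hypothetical list of clips that appear in the custom train/dev split.
seen_in_training = ["clip_0002.mp3"]

unseen_rows = drop_seen_clips(test_rows, seen_in_training)
print([row["path"] for row in unseen_rows])
```

A WER computed on the filtered rows is no longer comparable to published Common Voice test results, since it covers a different subset of clips.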

💡 Usage Tip

When using this model, make sure that your speech input is sampled at 16 kHz.
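
As a quick guard before transcription, you can check a file's sample rate. The sketch below uses only Python's standard `wave` module, so it assumes uncompressed PCM WAV input; the helper names `wav_sample_rate` and `needs_resampling` are made up for illustration and are not part of the model's tooling.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # the rate this model expects


def wav_sample_rate(path):
    """Read the sample rate (in Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected=EXPECTED_SAMPLE_RATE):
    """Return True when the file is not already at the expected rate."""
    return wav_sample_rate(path) != expected
```

Files that fail the check can be resampled with `torchaudio.transforms.Resample`, as the usage example does for 48 kHz audio.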

✨ Features

Fine-tuned on Catalan: Uses the Common Voice and ParlamentParla datasets to adapt the base model to the Catalan language.
High-quality speech recognition: Reaches a WER between 6.92% and 13.23% on three Catalan datasets unseen during training.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ca", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("ccoreilly/wav2vec2-large-xlsr-catala") 
model = Wav2Vec2ForCTC.from_pretrained("ccoreilly/wav2vec2-large-xlsr-catala")

# The resampler assumes 48 kHz source audio (as in Common Voice clips)
# and converts it to the 16 kHz rate the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: read each audio file into an array
# and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
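
The last steps of the example take the most likely token for each audio frame and decode the result directly, with no language model involved. To illustrate what greedy CTC decoding does with those per-frame ids, here is a minimal pure-Python sketch. The toy vocabulary, the `blank_id` of 0, and the `ctc_greedy_decode` helper are assumptions for illustration; the real id-to-character mapping lives in the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Turn per-frame argmax ids into text using the CTC collapsing rule."""
    # 1. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, which only separates repeated characters.
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Join the characters and turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (hypothetical, not this model's real vocabulary).
toy_vocab = {0: "<pad>", 1: "|", 2: "b", 3: "o", 4: "n", 5: "d", 6: "i", 7: "a"}

# One id per audio frame, as torch.argmax(logits, dim=-1) would yield for one clip.
frame_ids = [2, 2, 0, 3, 4, 4, 1, 5, 0, 6, 6, 7]
print(ctc_greedy_decode(frame_ids, toy_vocab))
```

The blank token is what lets CTC emit double letters: the sequence `o, <pad>, o` decodes to "oo", while `o, o` collapses to a single "o".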

📚 Documentation

Results

Word error rate was evaluated on the following datasets unseen by the model:

Dataset	WER
Test split CV+ParlamentParla	6.92%
Google Crowdsourced Corpus	12.99%
Audiobook “La llegenda de Sant Jordi”	13.23%
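
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below is a generic implementation for illustration, not the evaluation script used to produce the figures above; `word_error_rate` is a hypothetical helper name, and it applies no text normalization such as lowercasing or punctuation removal.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref_words)


# One deleted word out of four reference words.
print(word_error_rate("bon dia a tothom", "bon dia tothom"))
```

When scoring a whole test set, the usual convention is to sum the edit distances and the reference word counts over all utterances before dividing, rather than averaging per-utterance rates.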

📄 License

The model is licensed under the Apache-2.0 license.
