wav2vec2-large-100k-voxpopuli-catala Open-source Speech Recognition Model

Wav2vec2 Large 100k Voxpopuli Catala

Developed by softcatala

A Catalan speech recognition model fine-tuned from facebook/wav2vec2-large-100k-voxpopuli on the Common Voice and ParlamentParla datasets

Category: Speech Recognition. License: Apache-2.0. Tags: Catalan speech recognition, low word error rate, parliament speech adaptation

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model optimized for Catalan, capable of converting Catalan speech into text.

Model Features

Multi-dataset training

Combines Common Voice and ParlamentParla datasets for training, improving model generalization

Low word error rate

Achieves a 5.98% word error rate (WER) on the combined Common Voice and ParlamentParla test split

No language model required

Can be used directly without additional language model support

Model Capabilities

Speech recognition

Speech-to-text

Catalan language processing

Use Cases

Speech transcription

Parliament speech transcription

Convert recordings of Catalan parliamentary speeches into text

Reaches a 5.98% WER on the combined Common Voice and ParlamentParla test split

Audiobook transcription

Convert Catalan audiobooks into text

Achieved a 12.02% word error rate on the audiobook “La llegenda de Sant Jordi”

Voice assistants

Catalan voice command recognition

Speech recognition component for Catalan voice assistant systems

🚀 Wav2Vec2-Large-100k-VoxPopuli-Catalan

This model is fine-tuned from facebook/wav2vec2-large-100k-voxpopuli on the Catalan language, using the Common Voice and ParlamentParla datasets.

📋 Model Information

Property	Details
Model Type	Wav2Vec2-Large-100k-VoxPopuli-Catalan
Training Data	Common Voice, ParlamentParla
Metrics	WER
Tags	audio, automatic-speech-recognition, speech, speech-to-text
License	apache-2.0

📊 Model Index

Name: Catalan VoxPopuli Wav2Vec2 Large
Results:
- Task:
  - Name: Speech Recognition
  - Type: automatic-speech-recognition
- Datasets:
  - Name: Common Voice ca
    - Type: common_voice
    - Args: ca
  - Name: ParlamentParla
    - URL: https://www.openslr.org/59/
- Metrics:
  - Name: Test WER
    - Type: wer
    - Value: 5.98
  - Name: Google Crowdsourced Corpus WER
    - Type: wer
    - Value: 12.14
  - Name: Audiobook “La llegenda de Sant Jordi” WER
    - Type: wer
    - Value: 12.02

🚀 Quick Start

This model is facebook/wav2vec2-large-100k-voxpopuli fine-tuned on the Catalan language using the Common Voice and ParlamentParla datasets.

⚠️ Important Note

The train/dev/test split used does not map exactly onto the CommonVoice 6.1 dataset. A custom split combining the CommonVoice and ParlamentParla datasets was used and can be found here. Evaluating on the CV test set will produce a biased WER, because 1144 audio files from that set were used in the training/evaluation of this model. WER was calculated using this test.csv, which the model did not see during training/evaluation.
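The overlap described above can be guarded against when building your own evaluation set. The following is a minimal sketch, assuming each split is available as a CSV with a `path` column identifying the clip; that column name, the helper names, and the toy file contents are hypothetical and are not taken from the model's repository.

```python
import csv
import io


def load_clip_ids(csv_text, column="path"):
    """Return the set of clip identifiers listed in one CSV split.

    The "path" column name is an assumption made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader}


def drop_seen_clips(eval_ids, seen_ids):
    """Keep only evaluation clips that are absent from the training/dev splits."""
    return sorted(set(eval_ids) - set(seen_ids))


# Toy data standing in for real split files (hypothetical clip names).
train_csv = "path,sentence\nclip_001.mp3,bon dia\nclip_002.mp3,bona nit\n"
cv_test_csv = "path,sentence\nclip_002.mp3,bona nit\nclip_003.mp3,fins aviat\n"

seen = load_clip_ids(train_csv)
candidates = load_clip_ids(cv_test_csv)
clean_eval = drop_seen_clips(candidates, seen)
print(clean_eval)  # ['clip_003.mp3']
```

Computing WER only on the remaining clips avoids counting audio the model may already have been trained on.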

💡 Usage Tip

When using this model, make sure that your speech input is sampled at 16 kHz.
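One way to check the 16 kHz requirement before running inference is to read the sampling rate from the file header. The sketch below uses only Python's standard `wave` module, so it applies to PCM WAV files only (Common Voice clips are distributed in another format and need a different reader). The helper names and the generated silent clips are illustrative assumptions.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the tip above


def wav_sample_rate(path):
    """Read the sampling rate from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


# Demo: one clip at 48 kHz and one at 16 kHz.
with tempfile.TemporaryDirectory() as tmp:
    path_48k = os.path.join(tmp, "clip_48k.wav")
    path_16k = os.path.join(tmp, "clip_16k.wav")
    write_silence(path_48k, 48_000)
    write_silence(path_16k, 16_000)
    flag_48k = needs_resampling(path_48k)
    flag_16k = needs_resampling(path_16k)

print(flag_48k, flag_16k)  # True False
```

Files flagged by `needs_resampling` should be resampled to 16 kHz before being passed to the processor.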

You can find training and evaluation scripts in the GitHub repository ccoreilly/wav2vec2-catala.

✨ Features

Fine-tuned on Catalan language datasets.
Provides word error rate (WER) metrics on multiple datasets.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Common Voice Catalan test split.
test_dataset = load_dataset("common_voice", "ca", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("ccoreilly/wav2vec2-large-100k-voxpopuli-catala")
model = Wav2Vec2ForCTC.from_pretrained("ccoreilly/wav2vec2-large-100k-voxpopuli-catala")

# The resampler assumes 48 kHz source audio; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token at each frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
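The example above decodes without a language model: it takes the argmax token per frame and lets the processor turn the ids into text. As a rough illustration of what greedy CTC decoding does conceptually, here is a minimal pure-Python sketch. The toy vocabulary, the `<pad>` blank token, and the `|` word delimiter are assumptions chosen for the example and do not reproduce this model's actual vocabulary or the library's implementation.

```python
from itertools import groupby


def greedy_ctc_decode(ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Collapse a per-frame id sequence into text, CTC style.

    Steps: merge consecutive repeated ids, drop blank tokens,
    then turn the word delimiter into a space.
    """
    collapsed = [key for key, _ in groupby(ids)]
    tokens = [id_to_token[i] for i in collapsed]
    tokens = [t for t in tokens if t != blank]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (hypothetical, for illustration only).
toy_vocab = {0: "<pad>", 1: "|", 2: "b", 3: "o", 4: "n", 5: "d", 6: "i", 7: "a"}

# One id per audio frame, as produced by an argmax over the logits.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 1, 1, 5, 0, 6, 6, 7, 0, 0]

decoded = greedy_ctc_decode(frame_ids, toy_vocab)
print(decoded)  # bon dia
```

Because repeats are merged before blanks are removed, a blank between two identical ids is what allows a doubled letter to survive decoding.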

📚 Documentation

The model was evaluated on the following datasets, none of which it saw during training:

Dataset	WER
Test split CV+ParlamentParla	5.98%
Google Crowdsourced Corpus	12.14%
Audiobook “La llegenda de Sant Jordi”	12.02%
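For readers unfamiliar with the metric in the table above, WER is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it for a single sentence pair with a standard edit-distance recurrence. It is an illustration only: the function name and example sentences are made up, and the model card does not state which tool or aggregation produced the reported figures.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the ref prefix so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One word ("a") is missing from the hypothesis: 1 error over 4 reference words.
wer = word_error_rate("bon dia a tothom", "bon dia tothom")
print(f"{wer:.2%}")  # 25.00%
```

A corpus-level figure is usually obtained by summing the edit counts over all sentences and dividing by the total number of reference words, not by averaging per-sentence rates.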

📄 License

This model is licensed under the Apache-2.0 license.