wav2vec2-large-100k-voxpopuli-catala Open Source Model - Accurately Recognize Catalan Speech

Wav2vec2 Large 100k Voxpopuli Catala

Developed by ccoreilly

A Catalan speech recognition model fine-tuned based on facebook/wav2vec2-large-100k-voxpopuli

Speech Recognition | Other | Open Source License: Apache-2.0
#Catalan speech recognition #Low word error rate #Parliament speech optimization

Downloads 56

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Catalan, fine-tuned using the Common Voice and ParlamentParla datasets, capable of converting Catalan speech to text.

Model Features

Multi-dataset training

Trained using both Common Voice and ParlamentParla datasets to enhance model generalization

Low word error rate

Achieves a word error rate (WER) of 5.98% on a held-out test split that combines Common Voice and ParlamentParla

16kHz sampling rate support

Expects audio input sampled at 16kHz; audio at other rates should be resampled first

Model Capabilities

Catalan speech recognition

Speech-to-text

Automatic speech recognition

Use Cases

Speech transcription

Parliament speech transcription

Convert recordings of Catalan parliament speeches into text transcripts

Reports a 5.98% WER on a test split that includes ParlamentParla recordings

Voice assistants

Provide speech recognition capabilities for Catalan voice assistants

Education

Language learning applications

Used for pronunciation assessment features in Catalan language learning apps

🚀 Wav2Vec2-Large-100k-VoxPopuli-Català

This model is a fine-tuned version of facebook/wav2vec2-large-100k-voxpopuli for the Catalan language, aiming to provide high-quality automatic speech recognition for Catalan.

🚀 Quick Start

⚠️NOTICE⚠️: THIS MODEL HAS BEEN MOVED TO THE FOLLOWING URL: https://huggingface.co/softcatala/wav2vec2-large-100k-voxpopuli-catala

This model is fine-tuned on the Common Voice and ParlamentParla datasets for the Catalan language.

⚠️ Important Note

The train/dev/test split used does not fully match the Common Voice 6.1 split. A custom split combining the Common Voice and ParlamentParla datasets was used and can be found here. Evaluating on the Common Voice test set will produce a biased WER, because 1144 audio files from that set were used in the training/evaluation of this model. WER was calculated using this test.csv, which was not seen by the model during training/evaluation.
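The bias described above comes from clips that appear both in the data the model learned from and in the set used for scoring. A minimal sketch of how such overlap could be detected and excluded follows. The file names are hypothetical and chosen only for illustration; the real split files may identify clips differently.

```python
def split_overlap(train_paths, eval_paths):
    """Separate evaluation items into those also seen in training and the rest."""
    seen = set(train_paths)
    leaked = [path for path in eval_paths if path in seen]
    clean = [path for path in eval_paths if path not in seen]
    return leaked, clean


# Hypothetical clip names, for illustration only.
train = ["clip_001.mp3", "clip_002.mp3", "clip_003.mp3"]
cv_test = ["clip_002.mp3", "clip_104.mp3", "clip_003.mp3"]

leaked, clean = split_overlap(train, cv_test)
print(len(leaked), "clips overlap with training;", len(clean), "are safe to score")
```

Scoring only the `clean` subset avoids rewarding the model for clips it has already seen.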

💡 Usage Tip

When using this model, make sure that your speech input is sampled at 16kHz.
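One way to check the sampling rate before transcription, as a minimal sketch that assumes the input is an uncompressed PCM WAV file and uses only Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of the model card.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the tip above


def wav_sample_rate(source):
    """Return the sampling rate (Hz) from a PCM WAV header.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, target_rate=TARGET_RATE):
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(source) != target_rate
```

If `needs_resampling` returns `True`, the audio should be resampled to 16kHz before it is passed to the processor, for example with `torchaudio.transforms.Resample` as in the Basic Usage example.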

✨ Features

Multi-dataset Training: Trained on multiple datasets including Common Voice and ParlamentParla to enhance the model's generalization ability.
High-Quality Speech Recognition: Achieves relatively low word error rates on various Catalan speech datasets.

📦 Installation

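The original model card lists no installation steps. The usage example below imports torch, torchaudio, datasets, and transformers, so those packages need to be available.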

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Catalan Common Voice test split.
test_dataset = load_dataset("common_voice", "ca", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("ccoreilly/wav2vec2-large-100k-voxpopuli-catala")
model = Wav2Vec2ForCTC.from_pretrained("ccoreilly/wav2vec2-large-100k-voxpopuli-catala")

# Common Voice clips are 48kHz; the model expects 16kHz input.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Inference only, so gradients are not needed.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token for every audio frame, then decode to text.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
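The last two steps of the example (taking the `argmax` per frame, then calling `batch_decode`) amount to greedy CTC decoding: consecutive repeated ids are collapsed and blank tokens are dropped. The sketch below illustrates that idea in plain Python. The toy vocabulary and the choice of id 0 as the blank are assumptions for illustration; the real vocabulary ships with the processor, and this is not the library's implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse consecutive repeated ids, then drop blanks (greedy CTC decoding)."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "h", 2: "o", 3: "l", 4: "a"}
frames = [1, 1, 0, 2, 2, 0, 3, 3, 3, 4, 0]  # one predicted id per audio frame

print(ctc_greedy_decode(frames, vocab))
```

A blank between two identical ids keeps them as separate characters, which is how doubled letters survive the collapse step.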

📚 Documentation

Model Information

Property	Details
Language	Catalan
Datasets	Common Voice, ParlamentParla
Metrics	WER (Word Error Rate)
Tags	audio, automatic-speech-recognition, speech, speech-to-text
License	apache-2.0

Model Index

Name: Catalan VoxPopuli Wav2Vec2 Large
Results:
- Task:
  - Name: Speech Recognition
    Type: automatic-speech-recognition
- Datasets:
  - Name: Common Voice ca
    Type: common_voice
    Args: ca
  - Name: ParlamentParla
    URL: https://www.openslr.org/59/
- Metrics:
  - Name: Test WER
    Type: wer
    Value: 5.98
  - Name: Google Crowdsourced Corpus WER
    Type: wer
    Value: 12.14
  - Name: Audiobook “La llegenda de Sant Jordi” WER
    Type: wer
    Value: 12.02

Results

Word error rate was evaluated on the following datasets unseen by the model:

Dataset	WER
Test split CV+ParlamentParla	5.98%
Google Crowdsourced Corpus	12.14%
Audiobook “La llegenda de Sant Jordi”	12.02%
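For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation script used for the figures above. It applies no text normalization, and the Catalan sentences in the usage note are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

For example, `word_error_rate("bon dia a tothom", "bon dia tothom")` gives 0.25: one deleted word out of four reference words. Corpus-level figures are usually computed by summing edits and reference words over all utterances rather than averaging per-sentence rates.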

📄 License

This model is licensed under the Apache-2.0 license.
