Wav2Vec2-Large-XLSR-Catalan
This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Catalan using the Common Voice and ParlamentParla datasets, with the aim of providing high-quality automatic speech recognition for Catalan.
Quick Start
The model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Catalan language, using the Common Voice and ParlamentParla datasets.
Important Note
The train/dev/test split used does not fully match that of the CommonVoice 6.1 dataset. A custom split combining the CommonVoice and ParlamentParla datasets was used and can be found here. Evaluating on the CV test dataset will produce a biased WER, as 1144 audio files from that dataset were used in the training/evaluation of this model. WER was calculated using this test.csv, which was not seen by the model during training/evaluation.
You can find training and evaluation scripts in the GitHub repository ccoreilly/wav2vec2-catala.
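For readers unfamiliar with the metric, the WER mentioned in the note is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The following is a minimal plain-Python sketch of that definition, not the evaluation script from the repository. The function name `word_error_rate` is illustrative, and the sketch applies no text normalization such as lowercasing or punctuation stripping.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis takes i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of four reference words
print(word_error_rate("bon dia a tothom", "bon dia tothom"))  # 0.25
```

For a whole test set, the usual convention is to sum the edit distances and the reference word counts over all utterances before dividing, rather than averaging per-sentence rates.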
Usage Tip
When using this model, make sure that your speech input is sampled at 16 kHz.
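One way to check the sampling rate before running inference is sketched below. It uses only Python's standard `wave` module, so it applies to uncompressed PCM WAV files only; for other formats, the sampling rate returned by `torchaudio.load` serves the same purpose. The helper name `needs_resampling` is illustrative and not part of the model's tooling.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the tip above


def needs_resampling(wav_path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True if the PCM WAV file at wav_path is not sampled at target_rate."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

If the helper returns `True` for a file, resample the audio to 16 kHz before passing it to the processor.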
Features
- Fine-tuned on Catalan: leverages the Common Voice and ParlamentParla datasets to adapt to the Catalan language.
- High-quality speech recognition: demonstrates good performance on multiple Catalan speech datasets.
Usage Examples
Basic Usage
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Common Voice Catalan test split
test_dataset = load_dataset("common_voice", "ca", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("ccoreilly/wav2vec2-large-xlsr-catala")
model = Wav2Vec2ForCTC.from_pretrained("ccoreilly/wav2vec2-large-xlsr-catala")

# The resampler assumes 48 kHz input (as in Common Voice) and converts it to 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    """Read an audio file and store it as a 16 kHz array."""
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token for each frame
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
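The `torch.argmax` and `processor.batch_decode` steps above amount to greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, drop the blank token, and turn word delimiters into spaces. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id, and the function name `ctc_greedy_decode` are hypothetical and chosen for illustration; the model ships its own vocabulary, and this is not the library's implementation.

```python
from itertools import groupby


def ctc_greedy_decode(token_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the remaining ids to text."""
    # Consecutive duplicates come from one token spanning several frames
    collapsed = [token_id for token_id, _ in groupby(token_ids)]
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the model's actual one
vocab = {0: "<pad>", 1: "|", 2: "h", 3: "o", 4: "l", 5: "a"}
frame_ids = [2, 2, 0, 3, 3, 4, 0, 0, 5, 1, 1]
print(ctc_greedy_decode(frame_ids, vocab, blank_id=0))  # hola
```

The blank token is what lets CTC emit a doubled letter: `[4, 0, 4]` decodes to `ll`, whereas `[4, 4]` collapses to a single `l`.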
Documentation
Results
Word error rate was evaluated on the following datasets unseen by the model:
License
The model is licensed under the Apache-2.0 license.