Wav2Vec2-Large-100k-VoxPopuli-Catalan
This model is fine-tuned from facebook/wav2vec2-large-100k-voxpopuli on the Catalan language, leveraging the Common Voice and ParlamentParla datasets.
Model Information

| Property | Details |
|----------|---------|
| Model Type | Wav2Vec2-Large-100k-VoxPopuli-Catalan |
| Training Data | Common Voice, ParlamentParla |
| Metrics | WER |
| Tags | audio, automatic-speech-recognition, speech, speech-to-text |
| License | apache-2.0 |
Model Index
- Name: Catalan VoxPopuli Wav2Vec2 Large
- Results:
- Task:
- Name: Speech Recognition
- Type: automatic-speech-recognition
- Datasets:
- Name: Common Voice ca
- Type: common_voice
- Args: ca
- Name: ParlamentParla
- URL: https://www.openslr.org/59/
- Metrics:
- Name: Test WER
- Name: Google Crowdsourced Corpus WER
- Name: Audiobook "La llegenda de Sant Jordi" WER
Quick Start
This model is fine-tuned from facebook/wav2vec2-large-100k-voxpopuli on the Catalan language using the Common Voice and ParlamentParla datasets.
Important Note
The train/dev/test split used does not fully map to the Common Voice 6.1 dataset. A custom split combining the Common Voice and ParlamentParla datasets was used and can be found here. Evaluating on the Common Voice test set will therefore produce a biased WER, since 1144 audio files from that set were used in training/evaluation of this model. WER was instead calculated using this test.csv, which the model never saw during training or evaluation.
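As a rough illustration of how such a WER figure can be reproduced, the sketch below scores model transcriptions against reference sentences with the jiwer package; the example strings and the choice of jiwer are assumptions, not taken from the repository's evaluation scripts.

```python
# Hypothetical sketch: compute WER for a held-out test set with the jiwer package.
# The sentences below are placeholders, not real test.csv entries.
from jiwer import wer

references = ["bon dia a tothom", "la llegenda de sant jordi"]   # ground-truth transcripts
hypotheses = ["bon dia a tothom", "la llegenda de san jordi"]    # model output transcripts

print("WER:", wer(references, hypotheses))
```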
Usage Tip
When using this model, make sure that your speech input is sampled at 16kHz.
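If your audio is recorded at a different rate, a minimal sketch of resampling it with torchaudio before inference might look like this (the file name my_audio.wav is a placeholder):

```python
# Sketch: bring arbitrary-rate audio to the 16 kHz the model expects.
import torchaudio

speech, sampling_rate = torchaudio.load("my_audio.wav")  # placeholder file name
if sampling_rate != 16_000:
    speech = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech)
speech = speech.squeeze().numpy()  # 1-D array ready for the processor
```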
You can find training and evaluation scripts in the GitHub repository ccoreilly/wav2vec2-catala.
Features
- Fine-tuned on Catalan-language datasets.
- Provides word error rate (WER) metrics on multiple datasets.
Installation
No dedicated installation steps are required beyond the Python packages used in the usage example below (torch, torchaudio, transformers and datasets), which can be installed with pip.
Usage Examples
Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ca", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("ccoreilly/wav2vec2-large-100k-voxpopuli-catala")
model = Wav2Vec2ForCTC.from_pretrained("ccoreilly/wav2vec2-large-100k-voxpopuli-catala")

# Common Voice clips are 48 kHz; the model expects 16 kHz input.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: read each audio file and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
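For quick experiments on your own recordings, the transformers pipeline API can wrap the same model. This is a sketch assuming ffmpeg is available so the pipeline can decode and resample the audio file itself; my_audio.wav is a placeholder path.

```python
# Sketch: inference via the automatic-speech-recognition pipeline.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="ccoreilly/wav2vec2-large-100k-voxpopuli-catala")
print(asr("my_audio.wav"))  # placeholder path to a Catalan audio file
```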
Documentation
The model's performance is evaluated on datasets unseen by the model during training: the custom Common Voice/ParlamentParla test split (test.csv), the Google Crowdsourced Catalan corpus, and the audiobook "La llegenda de Sant Jordi" (see the WER metrics listed in the Model Index above).
License
This model is licensed under the Apache 2.0 license.