whisper-large-v3-ca-3catparla
This is an acoustic model for Automatic Speech Recognition in Catalan. It is based on fine-tuning a well-known pretrained model with Catalan data, with the aim of transcribing Catalan audio to text effectively.
Paper
PDF: 3CatParla: A New Open-Source Corpus of Broadcast TV in Catalan for Automatic Speech Recognition
Model Description
The "whisper-large-v3-ca-3catparla" is an acoustic model designed for Automatic Speech Recognition in Catalan. It was developed by fine-tuning the model "openai/whisper-large-v3" with 710 hours of Catalan data released by Projecte AINA from Barcelona, Spain.
Intended Uses and Limitations
This model can be used for Automatic Speech Recognition (ASR) in Catalan. Its main purpose is to transcribe Catalan audio files into plain text without punctuation.
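For a quick single-file transcription, the high-level pipeline API can wrap the model. The following is a minimal sketch, assuming a recent version of transformers with ffmpeg available; "audio.wav" is a placeholder path to a Catalan recording, not a file shipped with the model:

import torch
from transformers import pipeline

# Minimal sketch: the ASR pipeline wraps the processor and model internally.
# "audio.wav" is a placeholder path to a Catalan recording.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="projecte-aina/whisper-large-v3-ca-3catparla",
    device=0 if torch.cuda.is_available() else -1,  # use a GPU if one is present
)
print(transcriber("audio.wav")["text"])

The full evaluation-style example with an explicit processor and model is given in the inference section below.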
How to Get Started with the Model
To view an updated and functional version of this code, please visit our Notebook.
Installation
To use this model, you need to install the datasets and transformers packages:
Create a virtual environment:
python -m venv /path/to/venv
Activate the environment:
source /path/to/venv/bin/activate
Install the modules:
pip install datasets transformers
For Inference
To transcribe Catalan audio using this model, you can follow this example:
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the processor and the fine-tuned model, and move the model to the GPU
MODEL_NAME = "projecte-aina/whisper-large-v3-ca-3catparla"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

from datasets import load_dataset, Audio

# Load the test split of 3CatParla and resample its audio to 16 kHz
ds = load_dataset("projecte-aina/3catparla_asr", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
# Transcribe one example and normalize both the reference and the prediction
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch["normalized_text"])
    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch
# Run inference over the whole test split
result = ds.map(map_to_pred)
from evaluate import load

# Compute the Word Error Rate (WER) as a percentage
wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
Test result: 0.96 % WER on the 3CatParla test split.
Training Details
Training data
The specific dataset used to create the model is "3CatParla".
Training procedure
This model was obtained by fine-tuning the model "openai/whisper-large-v3" following this tutorial provided by Hugging Face. A sketch of how the hyperparameters below map onto that setup follows the table.
Training Hyperparameters

| Property | Details |
|---|---|
| Language | Catalan |
| Hours of training audio | 710 |
| Learning rate | 1.95e-07 |
| Sample rate | 16,000 Hz |
| Train batch size | 32 (x4 GPUs) |
| Gradient accumulation steps | 1 |
| Eval batch size | 32 |
| Save total limit | 3 |
| Max steps | 19,842 |
| Warmup steps | 1,984 |
| Eval steps | 3,307 |
| Save steps | 3,307 |
| Shuffle buffer size | 480 |
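For orientation, here is a minimal sketch of how the hyperparameters above would map onto the Seq2SeqTrainingArguments of the linked tutorial. This is an illustrative reconstruction, not the exact training script; the output_dir is a placeholder and all other arguments keep their defaults:

from transformers import Seq2SeqTrainingArguments

# Illustrative reconstruction from the hyperparameter table above
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-ca-3catparla",  # placeholder path
    per_device_train_batch_size=32,   # run on 4 GPUs for an effective batch of 128
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.95e-07,
    warmup_steps=1984,
    max_steps=19842,
    eval_strategy="steps",            # requires transformers >= 4.41
    eval_steps=3307,
    save_steps=3307,
    save_total_limit=3,
    predict_with_generate=True,
)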
Citation
If this model contributes to your research, please cite the work:
@inproceedings{hernandez20243catparla,
  title={3CatParla: A New Open-Source Corpus of Broadcast TV in Catalan for Automatic Speech Recognition},
  author={Hern{\'a}ndez Mena, Carlos Daniel and Armentano Oller, Carme and Solito, Sarah and K{\"u}lebi, Baybars},
  booktitle={Proc. IberSPEECH 2024},
  pages={176--180},
  year={2024}
}
Additional Information
Author
The fine-tuning process was carried out in July 2024 at the Language Technologies Unit of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena.
Contact
For more information, please send an email to langtech@bsc.es.
Copyright
Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.
License
Apache-2.0
Funding
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
The training of the model was made possible thanks to the compute time provided by the Barcelona Supercomputing Center through MareNostrum 5.