whisper-large-v3-ca-3catparla
This is an acoustic model for Automatic Speech Recognition in Catalan. It is based on fine-tuning a well-known pretrained model with Catalan data, with the aim of transcribing Catalan audio to text effectively.
Paper
PDF: 3CatParla: A New Open-Source Corpus of Broadcast TV in Catalan for Automatic Speech Recognition
Model Description
The "whisper-large-v3-ca-3catparla" is an acoustic model designed for Automatic Speech Recognition in Catalan. It was developed by fine-tuning the model "openai/whisper-large-v3" with 710 hours of Catalan data released by Projecte AINA from Barcelona, Spain.
Intended Uses and Limitations
This model can be used for Automatic Speech Recognition (ASR) in Catalan. Its main purpose is to transcribe Catalan audio files into plain text without punctuation.
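For a quick single-file transcription, the high-level pipeline API can wrap the model. The following is a minimal sketch, assuming a recent version of transformers with ffmpeg available; "audio.wav" is a placeholder path to a Catalan recording, not a file shipped with the model:

import torch
from transformers import pipeline

# Minimal sketch: the ASR pipeline wraps the processor and model internally.
# "audio.wav" is a placeholder path to a Catalan recording.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="projecte-aina/whisper-large-v3-ca-3catparla",
    device=0 if torch.cuda.is_available() else -1,  # use a GPU if one is present
)
print(transcriber("audio.wav")["text"])

The full evaluation-style example with an explicit processor and model is given in the inference section below.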
How to Get Started with the Model
To view an updated and functional version of this code, please visit our Notebook.
Installation
To use this model, you need to install the datasets and transformers packages:
Create a virtual environment:
python -m venv /path/to/venv
Activate the environment:
source /path/to/venv/bin/activate
Install the modules:
pip install datasets transformers
For Inference
To transcribe Catalan audio using this model, you can follow this example:
pip install torch
pip install datasets
pip install 'transformers[torch]'
pip install evaluate
pip install jiwer
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the processor and the fine-tuned model, and move the model to the GPU
MODEL_NAME = "projecte-aina/whisper-large-v3-ca-3catparla"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

from datasets import load_dataset, Audio

# Load the test split of 3CatParla and resample its audio to 16 kHz
ds = load_dataset("projecte-aina/3catparla_asr", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
# Transcribe one example and normalize both the reference and the prediction
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer._normalize(batch["normalized_text"])
    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]
    transcription = processor.decode(predicted_ids)
    batch["prediction"] = processor.tokenizer._normalize(transcription)
    return batch
# Run inference over the whole test split
result = ds.map(map_to_pred)
from evaluate import load

# Compute the Word Error Rate (WER) as a percentage
wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
Test result: 0.96 % WER on the 3CatParla test split.
Training Details
Training data
The specific dataset used to create the model is "3CatParla".
Training procedure
This model was obtained by fine-tuning the model "openai/whisper-large-v3" following this tutorial provided by Hugging Face. A sketch of how the hyperparameters below map onto that setup follows the table.
Training Hyperparameters

| Property | Details |
|---|---|
| Language | Catalan |
| Hours of training audio | 710 |
| Learning rate | 1.95e-07 |
| Sample rate | 16,000 Hz |
| Train batch size | 32 (x4 GPUs) |
| Gradient accumulation steps | 1 |
| Eval batch size | 32 |
| Save total limit | 3 |
| Max steps | 19,842 |
| Warmup steps | 1,984 |
| Eval steps | 3,307 |
| Save steps | 3,307 |
| Shuffle buffer size | 480 |
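For orientation, here is a minimal sketch of how the hyperparameters above would map onto the Seq2SeqTrainingArguments of the linked tutorial. This is an illustrative reconstruction, not the exact training script; the output_dir is a placeholder and all other arguments keep their defaults:

from transformers import Seq2SeqTrainingArguments

# Illustrative reconstruction from the hyperparameter table above
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-ca-3catparla",  # placeholder path
    per_device_train_batch_size=32,   # run on 4 GPUs for an effective batch of 128
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=1,
    learning_rate=1.95e-07,
    warmup_steps=1984,
    max_steps=19842,
    eval_strategy="steps",            # requires transformers >= 4.41
    eval_steps=3307,
    save_steps=3307,
    save_total_limit=3,
    predict_with_generate=True,
)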
Citation
If this model contributes to your research, please cite the work:
@inproceedings{hernandez20243catparla,
  title={3CatParla: A New Open-Source Corpus of Broadcast TV in Catalan for Automatic Speech Recognition},
  author={Hern{\'a}ndez Mena, Carlos Daniel and Armentano Oller, Carme and Solito, Sarah and K{\"u}lebi, Baybars},
  booktitle={Proc. IberSPEECH 2024},
  pages={176--180},
  year={2024}
}
Additional Information
Author
The fine-tuning process was carried out in July 2024 at the Language Technologies Unit of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena.
Contact
For more information, please send an email to langtech@bsc.es.
Copyright
Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.
License
Apache-2.0
Funding
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
The training of the model was made possible thanks to the compute time provided by the Barcelona Supercomputing Center through MareNostrum 5.