---
language:
- pt
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Whisper Medium Portuguese
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 pt
      type: mozilla-foundation/common_voice_11_0
      config: pt
      split: test
      args: pt
    metrics:
    - name: Wer
      type: wer
      value: 6.5785713084850626
---
# Whisper Medium Portuguese 🇧🇷🇵🇹
Welcome to Whisper Medium for transcription in Portuguese 👋🏻
If you are looking to quickly and reliably transcribe Portuguese audio to text, you are in the right place!

With a state-of-the-art Word Error Rate (WER) of just 6.579 on Common Voice 11, this model roughly halves the error rate of prior state-of-the-art wav2vec2 models and improves on the original whisper-medium model by about 1.2× 🚀.
This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) dataset.
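For reference, the reported WER can be approximated with a short evaluation script. The following is a minimal sketch using the 🤗 `datasets` and `evaluate` libraries; the text normalization (simple lowercasing) and the 100-example subset are illustrative assumptions, so the resulting score may differ from the value reported above:

```python
import torch
from datasets import Audio, load_dataset
from evaluate import load
from transformers import pipeline

device = 0 if torch.cuda.is_available() else "cpu"
transcribe = pipeline(
    task="automatic-speech-recognition",
    model="jlondonobo/whisper-medium-pt",
    chunk_length_s=30,
    device=device,
)
transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(
    language="pt", task="transcribe"
)

# Stream the Common Voice 11 Portuguese test split and resample to 16 kHz
common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "pt", split="test", streaming=True
)
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

wer = load("wer")
predictions, references = [], []
for sample in common_voice.take(100):  # small subset for illustration; use the full split to reproduce
    predictions.append(transcribe(sample["audio"])["text"].lower())
    references.append(sample["sentence"].lower())

print(f"WER: {100 * wer.compute(predictions=predictions, references=references):.3f}")
```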
The following table compares our model's results with those of the most downloaded models on the Hub for Portuguese Automatic Speech Recognition 🗣:
## How to use
You can use this model directly with a pipeline. This is especially useful for short audio. For long-form transcription, please use the code in the Long-form transcription section.
```bash
pip install git+https://github.com/huggingface/transformers --force-reinstall
pip install torch
```
```python
>>> from transformers import pipeline
>>> import torch
>>> device = 0 if torch.cuda.is_available() else "cpu"
>>> transcribe = pipeline(
...     task="automatic-speech-recognition",
...     model="jlondonobo/whisper-medium-pt",
...     chunk_length_s=30,
...     device=device,
... )
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="pt", task="transcribe")
>>> transcribe("audio.m4a")["text"]
'Eu falo português.'
```
### Long-form transcription
To improve the performance of long-form transcription, you can convert the Hugging Face model into an original-format Whisper model and use the matching algorithm from the original paper. To do this, you must install whisper and a set of tools developed by @bayartsogt.
```bash
pip install git+https://github.com/openai/whisper.git
pip install git+https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets
```
Then convert the HuggingFace model and transcribe:
```python
>>> import torch
>>> import whisper
>>> from multiple_datasets.hub_default_utils import convert_hf_whisper
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> convert_hf_whisper("jlondonobo/whisper-medium-pt", "local_whisper_model.pt")
>>> model = whisper.load_model("local_whisper_model.pt", device=device)
>>> model.transcribe("long_audio.m4a", language="pt")["text"]
'Olá eu sou o José. Tenho 23 anos e trabalho...'
```
## Training hyperparameters
We used the following hyperparameters for training (an illustrative `Seq2SeqTrainingArguments` mapping follows the list):
- `learning_rate`: 1e-05
- `train_batch_size`: 32
- `eval_batch_size`: 16
- `seed`: 42
- `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 500
- `training_steps`: 5000
- `mixed_precision_training`: Native AMP
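For readers who want to reproduce a similar run, these values map onto `Seq2SeqTrainingArguments` roughly as shown below. This is a hedged sketch following the common Whisper fine-tuning recipe; the output directory, evaluation cadence, and `predict_with_generate` flag are assumptions rather than the exact configuration used here:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the hyperparameters listed above.
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer's default optimizer.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-pt",   # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=5000,
    fp16=True,                          # Native AMP mixed precision
    evaluation_strategy="steps",        # assumed: matches the 1000-step evaluation cadence below
    eval_steps=1000,
    predict_with_generate=True,         # assumed: needed to compute WER during evaluation
)
```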
## Training results
| Training Loss | Epoch | Step | Validation Loss | Wer     |
|:-------------:|:-----:|:----:|:---------------:|:-------:|
| 0.0698        | 1.09  | 1000 | 0.1876          | 7.189   |
| 0.0218        | 3.07  | 2000 | 0.2254          | 7.110   |
| 0.0053        | 5.06  | 3000 | 0.2711          | 6.969   |
| 0.0017        | 7.04  | 4000 | 0.3030          | 6.686   |
| 0.0005        | 9.02  | 5000 | 0.3205          | 6.579 🤗 |
## Framework versions
- Transformers 4.26.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2