bp500-base100k_voxpopuli
Developed by lgris
Speech recognition model optimized for Brazilian Portuguese, trained with 453 hours of audio from 7 public datasets
Release date: 3/2/2022
Model Overview
This model is a Brazilian Portuguese automatic speech recognition (ASR) system based on the Wav2vec 2.0 architecture and fine-tuned on multiple public datasets. It supports both plain CTC decoding and decoding enhanced with a 4-gram language model.
Model Features
Multi-dataset Training
Combines seven Brazilian Portuguese datasets (CETUC, Common Voice, MLS, and others) totaling 453 hours of training data
Language Model Support
An optional 4-gram language model can be applied at decoding time; it lowers WER on most read-speech test sets (e.g., CETUC drops from 0.142 to 0.099), though the all-dataset average shifts slightly from 0.155 to 0.157
Cross-domain Adaptability
Stable performance across different scenarios such as read speech (CETUC) and spontaneous speech (TEDx)
Model Capabilities
Brazilian Portuguese speech-to-text conversion
Supports 16kHz sample rate audio processing
Batch speech recognition
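Since the model expects 16 kHz input, audio recorded at other sample rates should be resampled before inference. A minimal sketch with torchaudio (the file name is a placeholder):

```python
import torchaudio

# Hypothetical input file; replace with your own recording.
speech, sr = torchaudio.load("recording.wav")

# The model was trained on 16 kHz audio; resample anything else.
if sr != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16_000)
    speech = resampler(speech)

# Downmix to mono if the recording has multiple channels.
speech = speech.mean(dim=0)
```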
Use Cases
Speech Transcription
Educational Content Transcription
Convert Portuguese teaching audio into text transcripts
Achieves WER as low as 0.052 on read speech datasets
Automated Meeting Minutes
Real-time transcription of Brazilian Portuguese meetings
WER around 0.317 on spontaneous speech datasets
Voice Assistants
Brazilian Portuguese Voice Command Recognition
Provides voice interaction support for localized smart devices
Excellent performance on short command datasets
🚀 bp500-base100k_voxpopuli: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset
This is a demonstration of a fine-tuned Wav2vec model for Brazilian Portuguese, utilizing the following datasets:
- CETUC: It contains approximately 145 hours of Brazilian Portuguese speech, distributed among 50 male and 50 female speakers. Each speaker pronounces around 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
- Common Voice 7.0: A project proposed by the Mozilla Foundation, aiming to create a large open dataset in various languages. In this project, volunteers donate and validate speech via the official site.
- Lapsbm: "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It consists of 35 speakers (10 female), each pronouncing 20 unique sentences, totaling 700 utterances. The audio was recorded at 22.05 kHz without environmental control.
- Multilingual Librispeech (MLS): A massive dataset available in multiple languages. The MLS is based on audiobook recordings in the public domain, such as LibriVox. The dataset contains a total of 6k hours of transcribed data in many languages. The Portuguese set used in this work (mostly the Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
- Multilingual TEDx: A collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian Portuguese variant) contains 164 hours of transcribed speech.
- Sidney (SID): It includes 5,777 utterances recorded by 72 speakers (20 women) aged from 17 to 59, with details such as place of birth, age, gender, education, and occupation.
- VoxForge: A project aiming to build open datasets for acoustic models. The corpus contains around 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates ranging from 16kHz to 44.1kHz.
These datasets were combined to create a larger Brazilian Portuguese dataset. All data was used for training, except for the Common Voice dev/test sets, which were used for validation and testing respectively. We also created test sets for all the collected datasets.
| Dataset                        | Train  | Valid | Test  |
|--------------------------------|--------|-------|-------|
| CETUC                          | 94.0h  | --    | 5.4h  |
| Common Voice                   | 37.8h  | 8.9h  | 9.5h  |
| LaPS BM                        | 0.8h   | --    | 0.1h  |
| MLS                            | 161.0h | --    | 3.7h  |
| Multilingual TEDx (Portuguese) | 148.9h | --    | 1.8h  |
| SID                            | 7.2h   | --    | 1.0h  |
| VoxForge                       | 3.9h   | --    | 0.1h  |
| Total                          | 453.6h | 8.9h  | 21.6h |
The original model was fine-tuned using fairseq. This notebook uses a converted version of the original model. The link to the original fairseq model is available here.
Summary
| Model                                                     | CETUC | CV    | LaPS  | MLS   | SID   | TEDx  | VF    | AVG   |
|-----------------------------------------------------------|-------|-------|-------|-------|-------|-------|-------|-------|
| bp_500-base100k_voxpopuli (demonstration below)           | 0.142 | 0.201 | 0.052 | 0.224 | 0.102 | 0.317 | 0.048 | 0.155 |
| bp_500-base100k_voxpopuli + 4-gram (demonstration below)  | 0.099 | 0.149 | 0.047 | 0.192 | 0.115 | 0.371 | 0.127 | 0.157 |
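Note that the 4-gram LM improves the read-speech sets (CETUC, Common Voice, LaPS, MLS) but degrades SID, TEDx, and VoxForge, which is why the overall average edges up from 0.155 to 0.157.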
Transcription examples
| Text                                               | Transcription                                     |
|----------------------------------------------------|---------------------------------------------------|
| qual o instagram dele                              | qualo está gramedele                              |
| o capitão foi expulso do exército porque era doido | o capitãl foi exposo do exército porque era doido |
| também por que não                                 | também porque não                                 |
| não existe tempo como o presente                   | não existe tempo como o presente                  |
| eu pulei para salvar rachel                        | eu pulei para salvar haquel                       |
| augusto cezar passos marinho                       | augusto cesa passoesmarinho                       |
✨ Features
- Fine-tunes the Wav2vec 2.0 model on multiple Brazilian Portuguese speech datasets.
- Provides a comprehensive demonstration of the model's performance across those datasets.
- Supports both plain and language-model-enhanced speech recognition.
📦 Installation
Install dependencies
```python
%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
```
Download datasets
```python
%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
%cd bp_dataset
```
💻 Usage Examples
Basic Usage
```python
MODEL_NAME = "lgris/bp500-base100k_voxpopuli"
```
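For a quick sanity check before running the full benchmark below, a minimal sketch of single-file transcription with the greedy (LM-free) decoder; the audio path is a placeholder, and the file is assumed to already be 16 kHz mono:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)

# Hypothetical audio file, assumed to be 16 kHz mono.
speech, sr = torchaudio.load("audio.wav")
inputs = processor(speech.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: most likely token per frame, then collapse repeats/blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```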
Advanced Usage
Imports and dependencies
```python
import jiwer
import torchaudio
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
from pyctcdecode import build_ctcdecoder
import torch
import re
import sys
```
Helpers
```python
chars_to_ignore_regex = '[\,\?\.\!\;\:\"]'  # noqa: W605

def map_to_array(batch):
    # Load the audio file referenced by the CSV row.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    # Normalize the reference text: strip punctuation and lowercase.
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch

def calc_metrics(truths, hypos):
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # skip pairs jiwer cannot score (e.g., empty strings)
            pass
    wer = sum(wers) / len(wers)
    mer = sum(mers) / len(mers)
    wil = sum(wils) / len(wils)
    return wer, mer, wil

def load_data(dataset):
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
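```
Note that `load_data` assumes each dataset directory contains a `test.csv` with at least a `path` column (the location of an audio file) and a `sentence` column (the reference transcription); these are the fields `map_to_array` consumes.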
Model
```python
class STT:

    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        # Sort the vocabulary by token id so labels line up with logit columns.
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values).logits
        if self.lm:
            # Beam-search decoding rescored with the n-gram language model.
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Greedy CTC decoding: pick the most likely token at each frame.
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
```
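When no LM is passed, decoding is a simple per-frame argmax over the CTC logits; when an `.arpa` file is supplied, pyctcdecode runs a beam search over the same logits with the n-gram model rescoring hypotheses, which is what produces the "+ 4-gram" row in the summary table.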
Tests
```python
stt = STT(MODEL_NAME)
```
CETUC
```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```
Common Voice
```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```
LaPS
```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```
MLS
```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```
SID
```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```
TEDx
```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```
VoxForge
```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```
Tests with LM
```python
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # trained with wikipedia
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')

# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # trained with bp
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
```
CETUC
```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```
Common Voice
```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```
LaPS
```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```
MLS
```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```
SID
```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```
TEDx
```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```
VoxForge
```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```
📄 License
This project is licensed under the Apache-2.0 license.