bp500-base100k_voxpopuli
Developed by lgris
Speech recognition model optimized for Brazilian Portuguese, trained with 453 hours of audio from 7 public datasets
Release date: 3/2/2022
Model Overview
This model is a Brazilian Portuguese automatic speech recognition (ASR) system based on the Wav2vec 2.0 architecture and fine-tuned on multiple public datasets. It supports both plain CTC decoding and decoding enhanced with a 4-gram language model.
Model Features
Multi-dataset Training
Combines seven Brazilian Portuguese datasets (CETUC, Common Voice, MLS, and others) totaling 453 hours of training data
Language Model Support
An optional 4-gram language model can be applied at decoding time; it lowers WER on most read-speech test sets (e.g., CETUC drops from 0.142 to 0.099), though the all-dataset average shifts slightly from 0.155 to 0.157
Cross-domain Adaptability
Stable performance across different scenarios such as read speech (CETUC) and spontaneous speech (TEDx)
Model Capabilities
Brazilian Portuguese speech-to-text conversion
Supports 16kHz sample rate audio processing
Batch speech recognition
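Since the model expects 16 kHz input, audio recorded at other sample rates should be resampled before inference. A minimal sketch with torchaudio (the file name is a placeholder):

```python
import torchaudio

# Hypothetical input file; replace with your own recording.
speech, sr = torchaudio.load("recording.wav")

# The model was trained on 16 kHz audio; resample anything else.
if sr != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16_000)
    speech = resampler(speech)

# Downmix to mono if the recording has multiple channels.
speech = speech.mean(dim=0)
```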
Use Cases
Speech Transcription
Educational Content Transcription
Convert Portuguese teaching audio into text transcripts
Achieves WER as low as 0.052 on read speech datasets
Automated Meeting Minutes
Real-time transcription of Brazilian Portuguese meetings
WER around 0.317 on spontaneous speech datasets
Voice Assistants
Brazilian Portuguese Voice Command Recognition
Provides voice interaction support for localized smart devices
Excellent performance on short command datasets
🚀 bp500-base100k_voxpopuli: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset
This is a demonstration of a fine-tuned Wav2vec model for Brazilian Portuguese, utilizing the following datasets:
- CETUC: It contains approximately 145 hours of Brazilian Portuguese speech, distributed among 50 male and 50 female speakers. Each speaker pronounces around 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
- Common Voice 7.0: A project proposed by the Mozilla Foundation, aiming to create a large open dataset in various languages. In this project, volunteers donate and validate speech via the official site.
- Lapsbm: "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It consists of 35 speakers (10 female), each pronouncing 20 unique sentences, totaling 700 utterances. The audio was recorded at 22.05 kHz without environmental control.
- Multilingual Librispeech (MLS): A massive dataset available in multiple languages. The MLS is based on audiobook recordings in the public domain, such as LibriVox. The dataset contains a total of 6k hours of transcribed data in many languages. The Portuguese set used in this work (mostly the Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
- Multilingual TEDx: A collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian Portuguese variant) contains 164 hours of transcribed speech.
- Sidney (SID): It includes 5,777 utterances recorded by 72 speakers (20 women) aged from 17 to 59, with details such as place of birth, age, gender, education, and occupation.
- VoxForge: A project aiming to build open datasets for acoustic models. The corpus contains around 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates ranging from 16kHz to 44.1kHz.
These datasets were combined to create a larger Brazilian Portuguese dataset. All data was used for training, except for the Common Voice dev/test sets, which were used for validation and testing respectively. We also created test sets for all the collected datasets.
| Dataset                        | Train  | Valid | Test  |
|--------------------------------|--------|-------|-------|
| CETUC                          | 94.0h  | --    | 5.4h  |
| Common Voice                   | 37.8h  | 8.9h  | 9.5h  |
| LaPS BM                        | 0.8h   | --    | 0.1h  |
| MLS                            | 161.0h | --    | 3.7h  |
| Multilingual TEDx (Portuguese) | 148.9h | --    | 1.8h  |
| SID                            | 7.2h   | --    | 1.0h  |
| VoxForge                       | 3.9h   | --    | 0.1h  |
| Total                          | 453.6h | 8.9h  | 21.6h |
The original model was fine-tuned using fairseq. This notebook uses a converted version of the original model. The link to the original fairseq model is available here.
Summary
| Model                                                     | CETUC | CV    | LaPS  | MLS   | SID   | TEDx  | VF    | AVG   |
|-----------------------------------------------------------|-------|-------|-------|-------|-------|-------|-------|-------|
| bp_500-base100k_voxpopuli (demonstration below)           | 0.142 | 0.201 | 0.052 | 0.224 | 0.102 | 0.317 | 0.048 | 0.155 |
| bp_500-base100k_voxpopuli + 4-gram (demonstration below)  | 0.099 | 0.149 | 0.047 | 0.192 | 0.115 | 0.371 | 0.127 | 0.157 |
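Note that the 4-gram LM improves the read-speech sets (CETUC, Common Voice, LaPS, MLS) but degrades SID, TEDx, and VoxForge, which is why the overall average edges up from 0.155 to 0.157.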
Transcription examples
| Text                                               | Transcription                                     |
|----------------------------------------------------|---------------------------------------------------|
| qual o instagram dele                              | qualo está gramedele                              |
| o capitão foi expulso do exército porque era doido | o capitãl foi exposo do exército porque era doido |
| também por que não                                 | também porque não                                 |
| não existe tempo como o presente                   | não existe tempo como o presente                  |
| eu pulei para salvar rachel                        | eu pulei para salvar haquel                       |
| augusto cezar passos marinho                       | augusto cesa passoesmarinho                       |
✨ Features
- Fine-tunes the Wav2vec 2.0 model on multiple Brazilian Portuguese speech datasets.
- Provides a comprehensive demonstration of the model's performance across those datasets.
- Supports both plain and language-model-enhanced speech recognition.
📦 Installation
Install dependencies
```python
%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
```
Download datasets
```python
%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
%cd bp_dataset
```
💻 Usage Examples
Basic Usage
```python
MODEL_NAME = "lgris/bp500-base100k_voxpopuli"
```
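For a quick sanity check before running the full benchmark below, a minimal sketch of single-file transcription with the greedy (LM-free) decoder; the audio path is a placeholder, and the file is assumed to already be 16 kHz mono:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)

# Hypothetical audio file, assumed to be 16 kHz mono.
speech, sr = torchaudio.load("audio.wav")
inputs = processor(speech.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding: most likely token per frame, then collapse repeats/blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```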
Advanced Usage
Imports and dependencies
```python
import jiwer
import torchaudio
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
from pyctcdecode import build_ctcdecoder
import torch
import re
import sys
```
Helpers
```python
chars_to_ignore_regex = '[\,\?\.\!\;\:\"]'  # noqa: W605

def map_to_array(batch):
    # Load the audio file referenced by the CSV row.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    # Normalize the reference text: strip punctuation and lowercase.
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch

def calc_metrics(truths, hypos):
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # skip pairs jiwer cannot score (e.g., empty strings)
            pass
    wer = sum(wers) / len(wers)
    mer = sum(mers) / len(mers)
    wil = sum(wils) / len(wils)
    return wer, mer, wil

def load_data(dataset):
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
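```
Note that `load_data` assumes each dataset directory contains a `test.csv` with at least a `path` column (the location of an audio file) and a `sentence` column (the reference transcription); these are the fields `map_to_array` consumes.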
Model
```python
class STT:

    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        # Sort the vocabulary by token id so labels line up with logit columns.
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values).logits
        if self.lm:
            # Beam-search decoding rescored with the n-gram language model.
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Greedy CTC decoding: pick the most likely token at each frame.
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
```
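When no LM is passed, decoding is a simple per-frame argmax over the CTC logits; when an `.arpa` file is supplied, pyctcdecode runs a beam search over the same logits with the n-gram model rescoring hypotheses, which is what produces the "+ 4-gram" row in the summary table.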
Tests
```python
stt = STT(MODEL_NAME)
```
CETUC
```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```
Common Voice
```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```
LaPS
```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```
MLS
```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```
SID
```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```
TEDx
```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```
VoxForge
```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```
Tests with LM
```python
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # trained with wikipedia
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')

# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # trained with bp
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
```
CETUC
```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```
Common Voice
```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```
LaPS
```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```
MLS
```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```
SID
```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```
TEDx
```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```
VoxForge
```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```
📄 License
This project is licensed under the Apache-2.0 license.