Bp500 Base10k Voxpopuli
Developed by lgris · Released 3/2/2022
A Wav2vec 2.0 speech recognition model optimized for Brazilian Portuguese, fine-tuned on multiple Brazilian Portuguese datasets
Model Overview
This model is an automatic speech recognition (ASR) system based on the Wav2vec 2.0 architecture, optimized for Brazilian Portuguese, and performs well across multiple Brazilian Portuguese test sets
Model Features
Multi-dataset Training
Combines multiple Brazilian Portuguese datasets including CETUC, Common Voice, and LaPS BM, totaling over 450 hours of training data
Language Model Support
Can be combined with a 4-gram language model to further improve recognition accuracy
Extensive Testing and Validation
Comprehensively evaluated on 7 different test sets with an average WER of 18.1%
Model Capabilities
Brazilian Portuguese Speech Recognition
Speech-to-Text
Supports 16kHz Sample Rate Audio Processing
Use Cases
Speech Transcription
Brazilian Portuguese Speech Transcription
Convert Brazilian Portuguese speech into text
Achieves a WER of 12.0% on the CETUC test set, reduced to 7.4% when decoding with a 4-gram language model
Voice Assistants
Brazilian Portuguese Voice Command Recognition
Used for voice command recognition in Brazilian Portuguese voice assistants or smart home devices
🚀 bp500-base10k_voxpopuli: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset
This project provides a fine-tuned Wav2vec 2.0 model for Brazilian Portuguese. It combines multiple datasets to improve the model's automatic speech recognition performance for Brazilian Portuguese.
🔍 Key Information
Property | Details |
---|---|
Datasets | Common Voice, MLS, CETUC, LaPS BM, VoxForge, TEDx, SID |
Metrics | WER (Word Error Rate) |
Tags | audio, speech, wav2vec2, pt, portuguese-speech-corpus, automatic-speech-recognition, PyTorch, hf-asr-leaderboard |
Model Name | bp500-base10k_voxpopuli |
License | apache-2.0 |
📊 Model Results
The model bp500-base10k_voxpopuli was evaluated on the Common Voice dataset for the automatic speech recognition task, achieving a test WER of 24.9%.
📦 Datasets Used
- CETUC: contains approximately 145 hours of Brazilian Portuguese speech from 50 male and 50 female speakers, each pronouncing about 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
- Common Voice 7.0: a project by the Mozilla Foundation that aims to create an open-source dataset in multiple languages. Volunteers contribute and validate speech through the official site.
- [Lapsbm](https://github.com/falabrasil/gitlab-resources): used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It has 35 speakers (10 female), each pronouncing 20 unique sentences, for a total of 700 utterances. The audio was recorded at 22.05 kHz without environmental control.
- Multilingual LibriSpeech (MLS): a large-scale dataset available in many languages. The Portuguese subset (mostly the Brazilian variant) used in this work has about 284 hours of speech from 55 audiobooks read by 62 speakers.
- Multilingual TEDx: a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly Brazilian Portuguese) contains 164 hours of transcribed speech.
- Sidney (SID): contains 5,777 utterances from 72 speakers (20 women) aged 17-59, with additional speaker metadata such as place of birth, age, gender, education, and occupation.
- VoxForge: a project to build open datasets for acoustic models. The corpus has around 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates ranging from 16 kHz to 44.1 kHz (see the resampling sketch below).
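Note that the model expects 16 kHz input while source sample rates vary, so some audio may need resampling before inference. A minimal sketch using torchaudio; the file path here is hypothetical:

```python
import torchaudio

# Load an utterance and resample it to the 16 kHz rate the model expects.
speech, sr = torchaudio.load("voxforge_dataset/audio/sample.wav")  # hypothetical path
if sr != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16_000)
    speech = resampler(speech)
```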
📊 Dataset Usage Statistics
Dataset | Train | Valid | Test |
---|---|---|---|
CETUC | 94.0h | -- | 5.4h |
Common Voice | 37.8h | 8.9h | 9.5h |
LaPS BM | 0.8h | -- | 0.1h |
MLS | 161.0h | -- | 3.7h |
Multilingual TEDx (Portuguese) | 148.9h | -- | 1.8h |
SID | 7.2h | -- | 1.0h |
VoxForge | 3.9h | -- | 0.1h |
Total | 453.6h | 8.9h | 21.6h |
📈 Model Summary
Model | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
---|---|---|---|---|---|---|---|---|
bp500-base10k_voxpopuli | 0.120 | 0.249 | 0.039 | 0.227 | 0.169 | 0.349 | 0.116 | 0.181 |
bp500-base10k_voxpopuli + 4-gram | 0.074 | 0.174 | 0.032 | 0.182 | 0.181 | 0.349 | 0.111 | 0.157 |
📝 Transcription Examples
Text | Transcription |
---|---|
suco de uva e água misturam bem | suco deúva e água misturão bem |
culpa do dinheiro | cupa do dinheiro |
eu amo shooters call of duty é o meu favorito | eu omo shúters cofedete é meu favorito |
você pode explicar por que isso acontece | você pode explicar por que isso ontece |
no futuro você desejará ter começado a investir hoje | no futuro você desejará a ter começado a investir hoje |
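The WER figures above are standard word error rates: (substitutions + deletions + insertions) divided by the number of reference words. As a quick sanity check with jiwer, the second example above has one substituted word out of three:

```python
import jiwer

# "cupa" replaces "culpa": 1 substitution over 3 reference words.
print(jiwer.wer("culpa do dinheiro", "cupa do dinheiro"))  # ≈ 0.333
```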
💻 Usage Examples
Basic Usage
```python
MODEL_NAME = "lgris/bp500-base10k_voxpopuli"
```
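For a single recording, a minimal end-to-end sketch (the WAV path is hypothetical, and the audio is assumed to be mono at 16 kHz; see the full pipeline below for batch evaluation):

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)

speech, sr = torchaudio.load("example.wav")  # hypothetical mono 16 kHz file
inputs = processor(speech.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```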
Imports and Dependencies
```python
%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
```

```python
import re

import jiwer
import torch
import torchaudio
from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
```
Helpers
```python
chars_to_ignore_regex = r'[\,\?\.\!\;\:\"]'  # punctuation stripped before scoring

def map_to_array(batch):
    # Load the audio and normalize the reference transcription.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch

def calc_metrics(truths, hypos):
    # Average per-utterance WER, MER, and WIL over a test set.
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except ValueError:  # skip empty references
            pass
    wer = sum(wers) / len(wers)
    mer = sum(mers) / len(mers)
    wil = sum(wils) / len(wils)
    return wer, mer, wil

def load_data(dataset):
    # Each dataset directory is expected to provide a test.csv split.
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
```
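A quick illustration of calc_metrics on a single hypothetical reference/hypothesis pair (taken from the transcription examples above):

```python
# Hypothetical sanity check on one utterance pair.
wer, mer, wil = calc_metrics(["suco de uva e água misturam bem"],
                             ["suco deúva e água misturão bem"])
print(f"WER={wer:.3f} MER={mer:.3f} WIL={wil:.3f}")
```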
Model
```python
class STT:
    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        # pyctcdecode expects labels ordered by vocabulary index.
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values).logits
        if self.lm:
            # Beam-search decoding rescored with the external n-gram LM.
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Plain greedy CTC decoding.
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
```
Download Datasets
```python
%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
%cd bp_dataset
```
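For reference, load_data above expects each *_dataset directory to contain a test.csv; judging from map_to_array, it needs at least path and sentence columns. A hypothetical layout:

```python
# Hypothetical test.csv layout expected by load_data / map_to_array:
#
#   path,sentence
#   cetuc_dataset/audio/0001.wav,suco de uva e água misturam bem
```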
Tests
```python
stt = STT(MODEL_NAME)
```
CETUC
```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```
Common Voice
```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```
LaPS
```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```
MLS
```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```
SID
```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```
TEDx
```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```
VoxForge
```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```
Tests with LM
```python
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # LM trained on Wikipedia text
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')

# Alternative LM trained on the BP corpus:
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
```
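The downloaded files are ordinary word-level 4-gram ARPA models. If you want to try your own LM instead, a sketch using KenLM's lmplz (the pip install above only provides the Python module, so this assumes the KenLM command-line binaries were built separately; corpus.txt is a hypothetical one-sentence-per-line text file):

```python
# Hypothetical: build a word-level 4-gram ARPA LM with KenLM's lmplz.
!lmplz -o 4 --text corpus.txt --arpa pt-BR-custom.word.4-gram.arpa
# stt = STT(MODEL_NAME, lm='pt-BR-custom.word.4-gram.arpa')
```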
CETUC
```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```
Common Voice
```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```
LaPS
```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```
MLS
```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```
SID
```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```
TEDx
```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```
VoxForge
```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```
📄 License
This project is licensed under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.