bp500-xlsr Open-source Speech Recognition Model - Accurately Supports Brazilian Portuguese Speech Transcription

Bp500 Xlsr

Developed by lgris

This is a Wav2vec 2.0 model fine-tuned for Brazilian Portuguese on multiple Brazilian Portuguese datasets. It achieves a WER of 13.6% on the Common Voice test set.

Speech Recognition

Transformers

License: Apache-2.0
Tags: Brazilian Portuguese speech recognition, Multi-dataset training, Wav2Vec2 fine-tuning

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model based on the Wav2vec 2.0 architecture and optimized for Brazilian Portuguese. It is trained on several Brazilian Portuguese datasets, including CETUC, Common Voice, and Lapsbm, with more than 400 hours of training audio in total.

Model Features

Multi-dataset training

Integrates 7 different Brazilian Portuguese datasets with total training duration exceeding 400 hours

Language model support

Supports combination with 4-gram language model to further improve recognition accuracy

Low WER

Average WER of 10.8% across seven test sets, or 10.0% when combined with the 4-gram language model

Model Capabilities

Brazilian Portuguese speech recognition

Expects 16 kHz audio input (recordings at other sampling rates need to be resampled first)

Can be combined with language models to enhance performance

Use Cases

Speech-to-text

Speech transcription

Convert Brazilian Portuguese speech content into text

WER of 13.6% on Common Voice test set

Voice assistant

Brazilian Portuguese voice command recognition

Used for front-end speech recognition in Brazilian Portuguese voice assistants

🚀 bp500-xlsr: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset

This project presents a fine-tuned Wav2vec 2.0 model for Brazilian Portuguese. It combines multiple datasets to improve automatic speech recognition in Brazilian Portuguese.

✨ Features

Diverse Datasets: Combines CETUC, Common Voice 7.0, Lapsbm, Multilingual Librispeech (MLS), and other datasets to cover a wide range of Brazilian Portuguese speech scenarios.
Fine-Tuned Model: The model is fine-tuned with the fairseq framework for Brazilian Portuguese speech recognition.
Metrics Evaluation: Performance is measured with the Word Error Rate (WER) metric on several test sets.
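
As a rough illustration of the WER metric mentioned above, the following pure-Python sketch computes a word-level edit distance and divides it by the number of reference words. The function name `word_error_rate` is an assumption made for this example; the evaluation code in this document uses the `jiwer` package instead and averages the per-utterance scores. The sentence pairs are taken from the transcription examples in this document.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Reference/hypothesis pairs from the transcription examples in this document.
pairs = [
    ("mas que bodega", "masque bodega"),
    ("a cortina abriu o show começou", "a cortina abriu o chô começou"),
]
for ref, hyp in pairs:
    print(f"{word_error_rate(ref, hyp):.3f}  {ref!r} -> {hyp!r}")
```

The first pair shows why short utterances can score badly: one merged word counts as a substitution plus a deletion, which is two errors against three reference words.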

📦 Installation

Install Dependencies

%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio===0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip

Download Datasets

%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
%cd bp_dataset
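
The evaluation code in this document reads `<dataset>/test.csv` and accesses the `path` and `sentence` fields of each row. A quick header check before running the evaluation might look like the sketch below. The helper name `missing_columns` and the sample file content are hypothetical, and the real CSV files may contain additional columns.

```python
import csv
import io

# Columns the evaluation code accesses on every row.
REQUIRED_COLUMNS = {"path", "sentence"}


def missing_columns(csv_file) -> set:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# Hypothetical content standing in for <dataset>/test.csv.
sample = io.StringIO("path,sentence\naudio/0001.wav,mas que bodega\n")
print(missing_columns(sample))  # an empty set means both columns are present
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`.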

💻 Usage Examples

Basic Usage

MODEL_NAME = "lgris/bp500-xlsr"

Advanced Usage

import re

import jiwer
import torch
import torchaudio
from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Punctuation stripped from the reference sentences before scoring.
chars_to_ignore_regex = r'[\,\?\.\!\;\:\"]'

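def map_to_array(batch):
    # Load the waveform; squeeze(0) assumes single-channel (mono) audio.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    # The rate is hard-coded, so the files are assumed to be sampled at 16 kHz already.
    batch["sampling_rate"] = 16_000
    # Normalize the reference text: strip punctuation, lowercase, unify apostrophes.
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch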

def calc_metrics(truths, hypos):
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # skip pairs jiwer cannot score, e.g. an empty reference
            pass
    wer = sum(wers)/len(wers)
    mer = sum(mers)/len(mers)
    wil = sum(wils)/len(wils)
    return wer, mer, wil

def load_data(dataset):
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)

class STT:

    def __init__(self, 
                 model_name, 
                 device='cuda' if torch.cuda.is_available() else 'cpu', 
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(), 
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:            
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"], 
                                  sampling_rate=batch["sampling_rate"][0], 
                                  padding=True, 
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        attention_mask = features.attention_mask.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values, attention_mask=attention_mask).logits
        if self.lm:
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch

stt = STT(MODEL_NAME)

# Test on CETUC
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)

# Test on Common Voice
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)

# And so on for other datasets...
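
When no language model is passed, `batch_predict` takes the per-frame argmax of the logits and hands the ids to the processor for decoding. The sketch below illustrates the greedy CTC decoding idea behind that step in pure Python: collapse repeated ids, drop the blank token, and turn word delimiters into spaces. The vocabulary and frame ids are toy values invented for this example; the real vocabulary comes from the model's tokenizer, and the tokenizer's own decoding may differ in details.

```python
from itertools import groupby

# Toy vocabulary for illustration only; the real one comes from the tokenizer.
BLANK = "<pad>"        # Wav2Vec2 CTC tokenizers use the padding token as the blank
WORD_DELIMITER = "|"   # token that marks word boundaries
vocab = [BLANK, WORD_DELIMITER, "a", "b", "e", "o"]


def greedy_ctc_decode(frame_ids, vocab):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    # groupby merges runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [vocab[i] for i in collapsed if vocab[i] != BLANK]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids for a hypothetical utterance.
frame_ids = [0, 3, 3, 0, 5, 5, 5, 0, 2, 1, 1, 0, 4, 4, 0, 0]
print(greedy_ctc_decode(frame_ids, vocab))
```

A blank between two identical ids keeps them as separate characters, which is how CTC represents doubled letters.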

📚 Documentation

Datasets

CETUC: Contains approximately 145 hours of Brazilian Portuguese speech from 50 male and 50 female speakers, with each speaker pronouncing about 1,000 phonetically balanced sentences from the CETEN-Folha corpus.
Common Voice 7.0: A project by the Mozilla Foundation that aims to create an open-source multilingual speech dataset. Volunteers contribute and validate speech on the official site.
Lapsbm: Used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It has 35 speakers (10 female) with 20 unique sentences each, totaling 700 Brazilian Portuguese utterances.
Multilingual Librispeech (MLS): A large-scale multilingual dataset based on public-domain audiobook recordings. The Portuguese subset used in this work has about 284 hours of speech.
VoxForge: A project to build open datasets for acoustic models, with about 100 speakers and 4,130 Brazilian Portuguese utterances.

Model Performance

Model	CETUC	CV	LaPS	MLS	SID	TEDx	VF	AVG
bp_500 (demonstration below)	0.051	0.136	0.032	0.118	0.095	0.248	0.082	0.108
bp_500 + 4-gram (demonstration below)	0.032	0.097	0.022	0.114	0.125	0.246	0.065	0.100
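
To make the effect of the 4-gram language model easier to read, the relative WER change per test set can be derived from the two rows above. The sketch below is plain arithmetic on the table values; the helper name `relative_change` is an assumption made for this example.

```python
# WER values copied from the table above (rows: bp_500 and bp_500 + 4-gram).
datasets = ["CETUC", "CV", "LaPS", "MLS", "SID", "TEDx", "VF"]
baseline = [0.051, 0.136, 0.032, 0.118, 0.095, 0.248, 0.082]
with_lm = [0.032, 0.097, 0.022, 0.114, 0.125, 0.246, 0.065]


def relative_change(before: float, after: float) -> float:
    """Relative WER change in percent; negative values mean the WER went down."""
    return 100.0 * (after - before) / before


changes = {
    name: relative_change(b, a)
    for name, b, a in zip(datasets, baseline, with_lm)
}
for name, change in changes.items():
    print(f"{name:6s} {change:+6.1f}%")
```

On these numbers the language model lowers the WER on six of the seven test sets. SID is the exception, where the WER rises from 0.095 to 0.125.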

Transcription Examples

Text	Transcription
não há um departamento de mediadores independente das federações e das agremiações	não há um dearamento de mediadores independente das federações e das agrebiações
mas que bodega	masque bodega
a cortina abriu o show começou	a cortina abriu o chô começou
por sorte havia uma passadeira	busote avinhoa passadeiro
estou maravilhada está tudo pronto	stou estou maravilhada está tudo pronto

🔧 Technical Details

The original model was fine-tuned with fairseq. The original fairseq model is available [here](https://drive.google.com/file/d/1J8aR1ltDLQFe-dVrGuyxoRm2uyJjCWgf/view?usp=sharing). Fine-tuning trains the model on the combined Brazilian Portuguese datasets, with the Common Voice dev and test sets used for validation and testing, respectively.

📄 License

This project is licensed under the Apache-2.0 license.
