bp400-xlsr: Open-Source Speech Recognition Model for Brazilian Portuguese

Developed by lgris

A Wav2vec 2.0 speech recognition model fine-tuned on Brazilian Portuguese datasets, supporting automatic speech recognition tasks for Brazilian Portuguese.

Speech Recognition

Transformers

License: Apache-2.0
Tags: Brazilian Portuguese Speech Recognition, Multi-dataset Training, Low WER

Downloads 55

Release Date: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) system optimized for Brazilian Portuguese, based on the Wav2vec 2.0 architecture and fine-tuned on multiple Brazilian Portuguese datasets.

Model Features

Multi-dataset Training

The model is fine-tuned on seven Brazilian Portuguese datasets (CETUC, Common Voice 7.0, Lapsbm, Multilingual Librispeech, Multilingual TEDx, Sidney, and VoxForge), totaling over 400 hours of training data.

Language Model Support

The model can be combined with a 4-gram language model at decoding time, which reduces the average WER across the test sets from 12.4% to 10.5%.

High Accuracy

With the 4-gram language model, the model reaches a WER of 3.0% on the CETUC test set and 9.6% on the Common Voice test set (5.2% and 14.0% without the language model).

Model Capabilities

Brazilian Portuguese Speech Recognition

Audio Transcription

Speech-to-Text

Use Cases

Speech Transcription

Brazilian Portuguese Speech Transcription

Convert Brazilian Portuguese speech content into text

Reaches a 3.0% WER on the CETUC test set when decoded with the 4-gram language model

Voice Assistants

Brazilian Portuguese Voice Command Recognition

Used for command recognition in Brazilian Portuguese voice assistant systems

🚀 bp400-xlsr: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset

This is a Wav2vec 2.0 model fine-tuned for Brazilian Portuguese. It is trained on multiple datasets to improve automatic speech recognition accuracy for Brazilian Portuguese.

✨ Features

Diverse Datasets: Fine-tuned on seven Brazilian Portuguese speech datasets: CETUC, Common Voice 7.0, Lapsbm, Multilingual Librispeech (MLS), Multilingual TEDx, Sidney (SID), and VoxForge.
Fine-Tuned Model: The original model is fine-tuned with fairseq to adapt it to Brazilian Portuguese.
Performance Metrics: Performance is reported as Word Error Rate (WER) on each dataset's test set.
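
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the model output, divided by the number of reference words. The sketch below is a minimal pure-Python illustration; the function name `word_error_rate` is hypothetical and is not part of this model or of `jiwer`, the library the evaluation code in this document actually uses.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Pair taken from the transcription examples in this document.
reference = "alguém sabe a que horas começa o jantar"
hypothesis = "alguém sabe a que horas começo jantar"
print(word_error_rate(reference, hypothesis))  # 2 errors / 8 words = 0.25
```

Here one substitution ("começa" to "começo") and one deletion ("o") against eight reference words give a WER of 0.25.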

📦 Installation

Install the required Python libraries with the commands below. They are written for a Jupyter or Colab notebook cell, hence the `%%capture` magic and the `!` prefixes; in a regular shell, drop both.

%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio===0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
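
After installation, it can help to confirm which versions actually ended up in the environment. The snippet below is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the pip commands above.

```python
from importlib import metadata

# Distribution names as they appear in the pip commands above.
REQUIRED = ["torch", "torchvision", "torchaudio", "datasets", "jiwer",
            "transformers", "soundfile", "pyctcdecode", "kenlm"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:14s} {version or 'NOT INSTALLED'}")
```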

💻 Usage Examples

Basic Usage

MODEL_NAME = "lgris/bp400-xlsr"
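
As a minimal sketch of how the model name might be used to transcribe a single file, the function below loads the checkpoint with the same `transformers` classes used under Advanced Usage and decodes greedily, without a language model. The function name `transcribe`, the constant `TARGET_SAMPLE_RATE`, and the mono mix-down and resampling steps are assumptions added for illustration; they are not part of the original model card.

```python
MODEL_NAME = "lgris/bp400-xlsr"
TARGET_SAMPLE_RATE = 16_000  # rate hard-coded by the evaluation code in this document


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file with greedy decoding (no language model)."""
    # Heavy dependencies are imported here so that merely defining the
    # function does not require them to be loaded.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME).eval()

    waveform, sample_rate = torchaudio.load(audio_path)  # shape: (channels, frames)
    waveform = waveform.mean(dim=0)  # mix down to mono
    if sample_rate != TARGET_SAMPLE_RATE:
        resample = torchaudio.transforms.Resample(sample_rate, TARGET_SAMPLE_RATE)
        waveform = resample(waveform)

    inputs = processor(waveform.numpy(), sampling_rate=TARGET_SAMPLE_RATE,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values,
                       attention_mask=inputs.get("attention_mask")).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # best token per frame
    return processor.batch_decode(predicted_ids)[0]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a placeholder path, downloads the checkpoint on first use and returns the transcription as a string. Reloading the model on every call is wasteful for more than a few files; the `STT` class under Advanced Usage loads it once and processes batches.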

Advanced Usage

import re

import jiwer
import torch
import torchaudio
from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Punctuation stripped from the reference sentences before scoring
chars_to_ignore_regex = r'[,?.!;:"]'

def map_to_array(batch):
    # Load the waveform and normalize the reference text of one example
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000  # audio is assumed to be 16 kHz already; no resampling is done
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch

def calc_metrics(truths, hypos):
    # Mean of the per-utterance WER, MER and WIL scores (not corpus-level scores)
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # e.g. jiwer rejecting an empty reference string
            pass
    wer = sum(wers)/len(wers)
    mer = sum(mers)/len(mers)
    wil = sum(wils)/len(wils)
    return wer, mer, wil

def load_data(dataset):
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)

class STT:

    def __init__(self, 
                 model_name, 
                 device='cuda' if torch.cuda.is_available() else 'cpu', 
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(), 
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:            
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"], 
                                  sampling_rate=batch["sampling_rate"][0], 
                                  padding=True, 
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        attention_mask = features.attention_mask.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values, attention_mask=attention_mask).logits
        if self.lm:
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch

%%capture
# Download datasets (the cell magic must be the first line of the notebook cell)
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/

# Tests
stt = STT(MODEL_NAME)

# CETUC
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)

# Common Voice
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)

# LaPS
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)

# MLS
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)

# SID
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)

# TEDx
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)

# VoxForge
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)

# Tests with LM
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # trained with wikipedia
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # trained with bp
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')

# Cetuc
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)

# Common Voice
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)

# LaPS
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)

# MLS
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)

# SID
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8) 
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
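
When no language model is supplied, `batch_predict` above takes the highest-scoring token for every audio frame and lets the processor turn that sequence into text. The sketch below illustrates what that greedy CTC decoding step does in principle: merge consecutive repeats, drop the blank token, and map the word delimiter to a space. It is a toy pure-Python illustration, not the `transformers` implementation; the function name `ctc_greedy_decode` and the four-entry vocabulary are hypothetical, and treating `<pad>` as the blank and `|` as the word delimiter is an assumption based on the usual Wav2Vec2 tokenizer defaults.

```python
def ctc_greedy_decode(frame_scores, vocab, blank="<pad>", word_sep="|"):
    """Collapse per-frame argmax labels into text, CTC style."""
    # 1. Pick the best label for every frame (the role of torch.argmax).
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]

    # 2. Merge consecutive repeats and 3. drop blank tokens.
    tokens = []
    previous = None
    for idx in best:
        if idx != previous and vocab[idx] != blank:
            tokens.append(vocab[idx])
        previous = idx

    # 4. The word delimiter token becomes a space.
    return "".join(tokens).replace(word_sep, " ").strip()


# Toy vocabulary and one-hot "logits" for eight frames (illustrative only).
vocab = ["<pad>", "|", "o", "i"]
frame_labels = [2, 2, 0, 3, 1, 2, 0, 2]  # o o <pad> i | o <pad> o
frame_scores = [[1.0 if k == label else 0.0 for k in range(len(vocab))]
                for label in frame_labels]

print(ctc_greedy_decode(frame_scores, vocab))  # oi oo
```

The blank between the last two "o" frames is what lets CTC emit a genuinely doubled letter; without it, the two frames would be merged into one.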

📚 Documentation

Datasets

Property	Details
Model Type	bp400-xlsr: Wav2vec 2.0 fine-tuned for Brazilian Portuguese
Training Data	CETUC, Common Voice 7.0, Lapsbm, Multilingual Librispeech (MLS), Multilingual TEDx, Sidney (SID), VoxForge

Summary

Word Error Rate (WER) on each test set; lower is better. CV is Common Voice, VF is VoxForge, and AVG is the average across the test sets.

Model	CETUC	CV	LaPS	MLS	SID	TEDx	VF	AVG
bp_400 (demonstrated under Usage Examples)	0.052	0.140	0.074	0.117	0.121	0.245	0.118	0.124
bp_400 + 3-gram	0.033	0.095	0.046	0.123	0.112	0.212	0.123	0.106
bp_400 + 4-gram (demonstrated under Usage Examples)	0.030	0.096	0.043	0.106	0.118	0.229	0.117	0.105
bp_400 + 5-gram	0.033	0.094	0.043	0.123	0.111	0.210	0.123	0.105
bp_400 + Transf.	0.032	0.092	0.036	0.130	0.115	0.215	0.125	0.106
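
The absolute numbers in the summary table hide how unevenly the language model helps across test sets. As a rough illustration, the snippet below computes the relative WER reduction from the plain model to the 4-gram setup, using values copied from the table; the helper name `relative_reduction` is hypothetical.

```python
# WER values copied from the summary table (no LM vs. 4-gram LM).
TEST_SETS = ["CETUC", "CV", "LaPS", "MLS", "SID", "TEDx", "VF"]
BASELINE = [0.052, 0.140, 0.074, 0.117, 0.121, 0.245, 0.118]
FOUR_GRAM = [0.030, 0.096, 0.043, 0.106, 0.118, 0.229, 0.117]


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed (positive means improvement)."""
    return (before - after) / before


for name, before, after in zip(TEST_SETS, BASELINE, FOUR_GRAM):
    gain = relative_reduction(before, after)
    print(f"{name:6s} {before:.3f} -> {after:.3f}  ({gain:.1%} relative reduction)")
```

On these figures the relative gain is largest on CETUC and smallest on SID and VoxForge, so the benefit of the language model depends noticeably on the test set.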

Transcription examples

Reference text	Model transcription
alguém sabe a que horas começa o jantar	alguém sabe a que horas começo jantar
lila covas ainda não sabe o que vai fazer no fundo	lilacovas ainda não sabe o que vai fazer no fundo
que tal um pouco desse bom spaghetti	quetá um pouco deste bom ispaguete
hong kong em cantonês significa porto perfumado	rongkong en cantones significa porto perfumado
vamos hackear esse problema	vamos rackar esse problema
apenas a poucos metros há uma estação de ônibus	apenas ha poucos metros á uma estação de ônibus
relâmpago e trovão sempre andam juntos	relampagotrevão sempre andam juntos

📄 License

This project is licensed under the Apache-2.0 license.
