Wav2vec2 Brazilian Portuguese Model Open-sourced - Free for Automatic Speech Recognition Tasks

Wav2vec2 Large Xlsr Open Brazilian Portuguese V2

Developed by lgris

This is a Wav2vec2 model optimized for Brazilian Portuguese, trained on multiple open datasets for automatic speech recognition tasks.

Speech Recognition

Transformers

License: Apache-2.0
Tags: Brazilian Portuguese ASR, Multi-dataset training, Low WER

Downloads 1,825

Release Time : 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) model based on the Wav2vec2 architecture, specifically fine-tuned for Brazilian Portuguese. It integrates multiple publicly available Brazilian Portuguese speech datasets and can convert Portuguese speech into text.

Model Features

Multi-dataset training

Integrates multiple Brazilian Portuguese datasets including CETUC, MLS, VoxForge, Common Voice, and Lapsbm, improving the model's generalization capability.

High performance

Achieves a word error rate (WER) of 10.69% on the Common Voice test set.

Open license

Released under the Apache 2.0 license, allowing for commercial and research use.

Model Capabilities

Brazilian Portuguese speech recognition

Speech-to-text

Expects 16 kHz input; audio recorded at other sampling rates must be resampled first, as the Common Voice test code does for 48 kHz clips

Use Cases

Speech transcription

Meeting minutes

Automatically transcribe Brazilian Portuguese meeting recordings into text records.

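Likely to perform best on clear, read-style or formal speech, which resembles most of the training data; out-of-domain audio is harder (the TEDx test reports a WER of 34.53%).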

Subtitle generation

Automatically generate subtitles for Brazilian Portuguese video content.

High accuracy on clear speech.

Voice assistants

Portuguese voice command recognition

Used as a foundational speech recognition component for Brazilian Portuguese voice assistants.

Suitable for command and control scenarios.

🚀 Wav2vec 2.0 With Open Brazilian Portuguese Datasets v2

This project demonstrates a fine-tuned Wav2vec 2.0 model for Brazilian Portuguese, trained on multiple datasets to improve automatic speech recognition for the language.

✨ Features

Diverse Datasets: Utilizes multiple Brazilian Portuguese speech datasets, including CETUC, Multilingual Librispeech (MLS), VoxForge, Common Voice 6.1, and Lapsbm.
Fine-Tuned Model: The original Wav2vec model is fine-tuned using the fairseq library.
Performance Metrics: Evaluates the model's performance on different datasets with the Word Error Rate (WER) metric.
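
WER is the word-level edit distance between the reference and the prediction, divided by the number of reference words. The following is a minimal sketch for a single sentence pair, not the implementation behind the `wer` metric used in the tests; the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, chosen only to illustrate the calculation:
# one of five reference words is missing from the hypothesis.
example = word_error_rate("o gato subiu no telhado", "o gato subiu telhado")
print(f"{example:.2f}")  # prints 0.20
```

Note that corpus-level WER, as reported for the test sets, divides the total number of errors by the total number of reference words, so it is not simply the average of per-sentence values.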

📦 Installation

Install Dependencies

%%capture
!pip install datasets
!pip install jiwer
!pip install torchaudio
!pip install transformers
!pip install soundfile

💻 Usage Examples

Imports and Dependencies

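import re

import torch
import torchaudio
# load_metric was removed in datasets 3.0; on newer versions, the separate
# `evaluate` package provides evaluate.load("wer") as the replacement.
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)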

Preparation

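# Raw string avoids invalid escape sequences; none of these characters
# needs escaping inside a character class.
chars_to_ignore_regex = r'[,?.!;:"]'
wer = load_metric("wer")
# Fall back to the CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"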

model_name = 'lgris/wav2vec2-large-xlsr-open-brazilian-portuguese-v2'
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)

def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["predicted"] = [pred.lower() for pred in batch["predicted"]]
    batch["target"] = batch["sentence"]
    return batch
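
Conceptually, taking the argmax over the logits and then decoding amounts to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that collapse step on a toy example; the vocabulary, the blank id of 0, and the function name are assumptions for illustration only, since the real vocabulary ships with the model's processor.

```python
from itertools import groupby


def greedy_ctc_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids: merge repeats, then drop blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "o", 2: "i"}
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
decoded = "".join(vocab[i] for i in greedy_ctc_collapse(frames))
print(decoded)  # prints oi
```

A blank between two identical tokens keeps them separate, which is how doubled letters survive the merge step: `[1, 0, 1]` collapses to `[1, 1]`.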

Tests

Test against Common Voice (In-domain)

dataset = load_dataset("common_voice", "pt", split="test", data_dir="./cv-corpus-6.1-2020-12-11")

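# Common Voice clips are resampled from 48 kHz to the 16 kHz rate the model expects.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    # Calling the module directly is the idiomatic form of resampler.forward(...).
    batch["speech"] = resampler(speech.squeeze(0)).numpy()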
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = dataset.map(map_to_array)
result = ds.map(map_to_pred, batched=True, batch_size=1, remove_columns=list(ds.features.keys()))
print(wer.compute(predictions=result["predicted"], references=result["target"]))
for pred, target in zip(result["predicted"][:10], result["target"][:10]):
    print(pred, "|", target)

Result: 10.69%
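
The reference transcripts are normalized before scoring: the listed punctuation is removed, the text is lowercased, and typographic apostrophes are replaced. Since the predictions are lowercased as well, the comparison does not penalize casing or punctuation. A standalone sketch of that normalization step follows; the function name and the sample sentences are hypothetical.

```python
import re

# Same character class as the test code, written as a raw string.
CHARS_TO_IGNORE = r'[,?.!;:"]'


def normalize_sentence(sentence: str) -> str:
    """Strip the listed punctuation, lowercase, and unify apostrophes."""
    cleaned = re.sub(CHARS_TO_IGNORE, "", sentence)
    # \u2019 is the typographic (curly) apostrophe.
    return cleaned.lower().replace("\u2019", "'")


print(normalize_sentence("Olá, tudo bem?"))  # prints olá tudo bem
```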

Test against TEDx (Out-of-domain)

!gdown --id 1HJEnvthaGYwcV_whHEywgH2daIN4bQna
!tar -xf tedx.tar.gz

dataset = load_dataset('csv', data_files={'test': 'test.csv'})['test']

def map_to_array(batch):
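    speech, _ = torchaudio.load(batch["path"])
    # No resampling here: the TEDx files are assumed to already be 16 kHz,
    # matching resampler.new_freq below.
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = resampler.new_freq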
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = dataset.map(map_to_array)
result = ds.map(map_to_pred, batched=True, batch_size=1, remove_columns=list(ds.features.keys()))
print(wer.compute(predictions=result["predicted"], references=result["target"]))
for pred, target in zip(result["predicted"][:10], result["target"][:10]):
    print(pred, "|", target)

Result: 34.53%

📚 Documentation

Datasets

CETUC: Contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
Multilingual Librispeech (MLS): A massive dataset available in many languages, based on public-domain audiobook recordings such as LibriVox. The Portuguese set used in this work (mostly Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
VoxForge: A project aiming to build open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates varying from 16 kHz to 44.1 kHz.
Common Voice 6.1: A project proposed by the Mozilla Foundation to create a wide-open dataset in different languages for training ASR models. Volunteers donate and validate speech using the official site. The Portuguese set (mostly Brazilian variant) used in this work is the 6.1 version (pt_63h_2020-12-11), containing about 50 validated hours and 1,120 unique speakers.
[Lapsbm](https://github.com/falabrasil/gitlab-resources): "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, totalling 700 utterances in Brazilian Portuguese. The audio was recorded at 22.05 kHz without environment control.
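
As a quick sanity check on the figures above, the stated durations and the Lapsbm utterance count can be tallied directly. This is plain arithmetic on the numbers quoted in the descriptions, not a statement of how many hours were actually used for training; the variable names are hypothetical.

```python
# Approximate hours copied from the dataset descriptions. VoxForge and
# Lapsbm are left out because no durations are stated for them.
stated_hours = {"CETUC": 145, "MLS": 284, "Common Voice 6.1": 50}
print(sum(stated_hours.values()))  # prints 479

# Lapsbm: 35 speakers, each pronouncing 20 unique sentences.
lapsbm_speakers, sentences_per_speaker = 35, 20
print(lapsbm_speakers * sentences_per_speaker)  # prints 700
```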

Model Information

Model Name: wav2vec2-large-xlsr-open-brazilian-portuguese-v2
Task: Automatic Speech Recognition
Dataset: Common Voice (pt)
Metric: Test WER = 10.69%

📄 License

This project is licensed under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).

⚠️ Important Note

The Common Voice test reports a WER of about 10% (10.69%). However, this model was trained on all validated Common Voice instances except those in the test set, so some speakers in the training set may also appear in the test set, and the reported figure may be optimistic for unseen speakers.
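
One way to gauge this overlap is to compare the speaker identifiers of the two splits. The sketch below assumes each row carries a speaker id under a `client_id` key, as Common Voice metadata does; the function name and the toy rows are hypothetical, and this check is not part of the original evaluation.

```python
def shared_speakers(train_rows, test_rows, speaker_key="client_id"):
    """Return the set of speaker ids that occur in both splits."""
    train_ids = {row[speaker_key] for row in train_rows}
    test_ids = {row[speaker_key] for row in test_rows}
    return train_ids & test_ids


# Hypothetical rows standing in for the real train and test metadata.
train = [{"client_id": "spk_a"}, {"client_id": "spk_b"}]
test = [{"client_id": "spk_b"}, {"client_id": "spk_c"}]
print(sorted(shared_speakers(train, test)))  # prints ['spk_b']
```

A non-empty result would indicate that the test WER partly reflects speakers the model has already heard.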
