HIYACCENT_Wav2Vec2 Open-source Speech Recognition Model - Accurately Recognize Nigerian English Accent

HIYACCENT Wav2Vec2

Developed by codeceejay

HIYACCENT is a speech recognition system optimized for Nigerian-accented English. It builds on the Wav2Vec2 architecture and reports a performance improvement of over 20% relative to the baseline model.

Speech Recognition

Transformers

#Nigerian English Recognition #Accent Adaptation #Wav2Vec2 Fine-tuning

Downloads 27

Release Time : 3/2/2022

Model Overview

This model adds new network layers to Facebook's Wav2Vec architecture to capture the differences between the baseline model and Nigerian English speech. A CTC loss function on top of the model makes speech-to-text alignment more flexible. The model was developed specifically for Nigerian English speakers whose pronunciation is strongly influenced by their mother tongues.

Model Features

Nigerian Accent Optimization

Specifically optimized for the pronunciation characteristics of Nigerian English speakers, achieving over 20% recognition performance improvement.

Enhanced Wav2Vec2 Architecture

Additional network layers are added to the standard Wav2vec architecture to better capture pronunciation differences between Nigerian English and standard English.

CTC Loss Function

Incorporates a CTC loss function at the top layer to enhance speech-text alignment flexibility.

Model Capabilities

Nigerian-accented English speech recognition

16kHz sampling rate speech processing

Use Cases

Speech Transcription

Nigerian English Speech Transcription

Accurately transcribes Nigerian English speakers' speech into text

Over 20% performance improvement compared to standard models

Voice Assistants

Nigerian Accent Voice Interaction

Provides more accurate voice assistant interaction experiences for Nigerian users

🚀 HIYACCENT: An Improved Nigerian-Accented Speech Recognition System Based on Contrastive Learning

This research aims to develop a more robust model for Nigerian English speakers, whose English pronunciation is significantly influenced by their mother tongues. To this end, it proposes the Wav2Vec-HIYACCENT model, which adds a new layer to Facebook's Wav2Vec to capture the differences between the baseline model and Nigerian English speech. A CTC loss is also placed on top of the model, which makes speech-to-text alignment more flexible. The result is a performance improvement of over 20% for NAE.T.

The facebook/wav2vec2-large model was fine-tuned on English using the UISpeech Corpus. When using this model, ensure that your speech input is sampled at 16kHz.
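The 16kHz requirement can be checked before transcription. The following is a minimal sketch, not part of the original model card: it reads the sample rate from a WAV header using Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical, and the approach assumes uncompressed PCM WAV input (it does not cover MP3 or other formats).

```python
import wave

# Sample rate the model card says the speech input must use.
TARGET_SAMPLE_RATE = 16_000


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

A file flagged by `needs_resampling` can be resampled at load time, for example with `librosa.load(path, sr=16_000)`.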

The training script can be found here: https://github.com/amceejay/HIYACCENT-NE-Speech-Recognition-System

🚀 Quick Start

✨ Features

The Wav2Vec-HIYACCENT model captures the disparity between the baseline model and Nigerian English speech by adding a new layer to Facebook's Wav2Vec.
Incorporates a CTC loss on top of the model to make speech-to-text alignment more flexible.
Achieves over 20% performance improvement for NAE.T.

📦 Installation

No specific installation steps are provided in the original document.

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

Using the ASRecognition library

from asrecognition import ASREngine

asr = ASREngine("en", model_path="codeceejay/HIYACCENT_Wav2Vec2")

audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = asr.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "codeceejay/HIYACCENT_Wav2Vec2"
SAMPLES = 10

# You can use common_voice or timit; Nigerian-accented speech can also be found here: https://openslr.org/70/
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
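To turn the printed reference/prediction pairs into a number, a common choice is word error rate (WER). The sketch below is an assumption on our part: the source does not state which metric underlies the reported 20% improvement, and the function name `word_error_rate` is hypothetical. It computes the word-level edit distance divided by the number of reference words, upper-casing both sides to match the upper-cased references in the script above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.upper().split()
    hyp_words = hypothesis.upper().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

Averaging `word_error_rate(test_dataset[i]["sentence"], predicted_sentence)` over the loop gives a rough per-utterance score; a corpus-level WER would instead sum the edit distances and divide by the total number of reference words.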

📚 Documentation

The model is a fine-tuned version of facebook/wav2vec2-large, trained on English using the UISpeech Corpus. When using this model, make sure that your speech input is sampled at 16kHz.

🔧 Technical Details

The Wav2Vec-HIYACCENT model was proposed for Nigerian English speakers whose English pronunciation is affected by their mother tongues. A new layer is added to Facebook's Wav2Vec to capture the disparity between the baseline model and Nigerian English speech. A CTC loss is also inserted on top of the model to add flexibility to the speech-to-text alignment, resulting in over 20% performance improvement for NAE.T.
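The alignment flexibility that CTC provides comes from letting the model emit one label per audio frame, including a special blank, and then collapsing that sequence into text. The sketch below illustrates the standard greedy collapse rule in plain Python; it is not the author's implementation. The blank symbol `"_"` and the function name `ctc_collapse` are hypothetical, since real vocabularies define their own blank token.

```python
from itertools import groupby

# Hypothetical blank symbol used only for this illustration.
BLANK = "_"


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame CTC labels: merge repeats, then drop blanks."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which only serve to separate repeated labels.
    return "".join(label for label in merged if label != blank)


# Two different frame-level alignments collapse to the same transcript.
print(ctc_collapse(list("HH_E_LL_LOO")))  # HELLO
print(ctc_collapse(list("H_EE_L_L_O")))   # HELLO
```

Because many frame-level alignments map to the same text, the model does not need frame-accurate labels during training. The `torch.argmax` and `processor.batch_decode` steps in the inference script apply roughly this rule to the model's per-frame predictions.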

📄 License

No license information is provided in the original document.
