# wav2vec2-large-mms-1b-wolof
This model is a fine-tuned version of facebook/mms-1b-all on the Isma/alffa_wolof dataset, designed for automatic speech recognition in the Wolof language.
## 🚀 Quick Start
The quickest way to try the model is the `transformers` pipeline; the sections below also show the lower-level `Wav2Vec2ForCTC` API. A minimal sketch (the audio path is a placeholder for your own 16 kHz Wolof recording):
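```python
from transformers import pipeline

# "path/to/wolof_audio.wav" is a placeholder; decoding a file path
# requires ffmpeg to be installed on the system.
pipe = pipeline("automatic-speech-recognition",
                model="bilalfaye/wav2vec2-large-mms-1b-wolof")
print(pipe("path/to/wolof_audio.wav")["text"])
```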
## ✨ Features
- Based on the Wav2Vec 2.0 architecture, fine-tuned for speech recognition tasks.
- Specifically trained on the Waxal Wolof dataset to handle the phonetic characteristics of Wolof speech.
## 📦 Installation
The model runs on the Hugging Face `transformers` stack; the examples below also use `datasets`, `torch`, and `psutil`. The original card lists no explicit installation steps, so the following is a typical setup:
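```bash
pip install transformers "datasets[audio]" torch psutil
```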
## 💻 Usage Examples
### Basic Usage
```python
!pip install datasets

# Load the Waxal Wolof dataset used in the examples.
from datasets import load_dataset

dataset = load_dataset("perrynelson/waxal-wolof", trust_remote_code=True)
dataset

# Play one sample in a notebook.
from IPython.display import Audio, display

Audio(dataset['train'][322]['audio']['array'], rate=16000)
```
```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# float16 needs a GPU; fall back to float32 on CPU.
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = Wav2Vec2ForCTC.from_pretrained(
    model_id,
    target_lang="wol",
    torch_dtype=torch_dtype,
).to(device)

processor = Wav2Vec2Processor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("wol")

# Prepare one 16 kHz sample from the dataset.
input_dict = processor(
    dataset['train'][322]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding=True,
)
input_values = input_dict.input_values.to(device, dtype=torch_dtype)

# Greedy CTC decoding.
with torch.no_grad():
    logits = model(input_values).logits
pred_ids = torch.argmax(logits, dim=-1)[0]

print("Prediction:")
print(processor.decode(pred_ids))
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
```
### Advanced Usage
```python
import torch
from transformers import pipeline

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"
device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The pipeline loads the processor from the model repository automatically.
pipe = pipeline(
    task="automatic-speech-recognition",
    model=model_id,
    device=device,
    torch_dtype=torch_dtype,
    framework="pt",
)

audio_array = dataset['train'][322]["audio"]["array"]
result = pipe(audio_array)

print("Prediction:")
print(result['text'])
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
```
### Free memory
```python
import gc
import torch
import psutil

# Release cached GPU memory and collect unreachable Python objects.
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

gc.collect()

if torch.cuda.is_available():
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()} bytes")
    print(f"GPU Memory Cached: {torch.cuda.memory_reserved()} bytes")
else:
    print(f"CPU Memory Usage: {psutil.virtual_memory().percent}%")
```
## 📚 Documentation
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 20
- mixed_precision_training: Native AMP
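For readers who want to reproduce the setup, here is a minimal sketch of how these values might map onto `transformers` `TrainingArguments` (an assumed reconstruction; the original training script is not part of this card):

```python
from transformers import TrainingArguments

# Assumed reconstruction of the configuration listed above;
# "mms-1b-wolof-ft" is a placeholder output directory.
training_args = TrainingArguments(
    output_dir="mms-1b-wolof-ft",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 are the Trainer defaults.
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=20,
    fp16=True,  # Native AMP mixed precision
)
```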
### Training results

| Training Loss | Epoch | Step  | Validation Loss | WER    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 0.3793        | 14.0  | 12250 | 0.1517          | 0.1888 |
| 0.3709        | 15.0  | 13125 | 0.1512          | 0.1882 |
| 0.3702        | 16.0  | 14000 | 0.1499          | 0.1858 |
| 0.367         | 17.0  | 14875 | 0.1492          | 0.1848 |
| 0.3656        | 18.0  | 15750 | 0.1493          | 0.1842 |
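The WER column is the word error rate on the validation set (lower is better; 0.1842 ≈ 18.4%). A minimal sketch of computing WER with the `evaluate` library (assumed tooling, not the exact evaluation script used for these numbers):

```python
import evaluate

wer_metric = evaluate.load("wer")

# Placeholder lists: fill with pipeline outputs and the dataset's
# lower-cased `transcription` field.
predictions = ["model transcription"]
references = ["reference transcription"]

print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.4f}")
```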
### Framework versions
- Transformers 4.41.2
- Pytorch 2.4.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1
## 🔧 Technical Details
This model is based on the Wav2Vec 2.0 architecture, fine-tuned for speech recognition tasks. The base model, facebook/mms-1b-all, was trained on a multilingual corpus for general-purpose ASR. This fine-tuned version was trained on Wolof speech data, the Waxal Wolof and Isma/alffa_wolof audio datasets, to improve accuracy on the specific phonetic characteristics of the Wolof language.
## 📄 License
This project is licensed under the MIT license.
## Intended uses & limitations
- Intended uses: This model is intended for speech-to-text tasks in Wolof. It can be used to transcribe audio recordings in Wolof into written text.
- Limitations: This model performs best with clean audio and may struggle with noisy or low-quality recordings. It is designed specifically for the Wolof language and may not work well with other languages.
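Because the model expects clean 16 kHz audio, recordings at other sample rates should be resampled before transcription. A minimal sketch using `torchaudio` (an assumed extra dependency; `my_recording.wav` is a placeholder path):

```python
import torchaudio
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition",
                model="bilalfaye/wav2vec2-large-mms-1b-wolof")

# Load a local file and resample to the 16 kHz rate the model expects.
waveform, sr = torchaudio.load("my_recording.wav")  # placeholder path
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

audio = waveform.mean(dim=0).numpy()  # downmix to mono
print(pipe(audio)["text"])
```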
## Author Information