# wav2vec2-large-mms-1b-wolof
This model is a fine-tuned version of facebook/mms-1b-all on the Isma/alffa_wolof dataset, designed for automatic speech recognition in the Wolof language.
## 🚀 Quick Start
The quickest way to try the model is the `transformers` pipeline; the sections below also show the lower-level `Wav2Vec2ForCTC` API. A minimal sketch (the audio path is a placeholder for your own 16 kHz Wolof recording):
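```python
from transformers import pipeline

# "path/to/wolof_audio.wav" is a placeholder; decoding a file path
# requires ffmpeg to be installed on the system.
pipe = pipeline("automatic-speech-recognition",
                model="bilalfaye/wav2vec2-large-mms-1b-wolof")
print(pipe("path/to/wolof_audio.wav")["text"])
```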
## ✨ Features
- Based on the Wav2Vec 2.0 architecture, fine-tuned for speech recognition tasks.
- Specifically trained on the Waxal Wolof dataset to handle the phonetic characteristics of Wolof speech.
## 📦 Installation
The model runs on the Hugging Face `transformers` stack; the examples below also use `datasets`, `torch`, and `psutil`. The original card lists no explicit installation steps, so the following is a typical setup:
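```bash
pip install transformers "datasets[audio]" torch psutil
```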
## 💻 Usage Examples
### Basic Usage
```python
!pip install datasets

# Load the Waxal Wolof dataset used in the examples.
from datasets import load_dataset

dataset = load_dataset("perrynelson/waxal-wolof", trust_remote_code=True)
dataset

# Play one sample in a notebook.
from IPython.display import Audio, display

Audio(dataset['train'][322]['audio']['array'], rate=16000)
```
```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# float16 needs a GPU; fall back to float32 on CPU.
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = Wav2Vec2ForCTC.from_pretrained(
    model_id,
    target_lang="wol",
    torch_dtype=torch_dtype,
).to(device)

processor = Wav2Vec2Processor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("wol")

# Prepare one 16 kHz sample from the dataset.
input_dict = processor(
    dataset['train'][322]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding=True,
)
input_values = input_dict.input_values.to(device, dtype=torch_dtype)

# Greedy CTC decoding.
with torch.no_grad():
    logits = model(input_values).logits
pred_ids = torch.argmax(logits, dim=-1)[0]

print("Prediction:")
print(processor.decode(pred_ids))
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
```
### Advanced Usage
```python
import torch
from transformers import pipeline

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"
device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The pipeline loads the processor from the model repository automatically.
pipe = pipeline(
    task="automatic-speech-recognition",
    model=model_id,
    device=device,
    torch_dtype=torch_dtype,
    framework="pt",
)

audio_array = dataset['train'][322]["audio"]["array"]
result = pipe(audio_array)

print("Prediction:")
print(result['text'])
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
```
### Free memory
```python
import gc
import torch
import psutil

# Release cached GPU memory and collect unreachable Python objects.
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

gc.collect()

if torch.cuda.is_available():
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()} bytes")
    print(f"GPU Memory Cached: {torch.cuda.memory_reserved()} bytes")
else:
    print(f"CPU Memory Usage: {psutil.virtual_memory().percent}%")
```
## 📚 Documentation
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 20
- mixed_precision_training: Native AMP
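For readers who want to reproduce the setup, here is a minimal sketch of how these values might map onto `transformers` `TrainingArguments` (an assumed reconstruction; the original training script is not part of this card):

```python
from transformers import TrainingArguments

# Assumed reconstruction of the configuration listed above;
# "mms-1b-wolof-ft" is a placeholder output directory.
training_args = TrainingArguments(
    output_dir="mms-1b-wolof-ft",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 are the Trainer defaults.
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=20,
    fp16=True,  # Native AMP mixed precision
)
```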
### Training results

| Training Loss | Epoch | Step  | Validation Loss | WER    |
|:-------------:|:-----:|:-----:|:---------------:|:------:|
| 0.3793        | 14.0  | 12250 | 0.1517          | 0.1888 |
| 0.3709        | 15.0  | 13125 | 0.1512          | 0.1882 |
| 0.3702        | 16.0  | 14000 | 0.1499          | 0.1858 |
| 0.367         | 17.0  | 14875 | 0.1492          | 0.1848 |
| 0.3656        | 18.0  | 15750 | 0.1493          | 0.1842 |
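The WER column is the word error rate on the validation set (lower is better; 0.1842 ≈ 18.4%). A minimal sketch of computing WER with the `evaluate` library (assumed tooling, not the exact evaluation script used for these numbers):

```python
import evaluate

wer_metric = evaluate.load("wer")

# Placeholder lists: fill with pipeline outputs and the dataset's
# lower-cased `transcription` field.
predictions = ["model transcription"]
references = ["reference transcription"]

print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.4f}")
```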
### Framework versions
- Transformers 4.41.2
- Pytorch 2.4.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1
## 🔧 Technical Details
This model is based on the Wav2Vec 2.0 architecture, fine-tuned for speech recognition tasks. The base model, facebook/mms-1b-all, was trained on a multilingual corpus for general-purpose ASR. This fine-tuned version was trained on Wolof speech data, the Waxal Wolof and Isma/alffa_wolof audio datasets, to improve accuracy on the specific phonetic characteristics of the Wolof language.
## 📄 License
This project is licensed under the MIT license.
## Intended uses & limitations
- Intended uses: This model is intended for speech-to-text tasks in Wolof. It can be used to transcribe audio recordings in Wolof into written text.
- Limitations: This model performs best with clean audio and may struggle with noisy or low-quality recordings. It is designed specifically for the Wolof language and may not work well with other languages.
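Because the model expects clean 16 kHz audio, recordings at other sample rates should be resampled before transcription. A minimal sketch using `torchaudio` (an assumed extra dependency; `my_recording.wav` is a placeholder path):

```python
import torchaudio
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition",
                model="bilalfaye/wav2vec2-large-mms-1b-wolof")

# Load a local file and resample to the 16 kHz rate the model expects.
waveform, sr = torchaudio.load("my_recording.wav")  # placeholder path
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

audio = waveform.mean(dim=0).numpy()  # downmix to mono
print(pipe(audio)["text"])
```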
## Author Information