wav2vec2-large-xlsr-53-sw Open Source Model - Automatic Swahili Speech Recognition for 16kHz Audio

Wav2vec2 Large Xlsr 53 Sw

Developed by alokmatta

Swahili automatic speech recognition model fine-tuned from the XLSR-53 large model; expects audio input sampled at 16kHz

Speech Recognition | Other | Open Source License: Apache-2.0 | #Swahili speech recognition #Low-resource speech processing #XLSR fine-tuned model

Downloads 158

Release Time: 3/2/2022

Model Overview

This automatic speech recognition (ASR) model is Facebook's wav2vec2-large-xlsr-53 fine-tuned on Swahili datasets. It converts Swahili speech to text.

Model Features

Multi-dataset Fine-tuning

Fine-tuned on three Swahili datasets (ALFFA, Gamayun, and IWSLT) to improve recognition accuracy

16kHz Sampling Rate Support

Optimized specifically for 16kHz sampling rate audio input

No Language Model Required

Can be used directly without additional language model support
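
Without a language model, the transcription is typically obtained by greedy CTC decoding: take the most likely token for each audio frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The function name, the toy vocabulary, and the choice of 0 as the blank id are assumptions for illustration only; the real model ships its own vocabulary and the processor handles this step for you.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blank tokens."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the prediction changes and is not blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary (not the model's real one); id 0 plays the blank.
toy_vocab = {0: "<pad>", 1: "j", 2: "a", 3: "m", 4: "b", 5: "o"}
# One predicted token id per audio frame, as argmax over the logits would give.
frames = [1, 1, 0, 2, 2, 3, 0, 4, 4, 5, 5, 0]
print(ctc_greedy_decode(frames, toy_vocab))  # prints "jambo"
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters.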

Model Capabilities

Swahili speech recognition

Speech-to-text

Automatic speech transcription

Use Cases

Speech Transcription

Swahili Speech Transcription

Convert Swahili speech content into text format

Word Error Rate (WER) of 40% on the test set

Voice Assistants

Swahili Voice Interaction

Provide speech recognition capability for Swahili voice assistants

🚀 Wav2Vec2-Large-XLSR-53-Swahili

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for automatic speech recognition in Swahili.

🚀 Quick Start

Fine-tuning used the following datasets: ALFFA, Gamayun, and IWSLT.

When using this model, ensure that your speech input is sampled at 16kHz.
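
One way to check the 16kHz requirement before transcription is to read the sampling rate from the file header. The sketch below uses only Python's standard `wave` module and therefore assumes uncompressed PCM WAV input. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model or any library.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != expected_rate
```

For example, `needs_resampling("./demo.wav")` would return `True` for a 48kHz recording, signalling that the audio should be resampled to 16kHz before it is passed to the model.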

✨ Features

Datasets: Uses the ALFFA, Gamayun, and IWSLT datasets for fine-tuning.
Metrics: Evaluated using Word Error Rate (WER).
Task: Specialized for automatic speech recognition in Swahili.

Property	Details
Model Type	Swahili XLSR-53 Wav2Vec2.0 Large
Training Data	ALFFA, Gamayun, IWSLT

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Wav2Vec2Processor.from_pretrained("alokmatta/wav2vec2-large-xlsr-53-sw")
model = Wav2Vec2ForCTC.from_pretrained("alokmatta/wav2vec2-large-xlsr-53-sw").to(device)
model.eval()

TARGET_SAMPLE_RATE = 16_000  # the model expects 16kHz input


def load_file_to_data(file):
    # Load the audio, downmix to mono, and resample from the file's own rate to 16kHz.
    speech, sample_rate = torchaudio.load(file)
    speech = speech.mean(dim=0)
    speech = torchaudio.functional.resample(speech, sample_rate, TARGET_SAMPLE_RATE)
    return {"speech": speech.numpy(), "sampling_rate": TARGET_SAMPLE_RATE}


def predict(data):
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    # Greedy decoding: pick the most likely token for each frame.
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)

print(predict(load_file_to_data('./demo.wav')))

Test Result: 40% WER
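
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal, self-contained illustration. The function name and the example sentences are made up for this purpose, and it is not the evaluation script behind the reported figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the processed ref prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Illustrative sentences only: one substituted word and one dropped word
# against a five-word reference.
reference = "habari ya asubuhi rafiki yangu"
hypothesis = "habari za asubuhi rafiki"
print(f"{word_error_rate(reference, hypothesis):.0%}")  # prints "40%"
```

At a WER of 40%, roughly two out of every five reference words need correcting, so downstream applications may want to post-process or review the transcripts.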

📚 Documentation

The script used for training can be found here

📄 License

This model is licensed under the Apache-2.0 license.
