wav2vec2-large-xlsr-arabic Open-source Automatic Speech Recognition Model - Empowering Precise Recognition of Arabic Speech

Wav2vec2 Large Xlsr Arabic

Developed by kmfoda

An automatic speech recognition model fine-tuned on the Arabic Common Voice dataset based on facebook/wav2vec2-large-xlsr-53

Speech Recognition · Arabic

Open Source License: Apache-2.0

#Arabic speech recognition #XLSR fine-tuning #16kHz sampling rate

Downloads 19

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model optimized for Arabic, capable of converting Arabic speech into text.

Model Features

Arabic Optimization

Specially fine-tuned for Arabic speech recognition tasks

Multi-Sampling Rate Support

The model itself expects 16kHz input; the usage examples resample the 48kHz, 44.1kHz, and 32kHz clips found in the Common Voice test split down to 16kHz before inference

No Language Model Required

Can be used directly without additional language model support

Model Capabilities

Arabic speech recognition

Speech-to-text

Multi-sampling rate audio processing (after resampling to 16kHz)

Use Cases

Speech Transcription

Arabic Speech Transcription

Convert Arabic speech content into text

WER of 46.77 on the Common Voice test set

Voice Assistants

Arabic Voice Command Recognition

Used for command recognition in Arabic voice assistant systems

🚀 Wav2Vec2-Large-XLSR-53-Arabic

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Arabic using the Common Voice dataset. It is designed for automatic speech recognition in Arabic.

📋 Model Information

Property	Details
Model Type	Wav2Vec2-Large-XLSR-53 fine-tuned for Arabic
Training Data	Common Voice `train` and `validation` datasets for Arabic
Metrics	Word Error Rate (WER)
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0

📊 Model Index

Name: XLSR Wav2Vec2 Arabic by Othmane Rifki
Results:
- Task:
  - Name: Speech Recognition
  - Type: automatic-speech-recognition
- Dataset:
  - Name: Common Voice ar
  - Type: common_voice
  - Args: ar
- Metrics:
  - Name: Test WER
  - Type: wer
  - Value: 46.77
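
The Test WER reported above is a word error rate: the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal sketch of that calculation, assuming whitespace tokenization and corpus-level aggregation. The function names are hypothetical and are not part of the `datasets` library.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (single-row dynamic programming)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra predicted word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word edits over all sentence pairs, divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate(["a b c d"], ["a x c"]))  # 0.5
```

Read this way, a WER of 46.77 corresponds to roughly 47 word errors per 100 reference words.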

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
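
Before running inference it can help to check what rate a recording was actually stored at. The sketch below uses only the Python standard library and assumes the input is an uncompressed PCM WAV file; Common Voice clips are distributed as MP3 and would need a decoder such as torchaudio instead. The helper names are hypothetical.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate this model card asks for


def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != expected_rate
```

A file for which `needs_resampling` returns `True` should be resampled to 16kHz, for example with `torchaudio.transforms.Resample`, before it is passed to the processor.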

✨ Features

Fine-tuned on Arabic data from Common Voice.
Can be used for automatic speech recognition in Arabic without a language model.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Arabic Common Voice test split.
test_dataset = load_dataset("common_voice", "ar", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("kmfoda/wav2vec2-large-xlsr-arabic")
model = Wav2Vec2ForCTC.from_pretrained("kmfoda/wav2vec2-large-xlsr-arabic")

# The model expects 16kHz audio, so build one resampler per source rate.
resamplers = {  # all three sampling rates exist in test split
    48000: torchaudio.transforms.Resample(48000, 16000),
    44100: torchaudio.transforms.Resample(44100, 16000),
    32000: torchaudio.transforms.Resample(32000, 16000),
}

def prepare_example(example):
    # Read the audio file and resample it to 16kHz.
    # A clip with any other sampling rate raises a KeyError here.
    speech, sampling_rate = torchaudio.load(example["path"])
    example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
    return example

test_dataset = test_dataset.map(prepare_example)

# Pad the first two clips to a common length and build the attention mask.
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token at every frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
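
The `torch.argmax` call followed by `processor.batch_decode` in the example above is greedy CTC decoding, which is why no language model is needed. Conceptually, the decoder collapses consecutive repeated token ids, drops the blank token, and turns the word delimiter into spaces. The sketch below illustrates that idea in plain Python. The toy vocabulary and the `blank_id=0` default are assumptions for illustration only and do not reflect this model's real vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(token_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeats, drop blanks, and map the word delimiter to a space."""
    # 1. Merge runs of identical ids: [2, 2, 0, 2] -> [2, 0, 2]
    collapsed = [token_id for token_id, _ in groupby(token_ids)]
    # 2. Remove the CTC blank, which only separates genuine repeats.
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Join characters and convert delimiters into word boundaries.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical four-symbol vocabulary, one id per audio frame.
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
frame_ids = [2, 2, 0, 2, 3, 3, 1, 1, 3, 0, 0]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # aab b
```

The blank between the two runs of `2` is what lets the decoder emit a doubled "a" instead of merging both runs into one.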

📚 Documentation

Evaluation

The model can be evaluated as follows on the Arabic test data of Common Voice.

import re

import torch
import torchaudio
# load_metric was removed in datasets 3.0; on newer versions use evaluate.load("wer") instead.
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ar", split="test")
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("kmfoda/wav2vec2-large-xlsr-arabic")
model = Wav2Vec2ForCTC.from_pretrained("kmfoda/wav2vec2-large-xlsr-arabic")
model.to("cuda")

# Punctuation pattern; note that it is defined here but never applied below.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\؟\_\؛\ـ\—]'

resamplers = {  # all three sampling rates exist in test split
    48000: torchaudio.transforms.Resample(48000, 16000),
    44100: torchaudio.transforms.Resample(44100, 16000),
    32000: torchaudio.transforms.Resample(32000, 16000),
}

# Preprocessing the datasets.
# We need to read the audio files as arrays and resample them to 16kHz.
def prepare_example(example):
    speech, sampling_rate = torchaudio.load(example["path"])
    example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
    return example

test_dataset = test_dataset.map(prepare_example)

# Run the model on a batch and store the greedy-decoded transcriptions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 52.53
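
The evaluation script defines `chars_to_ignore_regex` but does not apply it to the reference sentences, so punctuation in the references counts against the model. The sketch below shows one way the pattern could be applied before computing WER. The source does not state whether the reported results used such a step, so treat this as an assumption; `normalize_sentence` is a hypothetical helper.

```python
import re

# Same character set as chars_to_ignore_regex in the evaluation script.
CHARS_TO_IGNORE = re.compile(r'[\,\?\.\!\-\;\:\"\“\؟\_\؛\ـ\—]')


def normalize_sentence(sentence):
    """Strip the ignored punctuation and collapse runs of whitespace."""
    cleaned = CHARS_TO_IGNORE.sub("", sentence)
    return " ".join(cleaned.split())
```

It could be wired into the evaluation by setting `example["sentence"] = normalize_sentence(example["sentence"])` inside `prepare_example`. The pattern does not include the Arabic comma (،), so that character would remain in the references.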

Training

The Common Voice train and validation datasets were used for training.

The script used for training can be found here

📄 License

This model is released under the apache-2.0 license.
