Wav2vec2-large-xlsr-arabic Open-source Speech Recognition Model

Developed by mohammed

A Wav2Vec2-Large-XLSR-53 model fine-tuned for Arabic speech recognition, trained on the Common Voice and Arabic Speech Corpus datasets

Speech Recognition, Arabic

Open Source License: Apache-2.0

#Arabic Speech Recognition #XLSR-53 Fine-tuning #No Language Model Dependency

Release Time : 3/2/2022

Model Overview

This model is an Arabic automatic speech recognition (ASR) model fine-tuned from facebook/wav2vec2-large-xlsr-53. It expects speech input sampled at 16 kHz.

Model Features

Arabic Language Optimization

Specially fine-tuned for Arabic speech characteristics, handling diacritics and special characters

No Language Model Required

Can be used directly for speech recognition without additional language model support

Multi-dataset Training

Trained on both Common Voice and Arabic Speech Corpus datasets to improve generalization capability
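
The evaluation script later in this card removes diacritics and punctuation from the reference transcripts before scoring. As a rough illustration of what diacritic removal means, here is a minimal sketch that uses the standard library's `unicodedata` instead of the card's replacement dictionary. This is a swapped-in technique, not the author's method, and the helper name `strip_diacritics` is hypothetical.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character (kashida); not a combining mark

def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn) and tatweel from text."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )

# A word written with short-vowel marks, given as escapes so each mark is visible
vocalized = "\u0645\u064e\u0631\u0652\u062d\u064e\u0628\u064b\u0627"
print(strip_diacritics(vocalized))
```

The text is deliberately not decomposed (no NFD normalization) before filtering, so precomposed letters such as alef with madda are left unchanged.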

Model Capabilities

Arabic Speech Recognition

Audio to Text Conversion

16kHz Sampling Rate Processing

Use Cases

Speech Transcription

Arabic Speech to Text

Convert Arabic speech content into text transcripts

Test set Word Error Rate 36.699%

Voice Assistants

Arabic Voice Command Recognition

Basic recognition function for Arabic voice assistants

🚀 Fine-tuned Wav2Vec2-Large-XLSR-53 Model for Arabic Speech Recognition

This project presents a fine-tuned facebook/wav2vec2-large-xlsr-53 model for Arabic speech recognition. It uses the train splits of Common Voice and Arabic Speech Corpus. Ensure your speech input is sampled at 16kHz when using this model.
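
Since the model expects 16 kHz input, it can help to check a file's sample rate before transcribing it. The following is a minimal sketch using only the standard library's `wave` module, so it applies to uncompressed WAV files only (Common Voice clips are distributed as MP3 and would need a different reader). The helper names are hypothetical.

```python
import io
import wave

TARGET_RATE = 16_000  # sample rate the model card asks for

def wav_sample_rate(source) -> int:
    """Return the sample rate stored in a WAV header (path or file object)."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()

def needs_resampling(source, target_rate: int = TARGET_RATE) -> bool:
    """True when the WAV file's rate differs from the target rate."""
    return wav_sample_rate(source) != target_rate

def make_silent_wav(rate: int, n_frames: int = 160) -> io.BytesIO:
    """Build a tiny mono 16-bit WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    buffer.seek(0)
    return buffer

print(needs_resampling(make_silent_wav(48_000)))
print(needs_resampling(make_silent_wav(16_000)))
```

A file flagged by `needs_resampling` would then go through a resampler such as `torchaudio.transforms.Resample`, as in the usage example in this card.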

🚀 Quick Start

✨ Features

Fine-tuned on Arabic datasets for improved speech recognition in Arabic.
Can be used directly without a language model.

📦 Installation

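To use this model, install the following dependencies. The commands below use Jupyter/Colab notebook syntax (`%%capture` suppresses cell output and `!` runs a shell command); in a regular terminal, run the pip commands without the leading `!`: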

%%capture
!pip install datasets
!pip install transformers==4.4.0
!pip install torchaudio
!pip install jiwer
!pip install tnkeeh

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ar", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("mohammed/wav2vec2-large-xlsr-arabic")
model = Wav2Vec2ForCTC.from_pretrained("mohammed/wav2vec2-large-xlsr-arabic")

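# The resampler assumes 48 kHz source audio; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# Read each audio file into a 1-D array resampled to 16 kHz.
def speech_file_to_array_fn(batch):
    # sampling_rate is not checked here; the resampler above assumes 48 kHz input
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch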

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("The predicted sentence is:", processor.batch_decode(predicted_ids))
print("The original sentence is:", test_dataset["sentence"][:2])

The output is:

The predicted sentence is: ['ألديك قلم', 'ليست نارك مكسافة على هذه الأرض أبعد من يوم أمس']
The original sentence is: ['ألديك قلم ؟', 'ليست هناك مسافة على هذه الأرض أبعد من يوم أمس.']
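
The usage example takes the `argmax` over the logits and passes the resulting IDs to `processor.batch_decode`. Conceptually, greedy CTC decoding collapses consecutive repeated IDs, drops the blank token, and joins the remaining tokens. The sketch below illustrates that idea in plain Python. It is not the library's implementation: the toy vocabulary is hypothetical, and the blank ID and the `|` word delimiter are assumptions made for the example.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join tokens."""
    # Merge runs of identical IDs into a single ID
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Remove the blank token, which only separates repeated characters
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # Turn the word delimiter into spaces
    return "".join(tokens).replace(word_delimiter, " ").strip()

# Hypothetical toy vocabulary; the real one ships with the processor
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b", 4: "c"}
frame_ids = [2, 2, 0, 2, 3, 3, 1, 1, 4, 0, 0]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # aab c
```

The blank between the two runs of `a` is what lets the decoder keep a genuine double letter instead of merging it into one.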

🔧 Technical Details

Model Type: Fine-tuned Wav2Vec2-Large-XLSR-53
Training Data: train splits of Common Voice and Arabic Speech Corpus

📚 Documentation

Evaluation

The model can be evaluated as follows on the Arabic test data of Common Voice:

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re
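# Map diacritics, punctuation, and other unwanted strings to their replacements.
# Note: the name `dict` shadows the Python built-in within this script.
dict = {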
'ِ': '',
'ُ': '', 
'ٓ': '', 
'ٰ': '', 
'ْ': '', 
'ٌ': '', 
'ٍ': '', 
'ً': '', 
'ّ': '', 
'َ': '',
'~': '',
',': '',
'ـ': '',
'—': '',
'.': '',
'!': '',
'-': '',
';': '',
':': '',
'\'': '',
'"': '',
'☭': '',
'«': '',
'»': '',
'؛': '',
'ـ': '',
'_': '',
'،': '',
'“': '',
'%': '',
'‘': '',
'”': '',
'�': '',
'_': '',
',': '',
'?': '',
'#': '',
'‘': '',
'.': '',
'؛': '',
'get': '',
'؟': '',
'  ': ' ',
'\'ۖ ': '',
'\'': '',
 '\'ۚ' : '',
 ' \'': '',
 '31': '', 
 '24': '',
 '39': ''
} 

# Replace every dictionary key found in a sentence with its mapped value in one regex pass
def remove_special_characters(batch):
    # Build a single alternation pattern from the escaped dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))
    # For each match, look up the corresponding replacement in the dictionary
    batch["sentence"] = regex.sub(lambda mo: dict[mo.group(0)], batch["sentence"])
    return batch
  

test_dataset = load_dataset("common_voice", "ar", split="test") 
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("mohammed/wav2vec2-large-xlsr-arabic") 
model = Wav2Vec2ForCTC.from_pretrained("mohammed/wav2vec2-large-xlsr-arabic")
model.to("cuda")


resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
test_dataset = test_dataset.map(remove_special_characters)
# Run batched inference on the GPU and decode the predictions to text
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.3f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 36.699%
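
For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference and the prediction, divided by the number of reference words. The sketch below is a plain-Python illustration, not the `wer` metric used in the script above; the function name is hypothetical. It scores the second sentence pair from the usage example, with the trailing period removed from the reference by hand.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)

reference = "ليست هناك مسافة على هذه الأرض أبعد من يوم أمس"
hypothesis = "ليست نارك مكسافة على هذه الأرض أبعد من يوم أمس"
print(word_error_rate(reference, hypothesis))  # 2 wrong words out of 10 -> 0.2
```

Note that this scores a single sentence. A corpus-level figure such as the one reported above is normally computed by pooling errors and reference words over the whole test set, so it is not the average of per-sentence scores.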

Future Work

One can use data augmentation, transliteration, or attention_mask to increase the accuracy.
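
The card does not say which augmentation was intended. As one common possibility, the sketch below adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio. It is an assumption for illustration only, written in plain Python with a hypothetical `add_noise` helper and a synthetic tone standing in for real speech.

```python
import math
import random

def add_noise(samples, snr_db, rng=None):
    """Return a copy of samples with white Gaussian noise at the given SNR (dB)."""
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, noise_std) for s in samples]

# One second of a 440 Hz tone at 16 kHz as stand-in audio
rate = 16_000
clean = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
noisy = add_noise(clean, snr_db=20, rng=random.Random(0))
print(len(noisy))
```

In a training pipeline, a transform like this would be applied to the training split only, with the SNR drawn at random per clip.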

📄 License

This project is licensed under the Apache-2.0 license.
