wav2vec2-large-xlsr-egyptian Open-source Model - Egyptian Arabic Speech Recognition for 16kHz Audio

Wav2vec2 Large Xlsr Egyptian

Developed by othrif

An automatic speech recognition model for Egyptian Arabic, fine-tuned from facebook/wav2vec2-large-xlsr-53. It expects speech input sampled at 16kHz.

Speech Recognition

Transformers

Open Source License: Apache-2.0
Tags: #Egyptian Arabic ASR #XLSR fine-tuning #Low-resource speech recognition

Downloads 19

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) model for Egyptian Arabic, fine-tuned from the XLSR-53 checkpoint (facebook/wav2vec2-large-xlsr-53) and suited to Egyptian Arabic speech-to-text tasks.

Model Features

Optimized for Egyptian Arabic

Fine-tuned specifically on Egyptian Arabic speech so that it better captures the dialect's phonetic characteristics.

No Language Model Required

Transcribes directly from the model's CTC output, with no separate language model needed at decoding time, which simplifies deployment.
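In the usage examples for this model, decoding without a language model comes down to greedy CTC decoding: pick the highest-scoring token for each audio frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The three-symbol vocabulary, the `blank_id` value, and the function name `ctc_greedy_decode` are illustrative assumptions and are not taken from the model's actual tokenizer.

```python
from typing import Dict, List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    id_to_char: Dict[int, str],
    blank_id: int = 0,
) -> str:
    """Greedy CTC decoding over per-frame scores (toy illustration)."""
    # Step 1: pick the best-scoring token id for every frame.
    best_ids: List[int] = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]

    # Steps 2 and 3: collapse consecutive repeats, then skip blanks.
    chars: List[str] = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical vocabulary: id 0 is the CTC blank.
toy_vocab = {0: "<blank>", 1: "a", 2: "b"}
toy_scores = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two a's)
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.1, 0.8],    # b
    [0.1, 0.2, 0.7],    # b (repeat, merged)
]
decoded = ctc_greedy_decode(toy_scores, toy_vocab)
print(decoded)
```

The blank frame between the second and third `a` is what allows a doubled letter to survive the repeat-merging step.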

16kHz Sampling Rate Support

Expects speech input sampled at 16kHz. Audio recorded at another rate (for example 48kHz) should be resampled to 16kHz first, as the usage examples do.
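Before passing a recording to the model, it can help to check its sampling rate. The following is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed WAV files only. The helper name `needs_resampling` and the constant `TARGET_RATE` are hypothetical and not part of the model or of any library.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, in Hz


def needs_resampling(wav_path: str, target_rate: int = TARGET_RATE) -> bool:
    """Return True if the WAV file's sampling rate differs from target_rate."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

If the check returns True, resample the audio (for example with `torchaudio.transforms.Resample`, as in the usage examples) before running inference.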

Model Capabilities

Egyptian Arabic speech recognition

Speech-to-text

Automatic speech recognition

Use Cases

Speech Transcription

Egyptian Arabic Speech Transcription

Convert Egyptian Arabic speech content into text

Reported WER of 55.2% on the arabicspeech.org MGB-3 dataset

Voice Assistants

Egyptian Arabic Voice Command Recognition

For voice assistants and smart devices supporting Egyptian Arabic

🚀 Wav2Vec2-Large-XLSR-53-Egyptian-Arabic

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Egyptian Arabic, trained on the arabicspeech.org MGB-3 data. Make sure your speech input is sampled at 16kHz when using this model.

🚀 Quick Start

Load the processor and model from othrif/wav2vec2-large-xlsr-egyptian with the transformers library, as shown under Usage Examples, and make sure your speech input is sampled at 16kHz.

✨ Features

Language: Egyptian Arabic
Based on: facebook/wav2vec2-large-xlsr-53
Dataset: arabicspeech.org MGB-3

📦 Installation

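The original model card lists no installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so these packages need to be available in your environment.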

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ar", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("othrif/wav2vec2-large-xlsr-egyptian")
model = Wav2Vec2ForCTC.from_pretrained("othrif/wav2vec2-large-xlsr-egyptian")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
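The `padding=True` argument in the example above lets clips of different lengths share one batch, and the returned `attention_mask` tells the model which positions hold real audio. A rough sketch of the idea follows, assuming zero-padding on the right and a 1/0 mask. The helper name `pad_batch` is hypothetical, and the real processor may differ in detail.

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[int]]]:
    """Right-pad every sequence to the longest length and build a 1/0 mask."""
    max_len = max((len(seq) for seq in sequences), default=0)
    padded: List[List[float]] = []
    mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real sample, 0 = padding
    return padded, mask


padded, mask = pad_batch([[0.1, 0.2, 0.3], [0.5]])
print(padded)
print(mask)
```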

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "ar", split="test") 
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("othrif/wav2vec2-large-xlsr-egyptian") 
model = Wav2Vec2ForCTC.from_pretrained("othrif/wav2vec2-large-xlsr-egyptian")
model.to("cuda")

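# Characters stripped from the reference transcripts before scoring (mostly punctuation and symbols).
# Raw string so the backslashes reach the regex engine unchanged.
chars_to_ignore_regex = r'[\؛\—\_get\«\»\ـ\ـ\,\?\.\!\-\;\:\"\“\%\‘\”\�\#\،\☭,\؟]'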
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
	batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions to text.
def evaluate(batch):
	inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

	with torch.no_grad():
		logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

	pred_ids = torch.argmax(logits, dim=-1)
	batch["pred_strings"] = processor.batch_decode(pred_ids)
	return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
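The evaluation script strips punctuation from the reference transcripts and lowercases them before scoring, so that punctuation differences are not counted as word errors. The sketch below shows the same step in isolation. The shortened pattern `SIMPLE_IGNORE` and the helper name `normalize_transcript` are assumptions made for illustration; the pattern is not the model card's full `chars_to_ignore_regex`.

```python
import re

# Simplified, hypothetical subset of the punctuation removed before scoring.
SIMPLE_IGNORE = re.compile(r'[،؟؛,.!?;:"«»-]')


def normalize_transcript(text: str) -> str:
    """Remove the listed punctuation marks and lowercase the text."""
    return SIMPLE_IGNORE.sub("", text).lower()


print(normalize_transcript("مرحبا، كيف حالك؟"))
print(normalize_transcript("Hello, World!"))
```

Lowercasing has no effect on Arabic script, but it keeps any Latin-script words in the references consistent.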

📚 Documentation

Test Result

The model achieved a Word Error Rate (WER) of 55.2% on the test dataset.
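WER counts the word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. A score of 55.2, read as a percentage, therefore corresponds to roughly one error for every two reference words. The function below is a minimal single-sentence sketch of that definition, not the implementation behind the `wer` metric used in the evaluation script, which scores a whole test set at once. The name `word_error_rate` is hypothetical.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words: List[str] = reference.split()
    hyp_words: List[str] = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substitution (b -> x) and one deletion (d) over four reference words.
example_wer = word_error_rate("a b c d", "a x c")
print(f"WER: {100 * example_wer:.2f}")
```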

Training

The Common Voice train and validation datasets were used for training. The training script can be found here.

🔧 Technical Details

The model was obtained by fine-tuning facebook/wav2vec2-large-xlsr-53 on Egyptian Arabic data from arabicspeech.org MGB-3.

📄 License

This model is licensed under the Apache 2.0 license.

📊 Model Information

Property	Details
Model Type	Wav2Vec2-Large-XLSR-53-Egyptian-Arabic
Training Data	Common Voice `train`, `validation` datasets; arabicspeech.org MGB-3
Metrics	Word Error Rate (WER)
License	apache-2.0
Model Index	Name: XLSR Wav2Vec2 Egyptian Arabic by Othmane Rifki; Results: Speech Recognition task on arabicspeech.org MGB-3 dataset with a Test WER of 55.2
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
Datasets	https://arabicspeech.org/
