Sinai-Voice-AR-STT Open-Source Speech Recognition Model - Free Deployment for Precise Arabic Speech Recognition

Sinai Voice Ar Stt

Developed by bakrianoo

An Arabic speech recognition model fine-tuned from facebook/wav2vec2-xls-r-300m on the Common Voice Arabic dataset

Speech Recognition

Transformers

Arabic

Open Source License: Apache-2.0

#Arabic speech recognition #Low word error rate #Common Voice dataset

Downloads 29

Release Time: 3/2/2022

Model Overview

This is an Arabic Automatic Speech Recognition (ASR) model capable of converting Arabic speech into text. The model was fine-tuned on the Common Voice Arabic dataset and is suitable for standard Arabic speech recognition tasks.

Model Features

High-performance Arabic recognition

Achieved 18.1% Word Error Rate (WER) and 4.9% Character Error Rate (CER) on the Common Voice Arabic test set

Based on large-scale pretrained model

Fine-tuned from facebook/wav2vec2-xls-r-300m, inheriting powerful speech feature extraction capabilities

Supports language model-free inference

Can perform speech recognition directly without additional language model support

Model Capabilities

Arabic speech recognition

Speech-to-text

Automatic speech recognition

Use Cases

Speech transcription

Arabic speech transcription

Convert Arabic speech content into text

18.1% word error rate on the Common Voice Arabic test set

Voice assistants

Arabic voice command recognition

Used for voice command recognition in Arabic voice assistant systems

🚀 Sinai Voice Arabic Speech Recognition Model

A fine-tuned model for Arabic speech recognition that converts Arabic speech into text.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the Arabic (ar) configuration of the mozilla-foundation/common_voice_8_0 dataset. It achieves the following results on the evaluation set:

Loss: 0.2141
Wer: 0.1808
eval_loss = 0.2141
eval_samples = 10388
eval_wer = 0.181
eval_cer = 0.049
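For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are commonly computed: an edit distance between the reference and the predicted transcript, divided by the reference length. This is an illustration only, not the evaluation script used for the numbers above; the function names and the English sample sentences are made up for the example, and counting spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits (spaces included,
    by assumption) divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical transcripts, for illustration only
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Under this definition, eval_wer = 0.181 means roughly 18 word-level edits per 100 reference words.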

Evaluation Commands

To evaluate on mozilla-foundation/common_voice_8_0 with the test split:

python eval.py --model_id bakrianoo/sinai-voice-ar-stt --dataset mozilla-foundation/common_voice_8_0 --config ar --split test

💻 Usage Examples

Basic Usage

from transformers import (Wav2Vec2Processor, Wav2Vec2ForCTC)
import torchaudio
import torch

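def speech_file_to_array_fn(voice_path, resampling_to=16000):
    # Load the audio file and resample it to the rate the model expects
    speech_array, sampling_rate = torchaudio.load(voice_path)
    resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)

    # Return the first channel as a NumPy array, with the rate it now has
    return resampler(speech_array)[0].numpy(), resampling_to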

# load the model
cp = "bakrianoo/sinai-voice-ar-stt"
processor = Wav2Vec2Processor.from_pretrained(cp)
model = Wav2Vec2ForCTC.from_pretrained(cp)

# recognize the text in a sample sound file
sound_path = './my_voice.mp3'

sample, sr = speech_file_to_array_fn(sound_path)
inputs = processor([sample], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
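The argmax followed by batch_decode in the example above amounts to greedy CTC decoding, which is why no separate language model is required. The sketch below illustrates the idea on a toy input: pick the best token per frame, merge consecutive repeats, then drop the blank token. It is a simplified illustration, not the tokenizer's actual implementation; the function name, the four-entry vocabulary, and the blank ID of 0 are assumptions made for the example. Wav2Vec2 tokenizers use the padding token as the CTC blank and "|" as the word delimiter.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame token IDs into text (greedy CTC decoding)."""
    # 1. Merge runs of identical consecutive IDs into a single ID
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks; a blank between two equal IDs keeps them as two letters
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Turn the word delimiter into a space
    return "".join(tokens).replace(word_delimiter, " ")


# Hypothetical vocabulary and per-frame argmax output
vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
frames = [2, 2, 0, 2, 3, 3, 1, 1, 3, 0]
print(ctc_greedy_decode(frames, vocab))
```

Note that the blank between the two runs of ID 2 is what lets the decoder emit a doubled letter rather than merging them into one.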

🔧 Technical Details

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 32
eval_batch_size: 10
seed: 42
distributed_type: multi-GPU
num_devices: 8
total_train_batch_size: 256
total_eval_batch_size: 80
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 10
mixed_precision_training: Native AMP
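Two of these settings are easier to see with a little arithmetic. The total train batch size is the per-device batch size times the number of devices, and a linear scheduler with warmup ramps the learning rate from zero to its peak over the warmup steps, then decays it linearly to zero. The sketch below is a plain-Python approximation of that behavior, not the training code itself. The function name is made up, and total_steps = 15,625 is an estimate inferred from the training log (step 15000 falls at epoch 9.6, so roughly 1,562.5 steps per epoch over 10 epochs).

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=1000, total_steps=15_625):
    """Learning rate at a given optimizer step: linear warmup, then linear decay.

    total_steps is an estimate, not a value stated in the model card.
    """
    if step < warmup_steps:
        # Warmup: ramp from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr at warmup_steps to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Effective batch size across all GPUs
per_device_batch = 32
num_devices = 8
effective_batch = per_device_batch * num_devices
print("effective batch size:", effective_batch)

for step in (0, 500, 1000, 8000, 15_625):
    print(step, f"{linear_schedule_lr(step):.6f}")
```

The product 32 × 8 matches the reported total_train_batch_size of 256, and 10 × 8 matches the total_eval_batch_size of 80.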

Training results

Training Loss	Epoch	Step	Validation Loss	Wer
1.354	0.64	1000	0.4109	0.4493
0.5886	1.28	2000	0.2798	0.3099
0.4977	1.92	3000	0.2387	0.2673
0.4253	2.56	4000	0.2266	0.2523
0.3942	3.2	5000	0.2171	0.2437
0.3619	3.84	6000	0.2076	0.2253
0.3245	4.48	7000	0.2088	0.2186
0.308	5.12	8000	0.2086	0.2206
0.2881	5.76	9000	0.2089	0.2105
0.2557	6.4	10000	0.2015	0.2004
0.248	7.04	11000	0.2044	0.1953
0.2251	7.68	12000	0.2058	0.1932
0.2052	8.32	13000	0.2117	0.1878
0.1976	8.96	14000	0.2104	0.1825
0.1845	9.6	15000	0.2156	0.1821
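One thing worth noticing in this log is that validation loss and WER do not bottom out at the same checkpoint. A short sketch to check this, using the step, validation loss, and WER columns copied from the table above (the variable names are illustrative):

```python
# (step, validation_loss, wer), copied from the training results table
rows = [
    (1000, 0.4109, 0.4493),
    (2000, 0.2798, 0.3099),
    (3000, 0.2387, 0.2673),
    (4000, 0.2266, 0.2523),
    (5000, 0.2171, 0.2437),
    (6000, 0.2076, 0.2253),
    (7000, 0.2088, 0.2186),
    (8000, 0.2086, 0.2206),
    (9000, 0.2089, 0.2105),
    (10000, 0.2015, 0.2004),
    (11000, 0.2044, 0.1953),
    (12000, 0.2058, 0.1932),
    (13000, 0.2117, 0.1878),
    (14000, 0.2104, 0.1825),
    (15000, 0.2156, 0.1821),
]

# Checkpoint with the lowest validation loss and the one with the lowest WER
best_loss_step, best_loss, _ = min(rows, key=lambda r: r[1])
best_wer_step, _, best_wer = min(rows, key=lambda r: r[2])

print(f"lowest validation loss: {best_loss} at step {best_loss_step}")
print(f"lowest WER: {best_wer} at step {best_wer_step}")
```

Validation loss is lowest at step 10000, while WER keeps improving through step 15000, so the two metrics would select different checkpoints.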

Framework versions

Transformers 4.16.2
Pytorch 1.10.2+cu113
Datasets 1.18.3
Tokenizers 0.11.0

📄 License

This model is licensed under the Apache-2.0 license.

📚 Documentation

Model Information

Property	Details
Model Type	Sinai Voice Arabic Speech Recognition Model
Training Data	mozilla-foundation/common_voice_8_0
Metrics	wer, cer

Model Results

The model has the following results:

task: automatic-speech-recognition
- dataset: mozilla-foundation/common_voice_8_0 (Common Voice ar, args: ar)
  - metrics:
    - wer: 0.181 (Test WER)
    - cer: 0.049 (Test CER)
- dataset: speech-recognition-community-v2/dev_data (Robust Speech Event - Dev Data, args: ar)
  - metrics:
    - wer: 93.03 (Test WER)
- dataset: speech-recognition-community-v2/eval_data (Robust Speech Event - Test Data, args: ar)
  - metrics:
    - wer: 90.79 (Test WER)
