Sinai Voice Arabic Speech Recognition Model
A fine-tuned model for Arabic speech recognition, converting Arabic speech into text.
Quick Start
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the Arabic (ar) subset of the mozilla-foundation/common_voice_8_0 dataset. It achieves the following results on the evaluation set:
- Loss: 0.2141
- Wer: 0.1808
- eval_loss = 0.2141
- eval_samples = 10388
- eval_wer = 0.181
- eval_cer = 0.049
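For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are conventionally defined: the edit distance between reference and hypothesis, divided by the reference length in words or characters. It is an illustration only, not the code in eval.py; the function names are hypothetical, and any text normalization or corpus-level aggregation that eval.py applies is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one sentence pair (hypothetical helper)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one sentence pair (hypothetical helper)."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word and one dropped word out of six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Read this way, a WER of 0.181 corresponds to roughly 18 word-level errors per 100 reference words, and a CER of 0.049 to roughly 5 character-level errors per 100 reference characters.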
Evaluation Commands
- To evaluate on mozilla-foundation/common_voice_8_0 with split test:
python eval.py --model_id bakrianoo/sinai-voice-ar-stt --dataset mozilla-foundation/common_voice_8_0 --config ar --split test
Usage Examples
Basic Usage
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


def speech_file_to_array_fn(voice_path, resampling_to=16000):
    # Load the audio file and resample it to the rate the model expects.
    speech_array, sampling_rate = torchaudio.load(voice_path)
    resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)
    # Keep the first channel only and report the rate after resampling.
    return resampler(speech_array)[0].numpy(), resampling_to


cp = "bakrianoo/sinai-voice-ar-stt"
processor = Wav2Vec2Processor.from_pretrained(cp)
model = Wav2Vec2ForCTC.from_pretrained(cp)

sound_path = './my_voice.mp3'
sample, sr = speech_file_to_array_fn(sound_path)
inputs = processor([sample], sampling_rate=sr, return_tensors="pt", padding=True)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy decoding: pick the most likely token per frame, then decode to text.
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
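The last two lines above perform greedy CTC decoding. As a rough illustration of what happens to the per-frame argmax ids before they become text, the sketch below collapses consecutive repeats and then drops the blank token. The function name, the blank id of 0, and the toy vocabulary are assumptions for illustration; the real blank id and vocabulary come from the model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks.

    blank_id=0 is an assumed value for this sketch, not the model's actual id.
    """
    # Step 1: merge runs of identical consecutive ids into one id.
    collapsed = [token for token, _ in groupby(frame_ids)]
    # Step 2: drop the blank symbol that separates genuine repeats.
    return [token for token in collapsed if token != blank_id]


# Toy vocabulary (hypothetical): id 0 is the blank.
toy_vocab = {3: "a", 5: "b"}
frames = [0, 3, 3, 0, 3, 5, 5, 0]
token_ids = ctc_greedy_collapse(frames)
print("".join(toy_vocab[t] for t in token_ids))
```

The blank between the two runs of id 3 is what lets the decoder emit the same character twice in a row; without it, both runs would merge into one.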
Technical Details
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 32
- eval_batch_size: 10
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 256
- total_eval_batch_size: 80
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 10
- mixed_precision_training: Native AMP
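Two of the values above follow from the others, which a short sketch can make concrete: the total batch sizes are the per-device sizes multiplied by the 8 devices, and a linear scheduler with warmup ramps the learning rate up to 0.0002 over the first 1000 steps and then decays it linearly to zero. The function names are hypothetical, and the total of 15625 steps is an assumption inferred from the training results (1000 steps per 0.64 epoch, over 10 epochs), not a value stated in the hyperparameter list.

```python
def total_batch_size(per_device, num_devices, grad_accum_steps=1):
    """Effective batch size across devices (hypothetical helper)."""
    return per_device * num_devices * grad_accum_steps


def linear_lr(step, base_lr=2e-4, warmup_steps=1000, total_steps=15625):
    """Learning rate under linear warmup followed by linear decay.

    total_steps=15625 is an assumed value inferred from the results table.
    """
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


print(total_batch_size(32, 8))  # train: 256
print(total_batch_size(10, 8))  # eval: 80
for step in (0, 500, 1000, 8000, 15625):
    print(step, linear_lr(step))
```

Under these assumptions the peak learning rate is reached at step 1000, which is also the first row of the training results.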
Training results
| Training Loss | Epoch | Step | Validation Loss | Wer |
|---|---|---|---|---|
| 1.354 | 0.64 | 1000 | 0.4109 | 0.4493 |
| 0.5886 | 1.28 | 2000 | 0.2798 | 0.3099 |
| 0.4977 | 1.92 | 3000 | 0.2387 | 0.2673 |
| 0.4253 | 2.56 | 4000 | 0.2266 | 0.2523 |
| 0.3942 | 3.2 | 5000 | 0.2171 | 0.2437 |
| 0.3619 | 3.84 | 6000 | 0.2076 | 0.2253 |
| 0.3245 | 4.48 | 7000 | 0.2088 | 0.2186 |
| 0.308 | 5.12 | 8000 | 0.2086 | 0.2206 |
| 0.2881 | 5.76 | 9000 | 0.2089 | 0.2105 |
| 0.2557 | 6.4 | 10000 | 0.2015 | 0.2004 |
| 0.248 | 7.04 | 11000 | 0.2044 | 0.1953 |
| 0.2251 | 7.68 | 12000 | 0.2058 | 0.1932 |
| 0.2052 | 8.32 | 13000 | 0.2117 | 0.1878 |
| 0.1976 | 8.96 | 14000 | 0.2104 | 0.1825 |
| 0.1845 | 9.6 | 15000 | 0.2156 | 0.1821 |
Framework versions
- Transformers 4.16.2
- Pytorch 1.10.2+cu113
- Datasets 1.18.3
- Tokenizers 0.11.0
License
This model is licensed under the Apache-2.0 license.
Documentation
Model Information
| Property | Details |
|---|---|
| Model Type | Sinai Voice Arabic Speech Recognition Model |
| Training Data | mozilla-foundation/common_voice_8_0 |
| Metrics | wer, cer |
Model Results
The model has the following results:
- task: automatic-speech-recognition
- dataset: mozilla-foundation/common_voice_8_0 (Common Voice ar, args: ar)
  - metrics:
    - wer: 0.181 (Test WER)
    - cer: 0.049 (Test CER)
- dataset: speech-recognition-community-v2/dev_data (Robust Speech Event - Dev Data, args: ar)
- dataset: speech-recognition-community-v2/eval_data (Robust Speech Event - Test Data, args: ar)