# 🚀 Sharif-wav2vec2

Sharif-wav2vec2 is a fine-tuned wav2vec 2.0 model for Farsi speech recognition. It improves transcription accuracy through fine-tuning on Farsi data and decoding with a 5-gram language model.
## 🚀 Quick Start

When using the model, make sure your speech input is sampled at 16 kHz. Before use, you may need to install the following dependencies:
```bash
pip install pyctcdecode
pip install pypi-kenlm
```
## ✨ Features

- Fine-tuned on 108 hours of Farsi samples from Common Voice, at a 16 kHz sampling rate.
- Uses a 5-gram language model trained with the KenLM toolkit to improve online ASR accuracy (see the sketch below).
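For context on how such a language model plugs into CTC decoding, here is a minimal sketch using pyctcdecode; the vocabulary list and the `5gram.arpa` path are hypothetical placeholders, not files shipped with this repository:

```python
# Minimal sketch: attach a KenLM n-gram model to a beam-search CTC decoder.
# The vocabulary and "5gram.arpa" below are hypothetical placeholders.
from pyctcdecode import build_ctcdecoder

vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", "ا", "ب"]  # truncated example alphabet
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="5gram.arpa",  # trained KenLM 5-gram model (ARPA or binary)
)
# text = decoder.decode(logits)  # logits: np.ndarray of shape (time, vocab_size)
```

In practice, `Wav2Vec2ProcessorWithLM` (used in the examples below) wraps this kind of decoder for you.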
## 📦 Installation

Before using the model, install the following dependencies:
```bash
pip install pyctcdecode
pip install pypi-kenlm
```
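A quick sanity check that both decoding dependencies import correctly (a minimal sketch, nothing model-specific):

```python
# Verify that the decoding dependencies are importable
import pyctcdecode  # beam-search CTC decoder
import kenlm        # KenLM n-gram bindings (installed via pypi-kenlm)

print("pyctcdecode and kenlm imported successfully")
```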
## 💻 Usage Examples

### Basic Usage

You can use the hosted inference API on the Hugging Face Hub (example clips from Common Voice are provided there); transcribing a given clip may take a while. Alternatively, use the code below for a local run:
```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

# Load the processor (feature extractor + tokenizer + LM decoder) and the model
processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# Load the audio file (must be sampled at 16 kHz)
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
speech_array = speech_array.squeeze().numpy()

# Extract input features
features = processor(
    speech_array,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True)

# Run inference and decode the logits with the 5-gram language model
with torch.no_grad():
    logits = model(
        features.input_values,
        attention_mask=features.attention_mask).logits

prediction = processor.batch_decode(logits.numpy()).text
print(prediction[0])
```
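If your audio is not already at 16 kHz, resample it before feeding it to the processor. A minimal sketch using torchaudio (the file path is a placeholder):

```python
import torchaudio

speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
if sampling_rate != 16000:
    # Resample to the 16 kHz rate the model expects
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)
speech_array = speech_array.squeeze().numpy()
```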
## 📚 Documentation

### Evaluation

For evaluation, you can use the code below. Make sure your dataset is in the following form to avoid any conflicts:
| path | reference |
|------|-----------|
| path/to/audio_file.wav | "TRANSCRIPTION" |
Also, make sure you have installed jiwer (`pip install jiwer`) prior to running.
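If you need to build `dataset.eval.csv` yourself, here is a minimal sketch; the file names and transcriptions are hypothetical placeholders:

```python
import csv

# Hypothetical rows: replace with your own audio paths and reference transcriptions
rows = [
    ("clips/sample_0001.wav", "TRANSCRIPTION ONE"),
    ("clips/sample_0002.wav", "TRANSCRIPTION TWO"),
]

with open("dataset.eval.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "reference"])  # header columns expected by the script below
    writer.writerows(rows)
```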
```python
import torch
import torchaudio
import librosa
import numpy as np
from datasets import load_dataset, load_metric  # note: newer datasets versions moved metrics to the `evaluate` library
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model = Wav2Vec2ForCTC.from_pretrained("SLPL/Sharif-wav2vec2")
processor = Wav2Vec2ProcessorWithLM.from_pretrained("SLPL/Sharif-wav2vec2")

def speech_file_to_array_fn(batch):
    # Load each audio file and resample it to the model's 16 kHz rate
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(
        np.asarray(speech_array),
        orig_sr=sampling_rate,
        target_sr=processor.feature_extractor.sampling_rate)
    batch["speech"] = speech_array
    return batch

def predict(batch):
    # Extract features, run the model, and decode with the 5-gram language model
    features = processor(
        batch["speech"],
        sampling_rate=processor.feature_extractor.sampling_rate,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        logits = model(
            features.input_values,
            attention_mask=features.attention_mask).logits
    batch["prediction"] = processor.batch_decode(logits.numpy()).text
    return batch

dataset = load_dataset(
    "csv",
    data_files={"test": "dataset.eval.csv"},
    delimiter=",")["test"]
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict, batched=True, batch_size=4)

wer = load_metric("wer")
print("WER: {:.2f}".format(wer.compute(
    predictions=result["prediction"],
    references=result["reference"])))
```
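For intuition, WER is the number of word substitutions, insertions, and deletions divided by the number of reference words. A minimal sanity check of the metric with jiwer, on made-up strings:

```python
import jiwer

# One substitution out of three reference words -> WER = 1/3
print(jiwer.wer("the cat sat", "the cat sit"))  # ~0.33
```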
Result (WER) on Common Voice 6.1 (clean test set): 6.0
## 📄 License
This project is licensed under the MIT license.
## 📖 Citation

If you want to cite this model, you can use this:
?
## 👏 Contributions
Thanks to @sarasadeghii and @sadrasabouri for adding this model.
## 📋 Model Information

| Property | Details |
|----------|---------|
| Model Type | Sharif-wav2vec2, fine-tuned for Farsi |
| Training Data | 108 hours of Common Voice Farsi samples at a 16 kHz sampling rate |
| Tags | audio, automatic-speech-recognition |
| Datasets | common_voice_6_1 |
## 🎯 Widget Examples

- Common Voice Sample 1 (audio clip)
- Common Voice Sample 2 (audio clip)
## 📊 Model Index
- Model Name: Sharif-wav2vec2
- Task: Automatic Speech Recognition
- Dataset: Common Voice Corpus 6.1 (clean)
- Metrics: Test WER = 6.0