# 🚀 Sharif-wav2vec2

Sharif-wav2vec2 is a fine-tuned wav2vec 2.0 model for Farsi speech recognition. It improves transcription accuracy through fine-tuning on Farsi data and decoding with a 5-gram language model.
## 🚀 Quick Start

When using the model, make sure your speech input is sampled at 16 kHz. Before use, you may need to install the following dependencies:
```bash
pip install pyctcdecode
pip install pypi-kenlm
```
## ✨ Features

- Fine-tuned on 108 hours of Farsi samples from Common Voice, at a 16 kHz sampling rate.
- Uses a 5-gram language model trained with the KenLM toolkit to improve online ASR accuracy (see the sketch below).
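For context on how such a language model plugs into CTC decoding, here is a minimal sketch using pyctcdecode; the vocabulary list and the `5gram.arpa` path are hypothetical placeholders, not files shipped with this repository:

```python
# Minimal sketch: attach a KenLM n-gram model to a beam-search CTC decoder.
# The vocabulary and "5gram.arpa" below are hypothetical placeholders.
from pyctcdecode import build_ctcdecoder

vocab = ["<pad>", "<s>", "</s>", "<unk>", "|", "ا", "ب"]  # truncated example alphabet
decoder = build_ctcdecoder(
    labels=vocab,
    kenlm_model_path="5gram.arpa",  # trained KenLM 5-gram model (ARPA or binary)
)
# text = decoder.decode(logits)  # logits: np.ndarray of shape (time, vocab_size)
```

In practice, `Wav2Vec2ProcessorWithLM` (used in the examples below) wraps this kind of decoder for you.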
## 📦 Installation

Before using the model, install the following dependencies:
```bash
pip install pyctcdecode
pip install pypi-kenlm
```
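A quick sanity check that both decoding dependencies import correctly (a minimal sketch, nothing model-specific):

```python
# Verify that the decoding dependencies are importable
import pyctcdecode  # beam-search CTC decoder
import kenlm        # KenLM n-gram bindings (installed via pypi-kenlm)

print("pyctcdecode and kenlm imported successfully")
```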
## 💻 Usage Examples

### Basic Usage

You can use the hosted inference API on the Hugging Face Hub (example clips from Common Voice are provided there); transcribing a given clip may take a while. Alternatively, use the code below for a local run:
```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

# Load the processor (feature extractor + tokenizer + LM decoder) and the model
processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# Load the audio file (must be sampled at 16 kHz)
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
speech_array = speech_array.squeeze().numpy()

# Extract input features
features = processor(
    speech_array,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True)

# Run inference and decode the logits with the 5-gram language model
with torch.no_grad():
    logits = model(
        features.input_values,
        attention_mask=features.attention_mask).logits

prediction = processor.batch_decode(logits.numpy()).text
print(prediction[0])
```
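If your audio is not already at 16 kHz, resample it before feeding it to the processor. A minimal sketch using torchaudio (the file path is a placeholder):

```python
import torchaudio

speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
if sampling_rate != 16000:
    # Resample to the 16 kHz rate the model expects
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
    speech_array = resampler(speech_array)
speech_array = speech_array.squeeze().numpy()
```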
## 📚 Documentation

### Evaluation

For evaluation, you can use the code below. Make sure your dataset is in the following form to avoid any conflicts:
| path | reference |
|------|-----------|
| path/to/audio_file.wav | "TRANSCRIPTION" |
Also, make sure you have installed jiwer (`pip install jiwer`) prior to running.
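If you need to build `dataset.eval.csv` yourself, here is a minimal sketch; the file names and transcriptions are hypothetical placeholders:

```python
import csv

# Hypothetical rows: replace with your own audio paths and reference transcriptions
rows = [
    ("clips/sample_0001.wav", "TRANSCRIPTION ONE"),
    ("clips/sample_0002.wav", "TRANSCRIPTION TWO"),
]

with open("dataset.eval.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["path", "reference"])  # header columns expected by the script below
    writer.writerows(rows)
```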
```python
import torch
import torchaudio
import librosa
import numpy as np
from datasets import load_dataset, load_metric  # note: newer datasets versions moved metrics to the `evaluate` library
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model = Wav2Vec2ForCTC.from_pretrained("SLPL/Sharif-wav2vec2")
processor = Wav2Vec2ProcessorWithLM.from_pretrained("SLPL/Sharif-wav2vec2")

def speech_file_to_array_fn(batch):
    # Load each audio file and resample it to the model's 16 kHz rate
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(
        np.asarray(speech_array),
        orig_sr=sampling_rate,
        target_sr=processor.feature_extractor.sampling_rate)
    batch["speech"] = speech_array
    return batch

def predict(batch):
    # Extract features, run the model, and decode with the 5-gram language model
    features = processor(
        batch["speech"],
        sampling_rate=processor.feature_extractor.sampling_rate,
        return_tensors="pt",
        padding=True
    )
    with torch.no_grad():
        logits = model(
            features.input_values,
            attention_mask=features.attention_mask).logits
    batch["prediction"] = processor.batch_decode(logits.numpy()).text
    return batch

dataset = load_dataset(
    "csv",
    data_files={"test": "dataset.eval.csv"},
    delimiter=",")["test"]
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict, batched=True, batch_size=4)

wer = load_metric("wer")
print("WER: {:.2f}".format(wer.compute(
    predictions=result["prediction"],
    references=result["reference"])))
```
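For intuition, WER is the number of word substitutions, insertions, and deletions divided by the number of reference words. A minimal sanity check of the metric with jiwer, on made-up strings:

```python
import jiwer

# One substitution out of three reference words -> WER = 1/3
print(jiwer.wer("the cat sat", "the cat sit"))  # ~0.33
```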
Result (WER) on Common Voice 6.1 (clean test set): 6.0
## 📄 License
This project is licensed under the MIT license.
## 📖 Citation

If you want to cite this model, you can use this:
?
## 👏 Contributions
Thanks to @sarasadeghii and @sadrasabouri for adding this model.
## 📋 Model Information

| Property | Details |
|----------|---------|
| Model Type | Sharif-wav2vec2, fine-tuned for Farsi |
| Training Data | 108 hours of Common Voice Farsi samples at a 16 kHz sampling rate |
| Tags | audio, automatic-speech-recognition |
| Datasets | common_voice_6_1 |
## 🎯 Widget Examples

- Common Voice Sample 1 (audio clip)
- Common Voice Sample 2 (audio clip)
## 📊 Model Index
- Model Name: Sharif-wav2vec2
- Task: Automatic Speech Recognition
- Dataset: Common Voice Corpus 6.1 (clean)
- Metrics: Test WER = 6.0