Wav2vec2 Large Xlsr Persian V3
An automatic speech recognition (ASR) model for Persian, fine-tuned from Facebook's wav2vec2-large-xlsr-53 model on the Persian Common Voice dataset.
Downloads 1,888
Release Time : 3/2/2022
Model Overview
This model is specifically designed for Persian (Farsi) speech recognition tasks, achieving high transcription accuracy through large-scale pre-training with XLSR architecture and fine-tuning on Persian data.
Model Features
Low Word Error Rate
Achieves a WER (Word Error Rate) of 10.36% on Persian test sets
Large-scale Pre-training
Based on the cross-lingual pre-trained model facebook/wav2vec2-large-xlsr-53
Specialized Data Fine-tuning
Fine-tuned using the Persian version of the Common Voice dataset
Model Capabilities
Persian Speech Recognition
16kHz Audio Processing
Long Speech Transcription
Use Cases
Speech Transcription
Persian Speech Transcription
Convert Persian speech content into text
WER of 10.36% on the test set (roughly 90% of words transcribed correctly)
Voice Assistants
Persian Voice Command Recognition
Provides core recognition capabilities for Persian voice assistants
🚀 Wav2Vec2-Large-XLSR-53-Persian V3
This project is a fine-tuned model for Persian (Farsi) speech recognition, built on the XLSR Wav2Vec2 architecture.
🚀 Quick Start
✨ Features
- Fine-tuned from the facebook/wav2vec2-large-xlsr-53 model for the Persian (Farsi) language.
- Trained on the Common Voice dataset.
- Achieves a Word Error Rate (WER) of 10.36% on the test set.
📦 Installation
Requirements
# Required packages
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
!pip install jiwer
!pip install parsivar
!pip install num2fawords
Normalizer
# Normalizer
!wget -O dictionary.py https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/raw/main/dictionary.py
!wget -O normalizer.py https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/raw/main/normalizer.py
Downloading data
wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/fa.tar.gz
tar -xzf fa.tar.gz
rm -rf fa.tar.gz
💻 Usage Examples
Basic Usage
This is facebook/wav2vec2-large-xlsr-53 fine-tuned on Persian (Farsi) using Common Voice. When using this model, make sure that your speech input is sampled at 16kHz.
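As a quick way to verify the 16kHz requirement before running inference, the following minimal sketch reads the sample rate from a file header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of this model's code. It applies only to uncompressed PCM WAV files; compressed formats such as MP3 need an audio library instead.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sample rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

Files for which `needs_resampling` returns `True` should be resampled to 16kHz before being passed to the processor.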
Cleaning
import os

import pandas as pd

from normalizer import normalizer


def cleaning(text):
    # Non-string cells (e.g. NaN) cannot be normalized, so mark them for removal
    if not isinstance(text, str):
        return None
    return normalizer({"sentence": text}, return_dict=False)


data_dir = "/content/cv-corpus-6.1-2020-12-11/fa"

# Common Voice metadata files are tab-separated
test = pd.read_csv(f"{data_dir}/test.tsv", sep="\t")
test["path"] = data_dir + "/clips/" + test["path"]
print(f"Step 0: {len(test)}")

# Keep only rows whose audio clip exists on disk
test["status"] = test["path"].apply(lambda path: True if os.path.exists(path) else None)
test = test.dropna(subset=["status"])
test = test.drop(columns="status")
print(f"Step 1: {len(test)}")

# Normalize the transcripts and drop rows that could not be cleaned
test["sentence"] = test["sentence"].apply(lambda t: cleaning(t))
test = test.dropna(subset=["sentence"])
print(f"Step 2: {len(test)}")

test = test.reset_index(drop=True)
print(test.head())

test = test[["path", "sentence"]]
test.to_csv("/content/test.csv", sep="\t", encoding="utf-8", index=False)
Prediction
import numpy as np
import pandas as pd

import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# Note: load_metric is deprecated in newer datasets releases;
# evaluate.load("wer") from the evaluate package is its replacement.
from datasets import load_dataset, load_metric

import IPython.display as ipd

model_name_or_path = "m3hrdadfi/wav2vec2-large-xlsr-persian-v3"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(model_name_or_path, device)

processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
model = Wav2Vec2ForCTC.from_pretrained(model_name_or_path).to(device)


def speech_file_to_array_fn(batch):
    # Load the clip and resample it to the rate the model expects (16kHz)
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(
        np.asarray(speech_array),
        orig_sr=sampling_rate,
        target_sr=processor.feature_extractor.sampling_rate,
    )

    batch["speech"] = speech_array
    return batch


def predict(batch):
    features = processor(
        batch["speech"],
        sampling_rate=processor.feature_extractor.sampling_rate,
        return_tensors="pt",
        padding=True
    )

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    # Greedy decoding: pick the most likely token at every frame
    pred_ids = torch.argmax(logits, dim=-1)

    batch["predicted"] = processor.batch_decode(pred_ids)
    return batch


dataset = load_dataset("csv", data_files={"test": "/content/test.csv"}, delimiter="\t")["test"]
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict, batched=True, batch_size=4)
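The `torch.argmax` followed by `processor.batch_decode` in `predict` amounts to greedy CTC decoding: take the best token per frame, collapse consecutive repeats, drop the blank token, and turn the word delimiter into a space. The sketch below illustrates that idea in plain Python. The function name, the toy vocabulary, and the choice of `0` as the blank id are assumptions made for illustration; the real tokenizer has its own vocabulary and special-token ids.

```python
def ctc_greedy_decode(token_ids, vocab, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join characters into text."""
    collapsed = []
    previous = None
    for token_id in token_ids:
        # A token is emitted only when it differs from the previous frame
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id

    # Blanks separate genuine repeats (e.g. "aa") and are removed afterwards
    chars = [vocab[t] for t in collapsed if t != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary: index 0 is the blank, "|" marks word boundaries
toy_vocab = ["<pad>", "|", "a", "b", "c"]
frame_ids = [2, 2, 0, 2, 3, 3, 1, 1, 4, 0]
print(ctc_greedy_decode(frame_ids, toy_vocab))
```

The blank between the two runs of `a` is what lets the decoder keep a doubled letter instead of merging it into one.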
WER Score
wer = load_metric("wer")
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["predicted"], references=result["sentence"])))
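For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal standard-library sketch of that definition, assuming whitespace tokenization and aggregation over the whole set; the function names are illustrative, and this is not the implementation behind the `wer` metric used above.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words (must be non-zero)."""
    total_edits = 0
    total_words = 0
    for reference, predicted in zip(references, predictions):
        ref_words, hyp_words = reference.split(), predicted.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words
```

Because the denominator counts reference words only, a transcript with many insertions can score above 100%.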
Output
max_items = np.random.randint(0, len(result), 20).tolist()
for i in max_items:
    reference, predicted = result["sentence"][i], result["predicted"][i]
    print("reference:", reference)
    print("predicted:", predicted)
    print('---')
reference: ماجرا رو براش تعریف کردم اون گفت مریم اگه میدونی پسر خوبیه خب چه اشکالی داره باهاش بیشتر اشنا بشو
predicted: ماجرا رو براش تعریف کردم اون گفت مریم اگه میدونی پسر خوبیه خب چه اشکالی داره باهاش بیشتر اشنا بشو
---
reference: بیا پایین تو اجازه نداری بری اون بالا
predicted: بیا پایین تو اجازه نداری بری اون بالا
---
reference: هر روز یک دو مداد کش می رفتتم تااین که تا پایان ترم از تمامی دوستانم مداد برداشته بودم
predicted: هر روز یک دو مداد کش می رفتم تااین که تا پایین ترم از تمامی دوستان و مداد برداشته بودم
---
reference: فکر میکنی آروم میشینه
predicted: فکر میکنی آروم میشینه
---
reference: هرکسی با گوشی هوشمند خود میتواند با کایلا متصل گردد در یک محدوده مکانی
predicted: هرکسی با گوشی هوشمند خود میتواند با کایلا متصل گردد در یک محدوده مکانی
---
reference: برو از مهرداد بپرس
predicted: برو از مهرداد بپرس
---
reference: می خواهم شما را با این قدمها آشنا کنم
predicted: می خواهم شما را با این قدمها آشنا کنم
---
reference: میدونم یه روز دوباره می تونم تو رو ببینم
predicted: میدونم یه روز دوباره می تونم تو رو ببینم
---
reference: بسیار خوب خواهد بود دعوت او را بپذیری
predicted: بسیار خوب خواهد بود دعوت او را بپذیری
---
reference: بهت بگن آشغالی خوبه
predicted: بهت بگن آشغالی خوبه
---
reference: چرا معاشرت با هم ایمانان ما را محفوظ نگه میدارد
predicted: چرا معاشرت با هم ایمانان آ را م حفوظ نگه میدارد
---
reference: بولیوی پس از گویان فقیرترین کشور آمریکای جنوبی است
predicted: بولیوی پس از گویان فقیرترین کشور آمریکای جنوبی است
---
reference: بعد از مدتی اینکار برایم عادی شد
predicted: بعد از مدتی اینکار برایم عادو شد
---
reference: به نظر اون هم همینطوره
predicted: به نظر اون هم همینطوره
---
reference: هیچ مایونز ی دارید
predicted: هیچ مایونز ی دارید
---
reference: هیچ یک از انان کاری به سنگ نداشتند
predicted: هیچ شک از انان کاری به سنگ نداشتند
---
reference: می خواهم کمی کتاب شعر ببینم
predicted: می خواهم کتاب شعر ببینم
---
reference: همین شوهر فهیمه مگه نمی گفتی فرمانده بوده کو
predicted: همین شوهر فهیمه بینامی گفتی فهمانده بود کو
---
reference: اون جاها کسی رو نمیبینی که تو دستش کتاب نباشه
predicted: اون جاها کسی رو نمیبینی که تو دستش کتاب نباشه
---
reference: زندان رفتن من در این سالهای اخیر برام شانس بزرگی بود که معما و مشکل چندین سالهام را حل کرد
predicted: زندان رفتن من در این سالها اخی براب شانس بزرگی بود که معما و مشکل چندین سالهام را حل کرد
---
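Spotting the differing words in pairs like these by eye can be slow. As a rough aid, the sketch below uses `difflib.SequenceMatcher` from the standard library to list only the mismatching word spans between a reference and a prediction. The helper name `word_diff` is illustrative, and the alignment `difflib` produces is a heuristic that may differ from the minimal-edit alignment a WER computation uses.

```python
from difflib import SequenceMatcher


def word_diff(reference, predicted):
    """Return (tag, reference_words, predicted_words) for each mismatch."""
    ref_words, hyp_words = reference.split(), predicted.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], hyp_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replace / delete / insert spans
    ]


# One of the sample pairs printed above
reference = "می خواهم کمی کتاب شعر ببینم"
predicted = "می خواهم کتاب شعر ببینم"
for tag, ref_span, hyp_span in word_diff(reference, predicted):
    print(tag, ref_span, hyp_span)
```

Each tuple's tag is one of `replace`, `delete`, or `insert`, described from the point of view of turning the reference into the prediction.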
📚 Documentation
Evaluation
Test Result:
- WER: 10.36%
📄 License
No license information provided in the original document.
Information Table
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned XLSR Wav2Vec2 for Persian (Farsi) |
| Training Data | Common Voice |