Wav2vec2 Large Xlsr 53 Persian

Developed by jonatasgrosman

XLSR-53 large model speech recognition system optimized for Persian, fine-tuned based on facebook/wav2vec2-large-xlsr-53 architecture

Speech Recognition OtherOpen Source License:Apache-2.0 #Persian speech recognition #XLSR-53 large model #Low character error rate

Downloads 257.76k

Release Time : 3/2/2022

Model Overview

This model is a Persian speech recognition system optimized based on the XLSR-53 architecture, trained using the Common Voice 6.1 Persian dataset, suitable for Persian speech-to-text tasks.

Model Features

High-performance Persian recognition

Achieves 30.12% word error rate and 7.37% character error rate on the Common Voice Persian test set

Based on XLSR-53 architecture

Utilizes the large-scale self-supervised pre-trained XLSR-53 model for fine-tuning

16kHz sampling rate support

Optimized for 16kHz sampling rate voice input

Model Capabilities

Persian speech recognition

Speech-to-text

Audio transcription

Use Cases

Speech transcription

Persian speech-to-text

Convert Persian speech content into text format

Achieves 30.12% word error rate on the Common Voice test set

Voice assistants

Persian voice command recognition

For understanding voice commands in Persian voice assistants

language: fa datasets:

common_voice metrics:
wer
cer tags:
audio
automatic-speech-recognition
speech
xlsr-fine-tuning-week license: apache-2.0 model-index:
name: XLSR Wav2Vec2 Persian by Jonatas Grosman results:
- task: name: Speech Recognition type: automatic-speech-recognition dataset: name: Common Voice fa type: common_voice args: fa metrics:
  - name: Test WER type: wer value: 30.12
  - name: Test CER type: cer value: 7.37

Fine-tuned XLSR-53 large model for speech recognition in Persian

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Persian using the train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16kHz.

This model has been fine-tuned thanks to the GPU credits generously given by the OVHcloud :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

Usage

The model can be used directly (without a language model) as follows...

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-persian")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fa"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
SAMPLES = 5

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Reference	Prediction
از مهمونداری کنار بکشم	از مهمانداری کنار بکشم
برو از مهرداد بپرس.	برو از ماقدعاد به پرس
خب ، تو چیكار می كنی؟	خوب تو چیکار می کنی
مسقط پایتخت عمان در عربی به معنای محل سقوط است	مسقط پایتخت عمان در عربی به بعنای محل سقوط است
آه، نه اصلاُ!	اهنه اصلا
توانست	توانست
قصیده فن شعر میگوید ای دوستان	قصیده فن شعر میگوید ایدوستون
دو استایل متفاوت دارین	دوبوست داریل و متفاوت بری
دو روز قبل از کریسمس ؟	اون مفتود پش پشش
ساعت های کاری چیست؟	این توری که موشیکل خب

Evaluation

The model can be evaluated as follows on the Persian test data of Common Voice.

import torch
import re
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fa"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")

Test Result:

In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-04-22). Note that the table below may show different results from those already reported, this may have been caused due to some specificity of the other evaluation scripts used.

Model	WER	CER
jonatasgrosman/wav2vec2-large-xlsr-53-persian	30.12%	7.37%
m3hrdadfi/wav2vec2-large-xlsr-persian-v2	33.85%	8.79%
m3hrdadfi/wav2vec2-large-xlsr-persian	34.37%	8.98%

Citation

If you want to cite this model you can use this:

@misc{grosman2021xlsr53-large-persian,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {P}ersian},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-persian}},
  year={2021}
}

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご