Wav2vec2 Large Xlsr Turkish
A Turkish speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Turkish Common Voice dataset
Downloads: 384
Release Time: 3/2/2022
Model Overview
This model is an automatic speech recognition (ASR) system optimized for Turkish; it expects speech input sampled at 16kHz.
Model Features
Turkish Optimization
Fine-tuned specifically on Turkish data, providing improved recognition accuracy on Turkish speech
Based on XLSR Large Model
Built on Facebook's wav2vec2-large-xlsr-53 cross-lingual model, which provides strong speech feature extraction capabilities
16kHz Sampling Rate Support
Supports standard 16kHz sampling rate voice input
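Audio recorded at other rates should be resampled to 16kHz before inference. A minimal sketch with torchaudio, where the file path is a placeholder for your own recording:

import torchaudio

# "example.wav" is a placeholder; load it and resample to the 16kHz the model expects
waveform, orig_sr = torchaudio.load("example.wav")
waveform_16k = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16_000)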
Model Capabilities
Turkish speech recognition
Audio to text
Automatic speech transcription
Use Cases
Speech Transcription
Turkish Speech to Text
Converts Turkish speech content into text
Word Error Rate (WER): 27.51% on the Common Voice Turkish test set
Voice Assistants
Turkish Voice Command Recognition
Used for command recognition in Turkish voice assistant systems
🚀 Wav2Vec2-Large-XLSR-53-Turkish
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Turkish, leveraging the Common Voice dataset. Ensure your speech input is sampled at 16kHz when using this model.
🚀 Quick Start
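For a quick smoke test, the generic transformers ASR pipeline can also load this checkpoint. This is a minimal sketch; the audio path is a placeholder for any 16kHz Turkish recording:

from transformers import pipeline

# load the checkpoint into the generic ASR pipeline
asr = pipeline("automatic-speech-recognition", model="m3hrdadfi/wav2vec2-large-xlsr-turkish")

# "turkish_sample.wav" is a placeholder; pass any 16kHz Turkish speech file
print(asr("turkish_sample.wav")["text"])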
The model can be used directly (without a language model) as follows:
💻 Usage Examples
Basic Usage
Requirements
# requirement packages
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
!pip install jiwer
Prediction
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset
import numpy as np
import re
import string
import IPython.display as ipd
chars_to_ignore = [
    ",", "?", ".", "!", "-", ";", ":", '""', "%", "'", '"', "�",
    "#", "!", "?", "«", "»", "(", ")", "؛", ",", "?", ".", "!", "-", ";", ":", '"',
    "“", "%", "‘", "�", "–", "…", "_", "”", '“', '„'
]
chars_to_mapping = {
    "\u200c": " ", "\u200d": " ", "\u200e": " ", "\u200f": " ", "\ufeff": " ",
}
def multiple_replace(text, chars_to_mapping):
    pattern = "|".join(map(re.escape, chars_to_mapping.keys()))
    return re.sub(pattern, lambda m: chars_to_mapping[m.group()], str(text))

def remove_special_characters(text, chars_to_ignore_regex):
    text = re.sub(chars_to_ignore_regex, '', text).lower() + " "
    return text
def normalizer(batch, chars_to_ignore, chars_to_mapping):
    chars_to_ignore_regex = f"""[{"".join(chars_to_ignore)}]"""
    text = batch["sentence"].lower().strip()
    text = text.replace("\u0307", " ").strip()
    text = multiple_replace(text, chars_to_mapping)
    text = remove_special_characters(text, chars_to_ignore_regex)
    batch["sentence"] = text
    return batch
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    # resample to the 16kHz the model expects (librosa >= 0.10 requires keyword arguments)
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)
    batch["speech"] = speech_array
    return batch
def predict(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)[0]
    return batch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-turkish")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-turkish").to(device)
# load a 1% slice of the Turkish Common Voice test split
dataset = load_dataset("common_voice", "tr", split="test[:1%]")
dataset = dataset.map(
    normalizer,
    fn_kwargs={"chars_to_ignore": chars_to_ignore, "chars_to_mapping": chars_to_mapping},
    remove_columns=list(set(dataset.column_names) - set(['sentence', 'path']))
)
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)
max_items = np.random.randint(0, len(result), 10).tolist()
for i in max_items:
    reference, predicted = result["sentence"][i], result["predicted"][i]
    print("reference:", reference)
    print("predicted:", predicted)
    print('---')
Output:
reference: ülke şu anda iki federasyona üye
predicted: ülke şu anda iki federasyona üye
---
reference: foruma dört yüzde fazla kişi katıldı
predicted: soruma dört yüzden fazla kişi katıldı
---
reference: mobi altmış üç çalışanları da mutsuz
predicted: mobia haltmış üç çalışanları da mutsur
---
reference: kentin mali esnekliğinin düşük olduğu bildirildi
predicted: kentin mali esnekleğinin düşük olduğu bildirildi
---
reference: fouere iki ülkeyi sorunu abartmamaya çağırdı
predicted: foor iki ülkeyi soruna abartmamaya çanayordı
---
reference: o ülkeden herhangi bir tepki geldi mi
predicted: o ülkeden herhayın bir tepki geldi mi
---
reference: bunlara asla sırtımızı dönmeyeceğiz
predicted: bunlara asla sırtımızı dönmeyeceğiz
---
reference: sizi ayakta tutan nedir
predicted: sizi ayakta tutan nedir
---
reference: artık insanlar daha bireysel yaşıyor
predicted: artık insanlar daha bir eyselli yaşıyor
---
reference: her ikisi de diyaloga hazır olduğunu söylüyor
predicted: her ikisi de diyaloğa hazır olduğunu söylüyor
---
reference: merkez bankasının başlıca amacı düşük enflasyon
predicted: merkez bankasının başlrıca anatı güşükyen flasyon
---
reference: firefox
predicted: fair foks
---
reference: ülke halkı çok misafirsever ve dışa dönük
predicted: ülke halktı çok isatirtever ve dışa dönük
---
reference: ancak kamuoyu bu durumu pek de affetmiyor
predicted: ancak kamuonyulgukirmu pek deafıf etmiyor
---
reference: i ki madende iki bin beş yüzden fazla kişi çalışıyor
predicted: i ki madende iki bin beş yüzden fazla kişi çalışıyor
---
reference: sunnyside park dışarıdan oldukça iyi görünüyor
predicted: sani sahip park dışarıdan oldukça iyi görünüyor
---
reference: büyük ödül on beş bin avro
predicted: büyük ödül on beş bin avro
---
reference: köyümdeki camiler depoya dönüştürüldü
predicted: küyümdeki camiler depoya dönüştürüldü
---
reference: maç oldukça diplomatik bir sonuçla birbir bitti
predicted: maç oldukça diplomatik bir sonuçla bir birbitti
---
reference: kuşların ikisi de karantinada öldüler
predicted: kuşların ikiste karantinada özdüler
---
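The script imports IPython.display as ipd but never calls it; in a notebook it can be used to listen to any of the decoded samples and compare against the printed transcript. A minimal sketch (the index is arbitrary):

# play the first preprocessed 16kHz sample in a notebook cell
ipd.Audio(data=np.asarray(result["speech"][0]), rate=16_000)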
Advanced Usage
The model can be evaluated as follows on the Turkish test data of Common Voice.
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset, load_metric
import numpy as np
import re
import string
chars_to_ignore = [
    ",", "?", ".", "!", "-", ";", ":", '""', "%", "'", '"', "�",
    "#", "!", "?", "«", "»", "(", ")", "؛", ",", "?", ".", "!", "-", ";", ":", '"',
    "“", "%", "‘", "�", "–", "…", "_", "”", '“', '„'
]
chars_to_mapping = {
    "\u200c": " ", "\u200d": " ", "\u200e": " ", "\u200f": " ", "\ufeff": " ",
    "\u0307": " "
}
def multiple_replace(text, chars_to_mapping):
    pattern = "|".join(map(re.escape, chars_to_mapping.keys()))
    return re.sub(pattern, lambda m: chars_to_mapping[m.group()], str(text))

def remove_special_characters(text, chars_to_ignore_regex):
    text = re.sub(chars_to_ignore_regex, '', text).lower() + " "
    return text
def normalizer(batch, chars_to_ignore, chars_to_mapping):
    chars_to_ignore_regex = f"""[{"".join(chars_to_ignore)}]"""
    text = batch["sentence"].lower().strip()
    text = text.replace("\u0307", " ").strip()
    text = multiple_replace(text, chars_to_mapping)
    text = remove_special_characters(text, chars_to_ignore_regex)
    text = re.sub(" +", " ", text)
    text = text.strip() + " "
    batch["sentence"] = text
    return batch
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    # resample to the 16kHz the model expects (librosa >= 0.10 requires keyword arguments)
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)
    batch["speech"] = speech_array
    return batch
def predict(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)[0]
    return batch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-turkish")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-turkish").to(device)
dataset = load_dataset("common_voice", "tr", split="test")
dataset = dataset.map(
    normalizer,
    fn_kwargs={"chars_to_ignore": chars_to_ignore, "chars_to_mapping": chars_to_mapping},
    remove_columns=list(set(dataset.column_names) - set(['sentence', 'path']))
)
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)
wer = load_metric("wer")
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["predicted"], references=result["sentence"])))
Test Result:
- WER: 27.51%
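WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It can be checked by hand with the jiwer package from the requirements; taking one reference/prediction pair from the output above, two substituted words out of six give a WER of roughly 0.33:

import jiwer

reference = "foruma dört yüzde fazla kişi katıldı"
predicted = "soruma dört yüzden fazla kişi katıldı"

# two substitutions ("foruma" -> "soruma", "yüzde" -> "yüzden") over six reference words
print(jiwer.wer(reference, predicted))  # 0.333...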
📚 Documentation
Training & Report
The Common Voice train and validation splits were used for training.
You can see the training states here.
The script used for training can be found here.
📄 License
This model is licensed under the Apache-2.0 license.
Property | Details
---|---
Model Type | Fine-tuned Wav2Vec2-Large-XLSR-53 for Turkish
Training Data | Common Voice train and validation splits