Wav2vec2 Large Xlsr Turkish
A Turkish speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Turkish Common Voice dataset
Downloads: 384
Release Time: 3/2/2022
Model Overview
This model is an automatic speech recognition (ASR) system optimized for Turkish; it expects speech input sampled at 16kHz.
Model Features
Turkish Optimization
Fine-tuned specifically on Turkish data, providing improved recognition accuracy on Turkish speech
Based on XLSR Large Model
Built on Facebook's wav2vec2-large-xlsr-53 cross-lingual model, which provides strong speech feature extraction capabilities
16kHz Sampling Rate Support
Supports standard 16kHz sampling rate voice input
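Audio recorded at other rates should be resampled to 16kHz before inference. A minimal sketch with torchaudio, where the file path is a placeholder for your own recording:

import torchaudio

# "example.wav" is a placeholder; load it and resample to the 16kHz the model expects
waveform, orig_sr = torchaudio.load("example.wav")
waveform_16k = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16_000)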
Model Capabilities
Turkish speech recognition
Audio to text
Automatic speech transcription
Use Cases
Speech Transcription
Turkish Speech to Text
Converts Turkish speech content into text
Word Error Rate (WER): 27.51% on the Common Voice Turkish test set
Voice Assistants
Turkish Voice Command Recognition
Used for command recognition in Turkish voice assistant systems
🚀 Wav2Vec2-Large-XLSR-53-Turkish
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Turkish, leveraging the Common Voice dataset. Ensure your speech input is sampled at 16kHz when using this model.
🚀 Quick Start
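For a quick smoke test, the generic transformers ASR pipeline can also load this checkpoint. This is a minimal sketch; the audio path is a placeholder for any 16kHz Turkish recording:

from transformers import pipeline

# load the checkpoint into the generic ASR pipeline
asr = pipeline("automatic-speech-recognition", model="m3hrdadfi/wav2vec2-large-xlsr-turkish")

# "turkish_sample.wav" is a placeholder; pass any 16kHz Turkish speech file
print(asr("turkish_sample.wav")["text"])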
The model can be used directly (without a language model) as follows:
💻 Usage Examples
Basic Usage
Requirements
# requirement packages
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
!pip install jiwer
Prediction
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset
import numpy as np
import re
import string
import IPython.display as ipd
chars_to_ignore = [
    ",", "?", ".", "!", "-", ";", ":", '""', "%", "'", '"', "�",
    "#", "!", "?", "«", "»", "(", ")", "؛", ",", "?", ".", "!", "-", ";", ":", '"',
    "“", "%", "‘", "�", "–", "…", "_", "”", '“', '„'
]
chars_to_mapping = {
    "\u200c": " ", "\u200d": " ", "\u200e": " ", "\u200f": " ", "\ufeff": " ",
}
def multiple_replace(text, chars_to_mapping):
    pattern = "|".join(map(re.escape, chars_to_mapping.keys()))
    return re.sub(pattern, lambda m: chars_to_mapping[m.group()], str(text))

def remove_special_characters(text, chars_to_ignore_regex):
    text = re.sub(chars_to_ignore_regex, '', text).lower() + " "
    return text
def normalizer(batch, chars_to_ignore, chars_to_mapping):
    chars_to_ignore_regex = f"""[{"".join(chars_to_ignore)}]"""
    text = batch["sentence"].lower().strip()
    text = text.replace("\u0307", " ").strip()
    text = multiple_replace(text, chars_to_mapping)
    text = remove_special_characters(text, chars_to_ignore_regex)
    batch["sentence"] = text
    return batch
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    # resample to the 16kHz the model expects (librosa >= 0.10 requires keyword arguments)
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)
    batch["speech"] = speech_array
    return batch
def predict(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)[0]
    return batch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-turkish")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-turkish").to(device)
# load a 1% slice of the Turkish Common Voice test split
dataset = load_dataset("common_voice", "tr", split="test[:1%]")
dataset = dataset.map(
    normalizer,
    fn_kwargs={"chars_to_ignore": chars_to_ignore, "chars_to_mapping": chars_to_mapping},
    remove_columns=list(set(dataset.column_names) - set(['sentence', 'path']))
)
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)
max_items = np.random.randint(0, len(result), 10).tolist()
for i in max_items:
    reference, predicted = result["sentence"][i], result["predicted"][i]
    print("reference:", reference)
    print("predicted:", predicted)
    print('---')
Output:
reference: ülke şu anda iki federasyona üye
predicted: ülke şu anda iki federasyona üye
---
reference: foruma dört yüzde fazla kişi katıldı
predicted: soruma dört yüzden fazla kişi katıldı
---
reference: mobi altmış üç çalışanları da mutsuz
predicted: mobia haltmış üç çalışanları da mutsur
---
reference: kentin mali esnekliğinin düşük olduğu bildirildi
predicted: kentin mali esnekleğinin düşük olduğu bildirildi
---
reference: fouere iki ülkeyi sorunu abartmamaya çağırdı
predicted: foor iki ülkeyi soruna abartmamaya çanayordı
---
reference: o ülkeden herhangi bir tepki geldi mi
predicted: o ülkeden herhayın bir tepki geldi mi
---
reference: bunlara asla sırtımızı dönmeyeceğiz
predicted: bunlara asla sırtımızı dönmeyeceğiz
---
reference: sizi ayakta tutan nedir
predicted: sizi ayakta tutan nedir
---
reference: artık insanlar daha bireysel yaşıyor
predicted: artık insanlar daha bir eyselli yaşıyor
---
reference: her ikisi de diyaloga hazır olduğunu söylüyor
predicted: her ikisi de diyaloğa hazır olduğunu söylüyor
---
reference: merkez bankasının başlıca amacı düşük enflasyon
predicted: merkez bankasının başlrıca anatı güşükyen flasyon
---
reference: firefox
predicted: fair foks
---
reference: ülke halkı çok misafirsever ve dışa dönük
predicted: ülke halktı çok isatirtever ve dışa dönük
---
reference: ancak kamuoyu bu durumu pek de affetmiyor
predicted: ancak kamuonyulgukirmu pek deafıf etmiyor
---
reference: i ki madende iki bin beş yüzden fazla kişi çalışıyor
predicted: i ki madende iki bin beş yüzden fazla kişi çalışıyor
---
reference: sunnyside park dışarıdan oldukça iyi görünüyor
predicted: sani sahip park dışarıdan oldukça iyi görünüyor
---
reference: büyük ödül on beş bin avro
predicted: büyük ödül on beş bin avro
---
reference: köyümdeki camiler depoya dönüştürüldü
predicted: küyümdeki camiler depoya dönüştürüldü
---
reference: maç oldukça diplomatik bir sonuçla birbir bitti
predicted: maç oldukça diplomatik bir sonuçla bir birbitti
---
reference: kuşların ikisi de karantinada öldüler
predicted: kuşların ikiste karantinada özdüler
---
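The script imports IPython.display as ipd but never calls it; in a notebook it can be used to listen to any of the decoded samples and compare against the printed transcript. A minimal sketch (the index is arbitrary):

# play the first preprocessed 16kHz sample in a notebook cell
ipd.Audio(data=np.asarray(result["speech"][0]), rate=16_000)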
Advanced Usage
The model can be evaluated as follows on the Turkish test data of Common Voice.
import librosa
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import load_dataset, load_metric
import numpy as np
import re
import string
chars_to_ignore = [
    ",", "?", ".", "!", "-", ";", ":", '""', "%", "'", '"', "�",
    "#", "!", "?", "«", "»", "(", ")", "؛", ",", "?", ".", "!", "-", ";", ":", '"',
    "“", "%", "‘", "�", "–", "…", "_", "”", '“', '„'
]
chars_to_mapping = {
    "\u200c": " ", "\u200d": " ", "\u200e": " ", "\u200f": " ", "\ufeff": " ",
    "\u0307": " "
}
def multiple_replace(text, chars_to_mapping):
    pattern = "|".join(map(re.escape, chars_to_mapping.keys()))
    return re.sub(pattern, lambda m: chars_to_mapping[m.group()], str(text))

def remove_special_characters(text, chars_to_ignore_regex):
    text = re.sub(chars_to_ignore_regex, '', text).lower() + " "
    return text
def normalizer(batch, chars_to_ignore, chars_to_mapping):
    chars_to_ignore_regex = f"""[{"".join(chars_to_ignore)}]"""
    text = batch["sentence"].lower().strip()
    text = text.replace("\u0307", " ").strip()
    text = multiple_replace(text, chars_to_mapping)
    text = remove_special_characters(text, chars_to_ignore_regex)
    text = re.sub(" +", " ", text)
    text = text.strip() + " "
    batch["sentence"] = text
    return batch
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    # resample to the 16kHz the model expects (librosa >= 0.10 requires keyword arguments)
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)
    batch["speech"] = speech_array
    return batch
def predict(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)[0]
    return batch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-turkish")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-turkish").to(device)
dataset = load_dataset("common_voice", "tr", split="test")
dataset = dataset.map(
    normalizer,
    fn_kwargs={"chars_to_ignore": chars_to_ignore, "chars_to_mapping": chars_to_mapping},
    remove_columns=list(set(dataset.column_names) - set(['sentence', 'path']))
)
dataset = dataset.map(speech_file_to_array_fn)
result = dataset.map(predict)
wer = load_metric("wer")
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["predicted"], references=result["sentence"])))
Test Result:
- WER: 27.51%
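WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. It can be checked by hand with the jiwer package from the requirements; taking one reference/prediction pair from the output above, two substituted words out of six give a WER of roughly 0.33:

import jiwer

reference = "foruma dört yüzde fazla kişi katıldı"
predicted = "soruma dört yüzden fazla kişi katıldı"

# two substitutions ("foruma" -> "soruma", "yüzde" -> "yüzden") over six reference words
print(jiwer.wer(reference, predicted))  # 0.333...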
📚 Documentation
Training & Report
The Common Voice train and validation splits were used for training.
You can see the training states here.
The script used for training can be found here.
📄 License
This model is licensed under the Apache-2.0 license.
Property | Details
---|---
Model Type | Fine-tuned Wav2Vec2-Large-XLSR-53 for Turkish
Training Data | Common Voice train and validation splits