wav2vec2-large-xlsr-turkish-artificial: Open-Source Model for Turkish Speech Recognition

Wav2vec2 Large Xlsr Turkish Artificial

Developed by cahya

This is a Turkish speech recognition model, fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Artificial Common Voice dataset.

Speech Recognition | Other
Open Source License: Apache-2.0
#Turkish speech recognition #Artificial voice training #XLSR fine-tuning

Downloads 25

Release Time : 3/2/2022

Model Overview

This model is designed for Turkish automatic speech recognition (ASR). It expects audio input sampled at 16 kHz.

Model Features

Artificial voice training

Trained on the Artificial Common Voice dataset, which may improve recognition of the voice characteristics covered by that data

XLSR architecture-based

Fine-tuned from wav2vec2-large-xlsr-53, a wav2vec 2.0 model pretrained on speech in 53 languages

Turkish language support

Speech recognition model specifically optimized for Turkish

Model Capabilities

Turkish speech recognition

16kHz audio processing

Use Cases

Speech-to-text

Turkish speech transcription

Convert Turkish speech to text

Achieved WER of 66.98% on Common Voice Turkish test set

🚀 Wav2Vec2-Large-XLSR-Turkish

This is facebook/wav2vec2-large-xlsr-53 fine-tuned on the Turkish Artificial Common Voice dataset. It can be used for automatic speech recognition.

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
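
A quick way to check the 16 kHz requirement before running inference is to read the sample rate from the file header. The following is a minimal sketch that assumes the input is an uncompressed PCM WAV file; the helper names `wav_sample_rate` and `needs_resampling` are illustrative and not part of the model's API. For other formats such as MP3, the rate returned by `torchaudio.load` in the usage examples serves the same purpose.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, the audio should be resampled (for example with `torchaudio.transforms.Resample`, as in the usage examples) before it is passed to the processor.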

✨ Features

Language: Turkish
Datasets: Common Voice
Metrics: Word Error Rate (WER)
Tags: audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License: Apache 2.0

Property	Details
Model Type	XLSR Wav2Vec2 Turkish with Artificial Voices by Cahya
Training Data	Turkish Artificial Common Voice dataset (`train`, `validation`)

📦 Installation

The original README gives no installation steps. The usage examples require the Python libraries torch, torchaudio, datasets, and transformers, which can be installed with pip:

pip install torch torchaudio datasets transformers
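
To confirm that the packages are importable in the current environment, a small check like the one below can help. It is only a sketch: the helper name `missing_packages` is hypothetical, and the check looks at importability only, not at version compatibility.

```python
import importlib.util

# Packages used by the usage examples in this model card.
REQUIRED_PACKAGES = ["torch", "torchaudio", "datasets", "transformers"]


def missing_packages(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```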

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "tr", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-turkish-artificial")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-turkish-artificial")


# Preprocess the dataset:
# read each audio file as an array and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
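
The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that collapse rule in plain Python. The toy vocabulary and the blank id of 0 are assumptions made for illustration; the real vocabulary ships with the processor, which also handles details such as word delimiters.

```python
def ctc_greedy_collapse(token_ids, blank_id=0):
    """Merge consecutive duplicate ids, then drop the blank id."""
    collapsed = []
    previous = None
    for token_id in token_ids:
        # Keep a token only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {0: "<pad>", 1: "e", 2: "v", 3: "t"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one clip.
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 1, 3, 3]

decoded = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(decoded)  # evet
```

The blank between two runs is what lets the same character appear twice in a row: `[1, 0, 1]` collapses to two tokens, while `[1, 1]` collapses to one.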

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "tr", split="test")
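# Note: recent `datasets` releases no longer provide load_metric;
# there, evaluate.load("wer") from the `evaluate` package is the replacement.
wer = load_metric("wer")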

processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-turkish-artificial")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-turkish-artificial") 
model.to("cuda")

# Raw string so the backslashes reach the regex engine unchanged.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\‘\”\'\`…\’»«]'

# Preprocess the dataset:
# strip punctuation, lowercase the reference text,
# then read each audio file as an array and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference:
# greedy-decode the model's logits into predicted transcripts.
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# Report the WER as a percentage with two decimal places.
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 66.98 %
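
The reference sentences are normalized before scoring so that punctuation and casing do not count as errors. The following sketch isolates that step using the same character set as `chars_to_ignore_regex`; the function name `normalize_transcript` is illustrative.

```python
import re

# Same characters as chars_to_ignore_regex in the evaluation script.
CHARS_TO_IGNORE = r"[,?.!\-;:\"“‘”'`…’»«]"


def normalize_transcript(sentence):
    """Remove the ignored punctuation and lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_transcript("Merhaba, dünya!"))  # merhaba dünya
```

Note that `str.lower()` is not locale-aware, so the Turkish capital "I" maps to "i" rather than "ı". Whether this affects the score depends on how the reference text and the model vocabulary were prepared.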

📚 Documentation

The model can be evaluated on the Turkish test data of Common Voice as shown in the advanced usage example.
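
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) between the prediction and the reference, divided by the number of reference words. The sketch below computes it for a single utterance in plain Python. It is an illustration only, not the implementation behind the `wer` metric used in the evaluation script; corpus-level metrics typically sum errors and reference words over all utterances rather than averaging per-utterance scores.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: distance between the reference prefix processed so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One reference word is missing from the hypothesis: 1 error / 4 words.
print(word_error_rate("bugün hava çok güzel", "bugün hava güzel"))  # 0.25
```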

The Artificial Common Voice `train` and `validation` splits were used to fine-tune the model. The script used for training can be found here

📄 License

This project is licensed under the Apache 2.0 license.
