Wav2vec2-large-xlsr-Georgian Open-source Model - Free Deployment, Precise Georgian Speech Recognition

Wav2vec2 Large Xlsr Georgian

Developed by xsway

Georgian automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampled audio input

Speech Recognition

Transformers

Other

Open Source License: Apache-2.0

#Georgian speech recognition #XLSR fine-tuning #Low-resource language processing

Downloads 14.80k

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) system for Georgian, fine-tuned from the XLSR-53 checkpoint (facebook/wav2vec2-large-xlsr-53) and suited to Georgian speech-to-text tasks.

Model Features

Georgian Optimization

Specially fine-tuned for Georgian to improve recognition accuracy for this language

No Language Model Required

Can be used directly without additional language model support

16kHz Sampling Rate Support

Supports standard 16kHz sampled audio input

Model Capabilities

Georgian speech recognition

Speech-to-text

Automatic speech recognition

Use Cases

Speech Transcription

Georgian Speech Transcription

Convert Georgian speech content into text

Word Error Rate: 45.28% (reported test result)

Voice Assistants

Georgian Voice Command Recognition

Used for voice command recognition in Georgian voice assistant systems

🚀 Wav2Vec2-Large-XLSR-53-Georgian

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Georgian using the Common Voice dataset. It is designed for speech recognition in Georgian.

🚀 Quick Start

This is facebook/wav2vec2-large-xlsr-53 fine-tuned on Georgian using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz.
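Because the model expects 16 kHz input, it can help to check a file's sample rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module. The helper names `needs_resampling` and `write_silence` and the generated clips are illustrative assumptions, not part of the model card, and the check only covers uncompressed WAV files.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model card asks for


def needs_resampling(wav_path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the WAV file's sample rate differs from target_rate."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate


def write_silence(wav_path, sample_rate, seconds=1):
    """Write a mono 16-bit WAV of silence (hypothetical test fixture)."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * sample_rate * seconds)


with tempfile.TemporaryDirectory() as tmp_dir:
    path_48k = os.path.join(tmp_dir, "clip_48k.wav")
    path_16k = os.path.join(tmp_dir, "clip_16k.wav")
    write_silence(path_48k, 48_000)
    write_silence(path_16k, 16_000)
    check_48k = needs_resampling(path_48k)  # 48 kHz clip would need resampling
    check_16k = needs_resampling(path_16k)  # 16 kHz clip can be used as-is
    print(check_48k, check_16k)
```

Files flagged by such a check would then go through a resampling step like the `librosa.resample` call in the usage examples.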

✨ Features

Dataset: Fine-tuned on the Georgian subset of the Common Voice dataset.
Metrics: Evaluated using Word Error Rate (WER).
Task: Suitable for automatic speech recognition tasks in Georgian.

📦 Installation

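The original model card gives no installation steps. The usage examples below import torch, torchaudio, librosa, datasets, and transformers, so these packages must be available in your Python environment.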

💻 Usage Examples

Basic Usage

import librosa
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


test_dataset = load_dataset("common_voice", "ka", split="test[:2%]") 

processor = Wav2Vec2Processor.from_pretrained("xsway/wav2vec2-large-xlsr-georgian")
model = Wav2Vec2ForCTC.from_pretrained("xsway/wav2vec2-large-xlsr-georgian") 

# Resample to 16 kHz; keyword arguments are required by recent librosa releases.
resampler = lambda sampling_rate, y: librosa.resample(y.numpy().squeeze(), orig_sr=sampling_rate, target_sr=16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(sampling_rate, speech_array).squeeze()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
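The `torch.argmax` plus `processor.batch_decode` step above amounts to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_decode` are assumptions for illustration; the real model ships its own vocabulary, and the `|` word delimiter is only the usual Wav2Vec2 convention.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the rest to text."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous:  # keep only the first id of each run
            collapsed.append(token_id)
        previous = token_id
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the model's actual vocab file.
toy_vocab = {0: "<pad>", 1: "|", 2: "გ", 3: "ა", 4: "მ", 5: "რ"}
# Per-frame argmax ids, shaped like one row of torch.argmax(logits, dim=-1).
frame_ids = [2, 2, 0, 3, 0, 4, 4, 4, 1, 0, 3, 3, 0, 3, 5]
decoded = ctc_greedy_decode(frame_ids, toy_vocab, blank_id=0)
print(decoded)
```

The blank between the two runs of id 3 is what lets a genuinely doubled letter survive the repeat-merging step.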

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re
import librosa

test_dataset = load_dataset("common_voice", "ka", split="test") 
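# Note: load_metric is deprecated in recent `datasets` releases in favour of the separate `evaluate` package.
wer = load_metric("wer")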

processor = Wav2Vec2Processor.from_pretrained("xsway/wav2vec2-large-xlsr-georgian") 
model = Wav2Vec2ForCTC.from_pretrained("xsway/wav2vec2-large-xlsr-georgian") 
model.to("cuda")

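# Punctuation removed from reference sentences before scoring.
chars_to_ignore_regex = r'[,?.!\-;:"“]'
# Resample to 16 kHz; keyword arguments are required by recent librosa releases.
resampler = lambda sampling_rate, y: librosa.resample(y.numpy().squeeze(), orig_sr=sampling_rate, target_sr=16_000)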

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(sampling_rate, speech_array).squeeze()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# Transcribe each batch with greedy decoding and store the predictions
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
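The evaluation script strips punctuation from the reference sentences and lowercases them before scoring, so that punctuation differences are not counted as word errors. A minimal standalone sketch of that normalization step follows. The pattern is a simplified form of the one in the script, and the name `normalize_sentence` and the sample sentence are illustrative assumptions.

```python
import re

# Punctuation to drop before scoring (simplified form of the script's pattern).
CHARS_TO_IGNORE = re.compile(r'[,?.!\-;:"“]')


def normalize_sentence(sentence):
    """Remove ignored punctuation and lowercase the text."""
    return CHARS_TO_IGNORE.sub("", sentence).lower()


sample = "გამარჯობა, მსოფლიო!"
normalized = normalize_sentence(sample)
print(normalized)
```

Whatever normalization is chosen should be applied consistently when comparing WER figures, since it changes which words count as matches.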

📚 Documentation

Model Information

Property	Details
Model Type	XLSR Wav2Vec finetuned for Georgian
Training Data	Common Voice `train` and `validation` datasets for Georgian
Metrics	Word Error Rate (WER)
Test Result	45.28 %
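For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The following is a minimal per-sentence sketch in plain Python, not the implementation behind the reported figure. The function name `word_error_rate` and the English example sentences are assumptions for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous_row[j]: edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) over six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {100 * example_wer:.2f}%")
```

Read this way, a WER of 45.28 % means roughly 45 word-level edits per 100 reference words. Corpus-level tools typically sum errors over all sentences before dividing, so their result can differ from an average of per-sentence values.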

Training

The Common Voice `train` and `validation` splits were used for training. The original model card links to the training script.

📄 License

This model is licensed under the Apache 2.0 license.
