wav2vec2-Georgian-Daytona Open-Source Speech Recognition Model - Free Deployment for Precise Georgian Language Recognition

Wav2vec2 Georgian Daytona

Developed by Temur

A Georgian speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Common Voice dataset.

Speech Recognition | Other | Open Source License: Apache-2.0 | #Georgian speech recognition #XLSR fine-tuned model #Low-resource language processing

Downloads 19

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model fine-tuned for Georgian. It converts Georgian audio into text.

Model Features

Georgian optimization

Fine-tuned specifically on Georgian speech to improve recognition accuracy for the language.

Based on XLSR large model

Built on the facebook/wav2vec2-large-xlsr-53 model, inheriting the speech representations it learned during multilingual pretraining.

16kHz sampling rate support

Expects audio input sampled at 16 kHz; audio recorded at other rates must be resampled before inference.
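Since the model expects 16 kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch using Python's standard `wave` module. The helper name `needs_resampling` is hypothetical, and the sketch only handles uncompressed WAV files; other formats would need a different reader.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, in Hz


def needs_resampling(path: str, target_rate: int = TARGET_RATE) -> bool:
    """Return True if the WAV file at `path` is not sampled at `target_rate`."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

If the function returns True, the audio should be resampled to 16 kHz (for example with `torchaudio.transforms.Resample`, as in the usage examples) before it is passed to the processor.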

Model Capabilities

Georgian speech recognition

Audio to text conversion

Automatic speech transcription

Use Cases

Speech transcription

Georgian speech to text

Converts Georgian speech into editable text.

Word Error Rate (WER): 48.34% on the Common Voice Georgian (ka) test set

Voice assistants

Georgian voice command recognition

Can be used to build voice assistants and control systems that support Georgian.

🚀 Wav2Vec2-Large-XLSR-53-Georgian

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Georgian using the Common Voice dataset. It is designed for automatic speech recognition in the Georgian language.

Property	Details
Datasets	common_voice
Metrics	wer
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0

Model Index

Name: Georgian WAV2VEC2 Daytona
- Results:
  - Task:
    - Name: Speech Recognition
    - Type: automatic-speech-recognition
  - Dataset:
    - Name: Common Voice ka
    - Type: common_voice
    - Args: ka
  - Metrics:
    - Name: Test WER
    - Type: wer
    - Value: 48.34

🚀 Quick Start

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Georgian using the Common Voice dataset. When using it, make sure that your speech input is sampled at 16 kHz.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ka", split="test[:2%]") 

processor = Wav2Vec2Processor.from_pretrained("Temur/wav2vec2-Georgian-Daytona") 
model = Wav2Vec2ForCTC.from_pretrained("Temur/wav2vec2-Georgian-Daytona")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays and resample them to 16 kHz
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
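The `torch.argmax` and `processor.batch_decode` steps above amount to greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, then drop the blank (padding) token. A minimal pure-Python sketch of that collapse step follows. The toy vocabulary, the blank ID of 0, the `|` word delimiter, and the function name `ctc_greedy_collapse` are illustrative assumptions, not the model's actual tokenizer configuration.

```python
from itertools import groupby
from typing import Dict, List


def ctc_greedy_collapse(frame_ids: List[int], id_to_token: Dict[int, str], blank_id: int = 0) -> str:
    """Collapse per-frame argmax IDs into text, CTC style."""
    # 1. Merge each run of identical consecutive IDs into a single ID.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, which only separates repeated characters.
    kept = [token_id for token_id in merged if token_id != blank_id]
    # 3. Map IDs to characters and turn the word delimiter into a space.
    text = "".join(id_to_token[token_id] for token_id in kept)
    return text.replace("|", " ").strip()


# Hypothetical toy vocabulary: 0 is the blank, 1 is the word delimiter.
toy_vocab = {0: "<pad>", 1: "|", 2: "კ", 3: "ი", 4: "დ", 5: "ა"}
frames = [2, 2, 0, 3, 3, 1, 1, 4, 0, 5, 5]
print(ctc_greedy_collapse(frames, toy_vocab))  # prints "კი და"
```

Note that a blank between two identical IDs keeps both characters, which is how CTC represents doubled letters.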

📚 Documentation

Evaluation

The model can be evaluated as follows on the Georgian test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "ka", split="test") 
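# Note: load_metric is deprecated in newer `datasets` releases; the separate `evaluate` library's evaluate.load("wer") replaces it.
wer = load_metric("wer")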

processor = Wav2Vec2Processor.from_pretrained("Temur/wav2vec2-Georgian-Daytona") 
model = Wav2Vec2ForCTC.from_pretrained("Temur/wav2vec2-Georgian-Daytona")
model.to("cuda")

chars_to_ignore_regex = r'[,?.!\-;:"“]'  # TODO: adapt this list to include all special characters you removed from the data
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to strip ignored characters from the transcripts and read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# We transcribe each batch on the GPU and store the predicted strings
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 48.34 %
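For reference, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between prediction and reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, aggregated over a list of sentence pairs. It is not the implementation behind `load_metric("wer")`, and the function names `word_edit_distance` and `word_error_rate` are hypothetical.

```python
from typing import List


def word_edit_distance(reference: List[str], hypothesis: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # previous[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(references: List[str], predictions: List[str]) -> float:
    """Total word errors divided by total reference words, over all pairs."""
    errors = sum(
        word_edit_distance(ref.split(), pred.split())
        for ref, pred in zip(references, predictions)
    )
    total_words = sum(len(ref.split()) for ref in references)
    return errors / total_words


# One substitution and one deletion against four reference words.
print(word_error_rate(["a b c d"], ["a x c"]))  # prints 0.5
```

On this scale, the 48.34 % reported above corresponds to a value of about 0.4834, or roughly 48 word errors per 100 reference words.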

Training

The Common Voice train, validation, and ... datasets were used for training as well as ... and ... # TODO: adapt to state all the datasets that were used for training.

The script used for training can be found here

📄 License

This project is licensed under the Apache-2.0 license.
