Open-source Speech Recognition Model wav2vec2-large-xlsr-greek-2 - Accurately Recognize Greek with Distinct Training Data

Wav2vec2 Large Xlsr Greek 2

Developed by skylord

A Greek speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Common Voice dataset, with synthesized female voice data added to balance the training set

Speech Recognition

Transformers

Open Source License: Apache-2.0 #Greek speech recognition #Wav2Vec2 fine-tuning #Multi-gender data balancing

Downloads 15

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Greek, fine-tuned from Facebook's XLSR-53 large model, specifically addressing gender imbalance issues in Greek speech data

Model Features

Gender-balanced training data

Addresses the male-dominated speaker distribution of the original dataset by synthesizing female voice data with Google TTS

Multi-stage fine-tuning

Adopted a phased fine-tuning strategy, first training on original data, then continuing training with synthesized data

Greek language optimization

Specifically optimized for Greek speech characteristics, handling unique Greek pronunciations and intonations

Model Capabilities

Greek speech recognition

16kHz audio processing

Direct inference without language model

Use Cases

Speech-to-text

Greek speech transcription

Convert Greek speech content into text

Achieved 45.05% WER on Common Voice test set

Voice assistants

Greek voice command recognition

Basic speech recognition component for Greek voice assistants

🚀 Wav2Vec2-Large-XLSR-53-Greek

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Greek data from Common Voice, aiming to provide high-quality automatic speech recognition for the Greek language.

| Property | Details |
|---|---|
| Model Type | Fine-tuned Wav2Vec2-Large-XLSR-53 for Greek |
| Training Data | Common Voice Greek dataset. To balance the gender ratio in the data, synthesized female voices were created using [Google's TTS Standard Voice model](https://cloud.google.com/text-to-speech) based on the text from the Common Voice dataset. |
| Metrics | Word Error Rate (WER) |
| License | Apache-2.0 |

⚠️ Important Note

When using this model, make sure that your speech input is sampled at 16kHz.
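One way to catch a mismatched sampling rate early is to read it from the file header before running inference. The following is a minimal sketch that uses only Python's standard `wave` module, so it applies to PCM WAV files only (Common Voice clips are distributed as MP3 and would need a different reader). The helper names `wav_sample_rate`, `needs_resampling`, and `write_silent_wav` are illustrative and not part of the model's code.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file must be resampled before inference."""
    return wav_sample_rate(path) != expected_rate


def write_silent_wav(path, rate, seconds=0.1):
    """Demo helper: write a short mono 16-bit silent WAV at the given rate."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        for rate in (16_000, 48_000):
            path = os.path.join(tmp_dir, f"clip_{rate}.wav")
            write_silent_wav(path, rate)
            print(rate, "Hz -> needs resampling:", needs_resampling(path))
```

Files flagged by such a check can then be resampled, for example with `torchaudio.transforms.Resample` as the usage examples in this card do for 48 kHz input.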

🚀 Quick Start

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Greek using the Common Voice dataset. The Greek Common Voice data consists mostly of male voices. To balance it, synthesized female voices were created using the approach discussed here. The text from the Common Voice dataset was used to synthesize female speakers' voices with [Google's TTS Standard Voice model](https://cloud.google.com/text-to-speech).

The fine - tuning results are as follows:

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Greek Common Voice for 5 epochs >> 56.25% WER
Resumed from checkpoints and trained for another 15 epochs >> 34.00% WER
Added synthesized female voices and trained for another 12 epochs >> 34.00% WER (no change)
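For readers unfamiliar with the metric reported above: word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration for a single sentence pair, not the implementation used to produce the numbers in this card; the function name `word_error_rate` and the sample sentences are made up for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far (excluding the current one) and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("το σπίτι είναι μεγάλο", "το σπίτι ήταν μεγάλο"))
    # One dropped word out of three reference words.
    print(word_error_rate("καλημέρα σας κύριε", "καλημέρα σας"))
```

Under this definition, a WER of 34.00% corresponds to roughly one word-level error for every three reference words.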

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
test_dataset = load_dataset("common_voice", "el", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("skylord/greek_lsr_1") 
model = Wav2Vec2ForCTC.from_pretrained("skylord/greek_lsr_1") 

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# We need to read the audio files as arrays and resample them to 16 kHz.
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
  
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
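The `torch.argmax` and `processor.batch_decode` steps above amount to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, drop the blank token, and turn word delimiters into spaces. The sketch below shows that logic on a toy example in plain Python. The six-token vocabulary and the frame ids are hypothetical; the real vocabulary ships with the model's processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only.
BLANK = "<pad>"        # CTC blank token
WORD_DELIMITER = "|"   # marks a word boundary
VOCAB = [BLANK, WORD_DELIMITER, "α", "ά", "λ", "τ"]


def greedy_ctc_decode(frame_ids, vocab=VOCAB):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    tokens = [vocab[i] for i in frame_ids]
    collapsed = [token for token, _ in groupby(tokens)]  # merge repeated runs
    kept = [token for token in collapsed if token != BLANK]
    return "".join(kept).replace(WORD_DELIMITER, " ").strip()


if __name__ == "__main__":
    # Per-frame argmax ids. The blank between the two "λ" frames is what
    # lets the double letter in "άλλα" survive the repeat-merging step.
    frame_ids = [5, 5, 2, 1, 1, 3, 4, 0, 4, 2]
    print(greedy_ctc_decode(frame_ids))  # prints: τα άλλα
```

Because no language model rescoring is involved, every frame decision is made independently, which is what "direct inference without language model" refers to.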

📚 Documentation

Evaluation

The model can be evaluated as follows on the Greek test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "el", split="test") 
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("skylord/greek_lsr_1") 
model = Wav2Vec2ForCTC.from_pretrained("skylord/greek_lsr_1")
model.to("cuda")

chars_to_ignore_regex = r'[,?.!\-;:"“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: strip ignored punctuation, lowercase the reference,
# and read each audio file into a 16 kHz array.

def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch
  
test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and decode the predictions with greedy (argmax) decoding.

def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 45.048955 %
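The evaluation script normalizes each reference sentence before scoring: it removes a fixed set of punctuation characters and lowercases the text, so that punctuation and casing differences are not counted as word errors. The standalone sketch below isolates that step. The character set mirrors the one in the script (comma, question mark, period, exclamation mark, hyphen, semicolon, colon, and two double-quote characters); the function name `normalize_sentence` and the sample sentence are illustrative.

```python
import re

# Punctuation stripped before scoring. The ";" matters for Greek, where it
# serves as the question mark when encoded as U+003B.
CHARS_TO_IGNORE = re.compile(r'[,?.!\-;:"“]')


def normalize_sentence(sentence):
    """Remove ignored punctuation and lowercase the sentence."""
    return CHARS_TO_IGNORE.sub("", sentence).lower()


if __name__ == "__main__":
    print(normalize_sentence("Καλημέρα, τι κάνεις;"))  # prints: καλημέρα τι κάνεις
```

Note that accents are left untouched by this normalization, so an accent mismatch between prediction and reference still counts as a word error.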

Training

The Common Voice train and validation datasets were used for training.

A link to the training script has not been provided.
