# Wav2Vec2-Large-XLSR-53-greek
Fine-tuned facebook/wav2vec2-large-xlsr-53 on Greek using the Common Voice and CSS10 datasets. When using this model, make sure your speech input is sampled at 16 kHz.
## Metadata

| Property | Details |
|----------|---------|
| Language | el |
| Datasets | common_voice, CSS10 |
| Metrics | wer |
| Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
| License | apache-2.0 |
| Model Name | Greek XLSR Wav2Vec2 Large 53 - CV + CSS10 |
| Task | Speech Recognition (automatic-speech-recognition) |
| Dataset | Common Voice el |
| Metric (Test WER) | 20.89 |
## Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Greek, leveraging the Common Voice and CSS10 datasets. Remember to sample your speech input at 16 kHz when using this model.
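If your own recordings are not already at 16 kHz, resample them before passing them to the processor. A minimal sketch using `torchaudio` (the file name `example.wav` is a placeholder, not part of the model card):

```python
import torchaudio

# Load any local recording; "example.wav" is a hypothetical placeholder.
speech_array, sampling_rate = torchaudio.load("example.wav")

# Resample to the 16 kHz the model was fine-tuned on, if necessary.
if sampling_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    speech_array = resampler(speech_array)

speech = speech_array.squeeze().numpy()  # 1-D float array at 16 kHz
```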
## Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "el", split="test")

processor = Wav2Vec2Processor.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")
model = Wav2Vec2ForCTC.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")

# Common Voice clips are 48 kHz; the model expects 16 kHz input.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: read each audio file and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Evaluation

The following code demonstrates how to evaluate the model on the Greek test data of Common Voice:
```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "el", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")
model = Wav2Vec2ForCTC.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")
model.to("cuda")

# Punctuation to strip from the references before scoring.
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: normalize the transcript and resample the audio to 16 kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched greedy (argmax) CTC decoding on the GPU.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
**Test Result**: 20.89 %
## Training

Training used the `train` and `validation` splits of the Common Voice dataset, along with the CSS10 dataset added as an `extra` split. Because the CSS10 files differ in sampling rate and format, the `speech_file_to_array_fn` function was modified as follows:
```python
import soundfile as sf
import librosa

def speech_file_to_array_fn(batch):
    try:
        # Reuse a cached 16 kHz WAV copy if one was already written.
        speech_array, sampling_rate = sf.read(batch["path"] + ".wav")
    except Exception:
        # Otherwise load the original file, resample to 16 kHz, and cache it.
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000, res_type="zero_order_hold")
        sf.write(batch["path"] + ".wav", speech_array, sampling_rate, subtype="PCM_24")
    batch["speech"] = speech_array
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch["text"]
    return batch
```
This modification was suggested by Florian Zimmermeister.
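The card does not show how the CSS10 `extra` split is merged with Common Voice. Below is a hedged sketch of one way to do it with `datasets.concatenate_datasets`; the CSV file name, its columns, and the column cleanup are assumptions, not the author's actual setup:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical layout: CSS10 transcripts in a local CSV ("css10_el.csv") with
# "path" and "text" columns; adapt the loading to wherever your copy lives.
css10 = load_dataset("csv", data_files="css10_el.csv", split="train")
common_voice = load_dataset("common_voice", "el", split="train+validation")
common_voice = common_voice.rename_column("sentence", "text")

# Apply the modified speech_file_to_array_fn from above to both sources.
css10 = css10.map(speech_file_to_array_fn)
common_voice = common_voice.map(speech_file_to_array_fn)

# concatenate_datasets requires identical columns, so keep only the shared ones.
keep = ["speech", "sampling_rate", "target_text"]
css10 = css10.remove_columns([c for c in css10.column_names if c not in keep])
common_voice = common_voice.remove_columns(
    [c for c in common_voice.column_names if c not in keep]
)
train_dataset = concatenate_datasets([common_voice, css10])
```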
The training script can be found in `run_common_voice.py`, pending a PR. The only change was to the `speech_file_to_array_fn` function. The batch size was set to 32 (using `gradient_accumulation_steps`) on an OVH machine with a V100 GPU. The model was trained for 40 epochs: the first 20 epochs used the `train+validation` splits, and the `extra` split with CSS10 data was added at the 20th epoch.
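The exact training arguments are not given in the card. A minimal sketch of how an effective batch size of 32 can be reached with gradient accumulation on a single V100; the per-device size of 8, `fp16`, and the output directory are assumptions:

```python
from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
# = 8 * 4 = 32, matching the batch size reported above (values are assumed).
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-53-greek",  # hypothetical path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=40,
    fp16=True,
)
```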