wav2vec2-large-xlsr-53-greek Open-source Greek Speech Recognition Model

Wav2vec2 Large Xlsr 53 Greek

Developed by vasilis

A Greek speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz audio input.

Speech Recognition

Transformers

License: Apache-2.0

Tags: Greek speech recognition, XLSR fine-tuning, Multi-dataset training

Downloads 25

Release Time : 3/2/2022

Model Overview

This is an optimized automatic speech recognition (ASR) model for Greek, based on the Wav2Vec2 architecture, fine-tuned using the Common Voice and CSS10 Greek single-speaker datasets.

Model Features

Multi-dataset fine-tuning

Trained with both Common Voice and CSS10 Greek single-speaker datasets to improve recognition accuracy

Text normalization

Standardizes Greek special characters (e.g., converting ς to σ) for better recognition

No language model required

Can be used directly for speech recognition without additional language model support

Model Capabilities

Greek speech recognition

16kHz audio processing

Real-time speech-to-text

Use Cases

Transcription

Greek meeting minutes

Automatically transcribe Greek meeting recordings into text

Word Error Rate 19.00%, Character Error Rate 5.78% (measured on the Common Voice Greek test set)

Voice assistants

Speech recognition module for Greek voice assistant applications

Education

Language learning apps

Help learners practice Greek pronunciation and listening

🚀 Wav2Vec2-Large-XLSR-53-greek

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Greek, aiming to provide high-quality automatic speech recognition for Greek.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Greek, trained on the Common Voice and CSS10 Greek: Single Speaker Speech datasets. When using this model, make sure that your speech input is sampled at 16 kHz.
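
Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch using only Python's standard `wave` module. The helper name `needs_resampling` and the constant `TARGET_SAMPLE_RATE` are illustrative and not part of the model's API, and the check only works for uncompressed WAV files.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sampling rate the model expects, per the model card


def needs_resampling(wav_path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True if the WAV file's sampling rate differs from target_rate."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

Files that do need conversion can then be resampled, for instance with `torchaudio.transforms.Resample` as in the usage examples.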

✨ Features

Multi-dataset Training: Trained on both the Common Voice and CSS10 Greek: Single Speaker Speech datasets, which broadens the model's exposure to Greek speech.
Text Pre-processing: During training, the letter ς is normalized to σ. Removing accents from letters was also tried and improved WER, but it is not applied in the evaluation script shown under Advanced Usage.
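
The text pre-processing described above can be sketched roughly as follows. This is an illustrative helper, not the author's training code: the function name `normalize_greek` and the `strip_accents` flag are assumptions, and accent removal is left optional (off by default) and implemented here with Unicode decomposition rather than an explicit character table.

```python
import unicodedata


def normalize_greek(text: str, strip_accents: bool = False) -> str:
    """Lowercase Greek text, map final sigma to sigma, optionally drop accents."""
    # Final sigma (ς) and sigma (σ) sound the same, so both are mapped to σ.
    text = text.lower().replace("ς", "σ")
    if strip_accents:
        # Decompose accented letters into base letter + combining marks,
        # then drop the combining marks (tonos, dialytika).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", text)
    return text


print(normalize_greek("Καλός άνθρωπος"))
print(normalize_greek("Καλός άνθρωπος", strip_accents=True))
```

The first call only lowercases and replaces ς, while the second also removes the accents.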

📦 Installation

The usage examples below import torch, torchaudio, datasets, and transformers, so these Python packages need to be installed first.

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "el", split="test[:2%]")  # "el" is the ISO code for Greek

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-greek")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-greek")

# This example assumes 48 kHz source audio and resamples it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
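
The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: the most likely token is picked for every audio frame, consecutive repeats are merged, and the blank token is dropped. As a rough illustration of that collapsing step, here is a pure-Python sketch. The function name, the toy vocabulary, and the blank id `0` are hypothetical; the real processor uses the model's own vocabulary and handles this internally.

```python
from itertools import groupby
from typing import List


def ctc_greedy_collapse(frame_ids: List[int], blank_id: int = 0) -> List[int]:
    """Merge consecutive duplicate ids, then remove the blank id."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in collapsed if token_id != blank_id]


# Hypothetical three-letter vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {1: "ν", 2: "α", 3: "ι"}
frame_ids = [0, 1, 1, 0, 2, 2, 0, 0, 3, 3, 3, 0]
text = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # ναι
```

A blank between two identical ids keeps them as separate tokens, which is how CTC represents doubled letters.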

Advanced Usage

The model can be evaluated as follows on the Greek test data of Common Voice:

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "el", split="test")  # "el" is the ISO code for Greek
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-greek")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-greek")
model.to("cuda")

# Punctuation stripped from the reference sentences before scoring.
chars_to_ignore_regex = r'[,?.!\-;:"“]'

normalize_greek_letters = {"ς": "σ"}
# normalize_greek_letters = {"ά": "α", "έ": "ε", "ί": "ι", 'ϊ': "ι", "ύ": "υ", "ς": "σ", "ΐ": "ι", 'ϋ': "υ", "ή": "η", "ώ": "ω", 'ό': "ο"}
remove_chars_greek = {"a": "", "h": "", "n": "", "g": "", "o": "", "v": "", "e": "", "r": "", "t": "", "«": "", "»": "", "m": "", '́': '', "·": "", "’": "", '´': ""}
replacements = {**normalize_greek_letters, **remove_chars_greek}

resampler = {
    48_000: torchaudio.transforms.Resample(48_000, 16_000),
    44100: torchaudio.transforms.Resample(44100, 16_000),
    32000: torchaudio.transforms.Resample(32000, 16_000)
}


# Preprocessing the datasets.
# Normalize the reference sentences and read the audio files as 16 kHz arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    for key, value in replacements.items():
        batch["sentence"] = batch["sentence"].replace(key, value)
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler[sampling_rate](speech_array).squeeze().numpy()
    return batch


test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and greedy-decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
# CER reuses the WER metric by treating every character as a separate "word".
print("CER: {:2f}".format(100 * wer.compute(predictions=[" ".join(list(entry)) for entry in result["pred_strings"]], references=[" ".join(list(entry)) for entry in result["sentence"]])))

Test Result (WER): 18.996669 %
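
To make the two metrics concrete, the sketch below computes word and character error rates with a plain edit distance. It is a simplified stand-in under stated assumptions, not the implementation behind `load_metric("wer")`: the names `edit_distance` and `error_rate` are illustrative, and whitespace is ignored when counting characters, mirroring the space-joined-characters trick in the evaluation script.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(references: List[str], predictions: List[str], by_char: bool = False) -> float:
    """Total edits divided by total reference length, over words or characters."""
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, predictions):
        if by_char:
            ref_units = [c for c in ref if not c.isspace()]
            hyp_units = [c for c in hyp if not c.isspace()]
        else:
            ref_units, hyp_units = ref.split(), hyp.split()
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return total_edits / total_length


refs = ["καλημέρα σας"]
preds = ["καλημέρα σασ"]
print("WER:", error_rate(refs, preds))                 # one of two words is wrong
print("CER:", error_rate(refs, preds, by_char=True))   # one of eleven characters is wrong
```

The example also shows why CER is usually much lower than WER: a single wrong character makes the whole word count as an error.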

📚 Documentation

Training

The Common Voice train dataset was used for training, together with all of CSS10 Greek with normalized transcripts.

During text pre-processing, the letter ς is normalized to σ. Both letters sound the same, and ς is only used as the final character of a word, so the change can easily be mapped back to proper spelling in the output.

Removing all accents from letters was also tried and improved WER significantly: the model easily reached 17% WER before it had converged. However, the text processing needed to fix the resulting transcriptions would be more complicated, although a language model should be able to fix them easily.

Another approach that could be tried is to map ι, η, etc. to a single character, since they all sound the same, and similarly for ο and ω. This should help the acoustic model considerably, because all of these characters map to the same sound, but it would require further text normalization.
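
Since the model emits σ everywhere, mapping the output back to standard spelling means restoring ς at the end of words. A minimal sketch of that post-processing step is shown below. The function name `restore_final_sigma` is an assumption, and the rule is deliberately simple: any σ not followed by another letter, digit, or underscore becomes ς, so a standalone σ would also be converted.

```python
import re


def restore_final_sigma(text: str) -> str:
    """Replace σ with ς wherever it is the last character of a word."""
    # (?!\w) matches when the next character is not a word character,
    # which covers spaces, punctuation, and the end of the string.
    return re.sub(r"σ(?!\w)", "ς", text)


print(restore_final_sigma("σήμερα ο καιρόσ είναι καλόσ."))
```

The σ at the start of "σήμερα" is left untouched, because it is followed by another letter.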

🔧 Technical Details

Property	Details
Model Type	Fine-tuned facebook/wav2vec2-large-xlsr-53 for Greek speech recognition
Training Data	Common Voice train dataset and all of `CSS10 Greek` with normalized transcripts
Metrics	Test WER: 18.996669%, Test CER: 5.781874%

📄 License

This model is licensed under the Apache-2.0 license.
