# Wav2Vec2-Large-XLSR-53-greek
Fine-tuned facebook/wav2vec2-large-xlsr-53 on Greek using the Common Voice and CSS10 datasets. When using this model, make sure your speech input is sampled at 16 kHz.
## Metadata

| Property | Details |
|----------|---------|
| Language | el |
| Datasets | common_voice, CSS10 |
| Metrics | wer |
| Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
| License | apache-2.0 |
| Model Name | Greek XLSR Wav2Vec2 Large 53 - CV + CSS10 |
| Task | Speech Recognition (automatic-speech-recognition) |
| Dataset | Common Voice el |
| Metric (Test WER) | 20.89 |
## Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Greek, leveraging the Common Voice and CSS10 datasets. Remember to sample your speech input at 16 kHz when using this model.
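If your own recordings are not already at 16 kHz, resample them before passing them to the processor. A minimal sketch using `torchaudio` (the file name `example.wav` is a placeholder, not part of the model card):

```python
import torchaudio

# Load any local recording; "example.wav" is a hypothetical placeholder.
speech_array, sampling_rate = torchaudio.load("example.wav")

# Resample to the 16 kHz the model was fine-tuned on, if necessary.
if sampling_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    speech_array = resampler(speech_array)

speech = speech_array.squeeze().numpy()  # 1-D float array at 16 kHz
```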
## Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "el", split="test")

processor = Wav2Vec2Processor.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")
model = Wav2Vec2ForCTC.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")

# Common Voice clips are 48 kHz; the model expects 16 kHz input.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: read each audio file and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Evaluation

The following code demonstrates how to evaluate the model on the Greek test data of Common Voice:
```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "el", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")
model = Wav2Vec2ForCTC.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")
model.to("cuda")

# Punctuation to strip from the references before scoring.
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: normalize the transcript and resample the audio to 16 kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched greedy (argmax) CTC decoding on the GPU.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
**Test Result**: 20.89 %
## Training

Training used the `train` and `validation` splits of the Common Voice dataset, along with the CSS10 dataset added as an `extra` split. Because the CSS10 files differ in sampling rate and format, the `speech_file_to_array_fn` function was modified as follows:
```python
import soundfile as sf
import librosa

def speech_file_to_array_fn(batch):
    try:
        # Reuse a cached 16 kHz WAV copy if one was already written.
        speech_array, sampling_rate = sf.read(batch["path"] + ".wav")
    except Exception:
        # Otherwise load the original file, resample to 16 kHz, and cache it.
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000, res_type="zero_order_hold")
        sf.write(batch["path"] + ".wav", speech_array, sampling_rate, subtype="PCM_24")
    batch["speech"] = speech_array
    batch["sampling_rate"] = sampling_rate
    batch["target_text"] = batch["text"]
    return batch
```
This modification was suggested by Florian Zimmermeister.
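The card does not show how the CSS10 `extra` split is merged with Common Voice. Below is a hedged sketch of one way to do it with `datasets.concatenate_datasets`; the CSV file name, its columns, and the column cleanup are assumptions, not the author's actual setup:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical layout: CSS10 transcripts in a local CSV ("css10_el.csv") with
# "path" and "text" columns; adapt the loading to wherever your copy lives.
css10 = load_dataset("csv", data_files="css10_el.csv", split="train")
common_voice = load_dataset("common_voice", "el", split="train+validation")
common_voice = common_voice.rename_column("sentence", "text")

# Apply the modified speech_file_to_array_fn from above to both sources.
css10 = css10.map(speech_file_to_array_fn)
common_voice = common_voice.map(speech_file_to_array_fn)

# concatenate_datasets requires identical columns, so keep only the shared ones.
keep = ["speech", "sampling_rate", "target_text"]
css10 = css10.remove_columns([c for c in css10.column_names if c not in keep])
common_voice = common_voice.remove_columns(
    [c for c in common_voice.column_names if c not in keep]
)
train_dataset = concatenate_datasets([common_voice, css10])
```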
The training script can be found in `run_common_voice.py`, pending a PR. The only change was to the `speech_file_to_array_fn` function. The batch size was set to 32 (using `gradient_accumulation_steps`) on an OVH machine with a V100 GPU. The model was trained for 40 epochs: the first 20 epochs used the `train+validation` splits, and the `extra` split with CSS10 data was added at the 20th epoch.
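The exact training arguments are not given in the card. A minimal sketch of how an effective batch size of 32 can be reached with gradient accumulation on a single V100; the per-device size of 8, `fp16`, and the output directory are assumptions:

```python
from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
# = 8 * 4 = 32, matching the batch size reported above (values are assumed).
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-53-greek",  # hypothetical path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=40,
    fp16=True,
)
```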