wav2vec2-large-xlsr-53-romanian Open-source Speech Recognition Model

Wav2vec2 Large Xlsr 53 Romanian

Developed by anton-l

An automatic speech recognition (ASR) model fine-tuned on the Common Voice Romanian dataset, based on facebook/wav2vec2-large-xlsr-53

Speech Recognition | Other | Open Source | License: Apache-2.0 | Tags: Romanian speech recognition, XLSR fine-tuning, Low word error rate

Downloads 36.85k

Release Time: 3/2/2022

Model Overview

This is a speech recognition model optimized for Romanian, capable of converting Romanian speech into text.

Model Features

High-precision Romanian recognition

Achieves a 24.84% word error rate (WER) on the Common Voice test set

Based on XLSR architecture

Utilizes the powerful feature extraction capabilities of Cross-Lingual Speech Representation (XLSR)

No language model required

Can be used directly without additional language model support

Model Capabilities

Romanian speech recognition

16kHz audio processing

Batch speech-to-text conversion

Use Cases

Speech transcription

Romanian speech transcription

Convert Romanian speech content into text format

Reported result: 24.84% word error rate (WER) on the Common Voice test set

Voice assistants

Romanian voice command recognition

Used for voice assistants and smart devices supporting Romanian

🚀 Wav2Vec2-Large-XLSR-53-Romanian

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Romanian using the Common Voice dataset. It is designed for automatic speech recognition. Speech input must be sampled at 16 kHz.

📋 Information Table

Property	Details
Language	Romanian
Datasets	common_voice
Metrics	wer
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0
Model Name	Romanian XLSR Wav2Vec2 Large 53 by Anton Lozhkov
Task	Speech Recognition (automatic-speech-recognition)
Dataset Name	Common Voice ro (common_voice, args: ro)
Test WER	24.84

🚀 Quick Start

The model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Romanian using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz.
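Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch, assuming the audio is an uncompressed WAV file; the helper names wav_sample_rate and needs_resampling are hypothetical and not part of the model or of transformers.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate
```

A file for which needs_resampling returns True should be resampled first, for example with torchaudio.transforms.Resample as in the usage example.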

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ro", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-romanian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-romanian")

# Preprocessing the datasets.
# Read each audio file as an array and resample it to the 16 kHz the model expects.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    # Build the resampler from the file's actual rate instead of assuming 48 kHz
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
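The argmax followed by batch_decode in the example above amounts to greedy CTC decoding, which is why no language model is needed. The following is a simplified sketch of that idea, not the tokenizer's actual implementation; the toy vocabulary and the choice of 0 as the blank id are assumptions made for illustration and do not reflect the model's real vocabulary.

```python
def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse consecutive repeated ids, drop blanks, and join the tokens."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens)


# Hypothetical toy vocabulary: id 0 plays the role of the CTC blank.
toy_vocab = {0: "<blank>", 1: "d", 2: "a"}

# One id per audio frame, as produced by an argmax over the logits.
frames = [1, 1, 0, 2, 2, 0, 1, 2]
print(greedy_ctc_decode(frames, toy_vocab))  # prints "dada"
```

A blank between two identical ids keeps both letters, so [2, 0, 2] decodes to "aa" while [2, 2] decodes to "a".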

📚 Documentation

Evaluation

The model can be evaluated as follows on the Romanian test data of Common Voice.

import torch
import torchaudio
import urllib.request
import tarfile
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Download the raw data instead of using HF datasets to save disk space 
data_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/ro.tar.gz"
filestream = urllib.request.urlopen(data_url)
data_file = tarfile.open(fileobj=filestream, mode="r|gz")
data_file.extractall()

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-romanian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-romanian")
model.to("cuda")

cv_test = pd.read_csv("cv-corpus-6.1-2020-12-11/ro/test.tsv", sep='\t')
clips_path = "cv-corpus-6.1-2020-12-11/ro/clips/"

def clean_sentence(sent):
    sent = sent.lower()
    # replace non-alpha characters with space
    sent = "".join(ch if ch.isalpha() else " " for ch in sent)
    # remove repeated spaces
    sent = " ".join(sent.split())
    return sent

targets = []
preds = []

for i, row in tqdm(cv_test.iterrows(), total=cv_test.shape[0]):
    row["sentence"] = clean_sentence(row["sentence"])
    speech_array, sampling_rate = torchaudio.load(clips_path + row["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    row["speech"] = resampler(speech_array).squeeze().numpy()

    inputs = processor(row["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)

    targets.append(row["sentence"])
    preds.append(processor.batch_decode(pred_ids)[0])

print("WER: {:.2f}".format(100 * wer.compute(predictions=preds, references=targets)))

Test Result: 24.84 %
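To make the reported number concrete, here is a minimal pure-Python sketch of word error rate under its standard definition: word-level edit distance summed over all sentences, divided by the total number of reference words. The function names and the two example sentences are hypothetical, and the wer metric used in the evaluation script may differ in implementation details.

```python
def word_edit_distance(reference_words, hypothesis_words):
    """Levenshtein distance between two word lists (substitute/delete/insert)."""
    previous_row = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        reference_words = reference.split()
        total_errors += word_edit_distance(reference_words, prediction.split())
        total_words += len(reference_words)
    return total_errors / total_words


# Made-up example: one wrong word out of four reference words.
references = ["bună ziua tuturor", "mulțumesc"]
predictions = ["bună ziua tuturor", "multumesc"]
print("WER: {:.2f}".format(100 * corpus_wer(references, predictions)))  # prints "WER: 25.00"
```

Read this way, a WER of 24.84% means roughly one word-level error for every four reference words.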

Training

The Common Voice train and validation datasets were used for training.

📄 License

This model is licensed under the Apache-2.0 license.
