The wav2vec2-large-xlsr-53-romanian open-source speech recognition model enables free conversion of Romanian speech to text.

Wav2vec2 Large Xlsr 53 Romanian

Developed by gmihaila

An automatic speech recognition model based on facebook/wav2vec2-large-xlsr-53 and fine-tuned on the Common Voice Romanian dataset

Speech Recognition Other Open Source License: Apache-2.0 #Romanian speech recognition #XLSR fine-tuning #Low word error rate

Downloads 392

Release Time : 3/2/2022

Model Overview

This model is designed for automatic speech recognition in Romanian. It expects audio input sampled at 16 kHz and can be used directly, without a language model.

Model Features

Specialized for Romanian

Speech recognition model optimized specifically for Romanian

No language model required

Can be used directly without additional language model support

16kHz sampling rate support

Supports standard 16kHz sampled audio input

Based on XLSR architecture

Built on the facebook/wav2vec2-large-xlsr-53 base model

Model Capabilities

Romanian speech recognition

Automatic speech-to-text

Speech content analysis

Use Cases

Speech transcription

Romanian speech transcription

Convert Romanian speech content into text

Word error rate 28.4%

Voice assistants

Romanian voice command recognition

Used for command recognition in Romanian voice assistant systems

🚀 Wav2Vec2-Large-XLSR-53-Romanian

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Romanian, trained on the Common Voice dataset. When using this model, ensure that your speech input is sampled at 16 kHz.

🚀 Quick Start

Load the processor and the model from gmihaila/wav2vec2-large-xlsr-53-romanian, as shown under Usage Examples, and make sure the speech input is sampled at 16 kHz before passing it to the model.
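
Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module, so it applies to uncompressed WAV files only (Common Voice clips are distributed as MP3 and go through `torchaudio` in the examples below). The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model card.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the model card


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, resample the audio first, for example with `torchaudio.transforms.Resample` as the usage examples do.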

✨ Features

Audio Processing: Specialized for automatic speech recognition in Romanian.
Fine-Tuned: Based on the large-scale XLSR Wav2Vec2 model, fine-tuned for better performance in Romanian speech recognition.

📦 Installation

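The original model card gives no installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so these packages must be available in your environment.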
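
The usage examples import `torch`, `torchaudio`, `datasets`, and `transformers`. As a quick pre-flight check, the sketch below reports which of these cannot be found in the current environment. It uses only the standard library, and `missing_packages` is a hypothetical helper name chosen for illustration.

```python
from importlib.util import find_spec

# Top-level packages imported by the usage examples in this card.
REQUIRED = ["torch", "torchaudio", "datasets", "transformers"]


def missing_packages(names=REQUIRED):
    """Return the packages from `names` that cannot be located."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_packages()` returns an empty list when everything is installed; any names it returns still need to be installed with your package manager.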

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ro", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("gmihaila/wav2vec2-large-xlsr-53-romanian")
model = Wav2Vec2ForCTC.from_pretrained("gmihaila/wav2vec2-large-xlsr-53-romanian")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read each audio file into an array and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
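
The `torch.argmax` and `processor.batch_decode` steps above amount to greedy CTC decoding, which is why no language model is needed: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The vocabulary is a toy one invented for this example (the real model ships its own), and it assumes the common Wav2Vec2 convention of a padding token acting as the CTC blank and `|` as the word delimiter.

```python
from itertools import groupby

# Toy vocabulary for illustration only; NOT the model's actual vocabulary.
BLANK_ID = 0
VOCAB = {0: "<pad>", 1: "|", 2: "d", 3: "a", 4: "n", 5: "u"}


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame ids, drop blanks, and map ids to text."""
    # Step 1: merge runs of identical consecutive ids into one id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token, then look up the characters.
    chars = [vocab[i] for i in collapsed if i != blank_id]
    # Step 3: turn the word delimiter into spaces.
    return "".join(chars).replace("|", " ").strip()


# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would produce.
frames = [2, 2, 0, 3, 3, 3, 1, 1, 4, 0, 5, 5]
print(greedy_ctc_decode(frames))  # prints "da nu"
```

Note that a blank between two identical ids keeps them separate, so `[3, 0, 3]` decodes to a doubled letter rather than a single one.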

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "ro", split="test")
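# Note: load_metric is deprecated in recent datasets releases;
# the separate evaluate library offers evaluate.load("wer") instead.
wer = load_metric("wer")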

processor = Wav2Vec2Processor.from_pretrained("gmihaila/wav2vec2-large-xlsr-53-romanian")
model = Wav2Vec2ForCTC.from_pretrained("gmihaila/wav2vec2-large-xlsr-53-romanian")
model.to("cuda")

# Punctuation removed from the reference sentences before scoring.
chars_to_ignore_regex = r'[,?.!\-;:"“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Normalize each reference sentence, then read the audio file into an array
# and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on each batch and decode the predictions with greedy (argmax) decoding.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
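
The evaluation script normalizes each reference sentence before scoring: it strips punctuation with `chars_to_ignore_regex` and lowercases the text, so that punctuation and casing differences are not counted as errors. The sketch below shows that step in isolation. The character set is my reading of the card's regex and should be treated as an assumption, and `normalize` is a hypothetical helper name.

```python
import re

# Assumed character set: comma, question mark, period, exclamation mark,
# hyphen, semicolon, colon, straight and left curly double quotes.
CHARS_TO_IGNORE = r'[,?.!\-;:"“]'


def normalize(sentence):
    """Remove the ignored punctuation and lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize("Bună ziua, ce faci?"))  # prints "bună ziua ce faci"
```

Romanian diacritics such as `ă` are left untouched, since only the listed punctuation is removed.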

📚 Documentation

Evaluation

The model can be evaluated on the Romanian test data of Common Voice. The Word Error Rate (WER) on the test set is 28.43%.

Training

The Common Voice train and validation datasets were used for training. The training script can be found here.

📄 License

This project is licensed under the Apache-2.0 license.

📦 Model Information

Property	Details
Model Type	Fine-tuned XLSR Wav2Vec2 for Romanian
Training Data	Common Voice (train and validation datasets)
Base Model	facebook/wav2vec2-large-xlsr-53
Test WER	28.43%
