wav2vec2-large-xlsr-rm-sursilv Open-source Speech Recognition Model - Free Recognition of the Sursilvan Dialect of Romansh

Wav2vec2 Large Xlsr Rm Sursilv

Developed by gchhablani

This is an automatic speech recognition model fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, specifically designed for recognizing the Sursilvan dialect of Romansh.

Speech Recognition · Open Source · License: Apache-2.0 · Tags: Sursilvan dialect recognition, low-resource speech recognition, Wav2Vec2 fine-tuning

Downloads 27

Release Time: 3/2/2022

Model Overview

The model is fine-tuned on the Romansh Sursilvan data from the Common Voice dataset. It is intended for speech recognition tasks and expects audio input sampled at 16 kHz.

Model Features

High Accuracy Speech Recognition

Achieves a 25.16% Word Error Rate (WER) on the Common Voice test set.

Low-Resource Language Support

Specifically optimized for the Sursilvan dialect of Romansh, suitable for low-resource language scenarios.

No Language Model Required

Can be used directly without additional language model support.

Model Capabilities

Speech Recognition

Audio to Text

Romansh Language Processing

Use Cases

Speech Transcription

Dialect Speech Transcription

Convert speech in the Romansh Sursilvan dialect to text

25.16% Word Error Rate

Voice Assistants

Dialect Voice Assistant

Provide voice interaction capabilities for Romansh speakers

🚀 Wav2Vec2-Large-XLSR-53-Romansh-Sursilvan

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Romansh Sursilvan using the Common Voice dataset. It's designed for automatic speech recognition tasks.

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
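
A quick way to catch a sampling-rate mismatch before inference is to read the rate from the file header. The sketch below is only an illustration using Python's standard `wave` module; it assumes uncompressed PCM WAV input, and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of the model or of `transformers`.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate
```

Files for which `needs_resampling` returns True (for example 48 kHz Common Voice clips) should be resampled to 16 kHz first, as the usage examples do with `torchaudio.transforms.Resample`.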

✨ Features

Fine-tuned on Romansh Sursilvan with the Common Voice dataset.
Suitable for automatic speech recognition tasks.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "rm-sursilv", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-rm-sursilv")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-rm-sursilv")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
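
The `torch.argmax` followed by `processor.batch_decode` in the example above amounts to greedy CTC decoding, which is why no separate language model is involved. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The toy vocabulary, the frame ids, and the function name `ctc_greedy_decode` are hypothetical; the sketch assumes the blank symbol has id 0 and that `|` marks word boundaries.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join the remaining tokens."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when the id changes and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary and per-frame argmax ids (assumptions for illustration).
vocab = {0: "<pad>", 1: "|", 2: "b", 3: "u", 4: "n", 5: "a", 6: "s", 7: "e", 8: "r"}
frames = [2, 2, 3, 0, 4, 4, 5, 1, 1, 6, 0, 7, 7, 8, 5]

print(ctc_greedy_decode(frames, vocab))  # prints "buna sera"
```

A blank between two identical ids keeps both letters (`[5, 0, 5]` decodes to `aa`), while adjacent repeats without a blank collapse into one (`[5, 5]` decodes to `a`).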

Advanced Usage

The following code shows how to evaluate the model on the Romansh Sursilvan test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "rm-sursilv", split="test")

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-rm-sursilv")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-rm-sursilv")
model.to("cuda")

chars_to_ignore_regex = '[\\,\\?\\.\\!\\-\\;\\:\\\"\\“\\%\\‘\\”\\�\\…\\«\\»\\–]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We strip punctuation, lowercase the reference text, and read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and greedily decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 25.16 %
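
To make the reported figure concrete, here is a small pure-Python sketch of how a corpus-level Word Error Rate can be computed: total word-level edits divided by total reference words. It is an illustration only, not the code behind the `wer` metric used above. The function names and the shortened punctuation set are assumptions; the evaluation script strips a longer list of characters.

```python
import re

# Simplified punctuation set for illustration (an assumption, shorter than the script's).
PUNCTUATION = re.compile(r'[,?.!\-;:"]')


def normalize(text):
    """Strip punctuation and lowercase, mirroring the preprocessing idea above."""
    return PUNCTUATION.sub("", text).lower()


def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1]


def corpus_wer(references, hypotheses):
    """Total word edits divided by total reference words."""
    edits = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref).split()
        edits += word_edit_distance(ref_words, normalize(hyp).split())
        words += len(ref_words)
    return edits / words


# One substituted word out of four reference words.
print(corpus_wer(["Buna sera, co vai?"], ["buna sera co va"]))  # prints 0.25
```

Multiplying the result by 100 gives a percentage, so a WER of 25.16 % corresponds to roughly one word-level edit for every four reference words.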

📚 Documentation

Training

The Common Voice train and validation datasets were used for training. The code can be found here.

📄 License

This model is licensed under the Apache-2.0 license.

📦 Model Information

Property	Details
Model Type	Wav2Vec2 Large 53 Romansh Sursilvan
Training Data	Common Voice rm-sursilv
Metrics	WER (Word Error Rate)
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
Model Index	Name: Wav2Vec2 Large 53 Romansh Sursilvan by Gunjan Chhablani, Results: Speech Recognition on Common Voice rm-sursilv with Test WER of 25.16
