Open-source Automatic Speech Recognition Model wav2vec2-large-xlsr-mongolian - Free Deployment to Boost Mongolian Speech Recognition

Wav2vec2 Large Xlsr Mongolian

Developed by manandey

An automatic speech recognition model based on facebook/wav2vec2-large-xlsr-53 and fine-tuned on the Mongolian Common Voice dataset.

Speech Recognition · Other · Open Source License: Apache-2.0 · #Mongolian speech recognition #XLSR fine-tuning #Low-resource language processing

Downloads 4,719

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) model optimized for Mongolian, based on the Wav2Vec2 architecture, suitable for converting Mongolian speech to text.

Model Features

Mongolian optimization

Fine-tuned specifically on Mongolian speech to improve transcription accuracy for the language.

XLSR pre-training based

Fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, leveraging cross-lingual speech representation learning.

16kHz sampling rate support

Expects speech input sampled at 16kHz; audio recorded at any other rate should be resampled before it is passed to the model.

Model Capabilities

Mongolian speech recognition

Speech-to-text

Use Cases

Speech transcription

Mongolian speech transcription

Convert Mongolian speech content into editable text format

Achieved a WER of 43.08% on the Common Voice Mongolian test set

Voice assistants

Mongolian voice command recognition

Used for developing Mongolian-language voice assistants and voice control applications

🚀 Wav2Vec2-Large-XLSR-53-Mongolian

This model is a version of facebook/wav2vec2-large-xlsr-53 fine-tuned for Mongolian on the Common Voice dataset. Ensure your speech input is sampled at 16kHz when using this model.
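Before running inference it can help to confirm what rate a recording was actually saved at. The following is a minimal sketch, assuming the input is an uncompressed PCM WAV file (Common Voice clips are distributed in a different format and are resampled in the usage example instead). The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model or of any library.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, the audio should be converted to 16kHz (for example with a resampling transform like the one used in the usage example) before it reaches the processor.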

📦 Information Table

Property	Details
Language	Mongolian
Datasets	Common Voice
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0
Model Name	XLSR Wav2Vec2 Mongolian by Manan Dey
Task	Speech Recognition (automatic-speech-recognition)
Dataset	Common Voice mn
Test WER	43.08

🚀 Quick Start

This model is a version of facebook/wav2vec2-large-xlsr-53 fine-tuned for Mongolian on the Common Voice dataset. When using this model, ensure that your speech input is sampled at 16kHz.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "mn", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("manandey/wav2vec2-large-xlsr-mongolian")
model = Wav2Vec2ForCTC.from_pretrained("manandey/wav2vec2-large-xlsr-mongolian")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read each audio file into an array and resample it from 48kHz to 16kHz.
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
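The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, drop the blank token, and turn the word delimiter into spaces. The sketch below illustrates that idea in plain Python. The vocabulary, the token IDs, and the function name `ctc_greedy_decode` are hypothetical; the model's real vocabulary is different, and the sketch only assumes the usual Wav2Vec2 convention of a padding token serving as the CTC blank and `|` marking word boundaries.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse a per-frame ID sequence into text using the CTC rules."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Step 1: merge runs of the same ID into a single occurrence.
        if token_id != previous:
            collapsed.append(token_id)
        previous = token_id
    # Step 2: remove blanks, which only exist to separate repeated letters.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # Step 3: map the word delimiter to a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, NOT the model's actual one.
toy_vocab = {0: "<pad>", 1: "|", 2: "с", 3: "а", 4: "й", 5: "н", 6: "у"}

# Per-frame argmax IDs; the blank (0) between the two 6s keeps both letters.
toy_frames = [2, 2, 0, 3, 3, 4, 0, 5, 1, 1, 6, 0, 6]

print(ctc_greedy_decode(toy_frames, toy_vocab))  # сайн уу
```

Greedy decoding uses no language model, so each frame is decided independently of the surrounding words.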

📚 Documentation

Evaluation

The model can be evaluated as follows on the Mongolian test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "mn", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("manandey/wav2vec2-large-xlsr-mongolian")
model = Wav2Vec2ForCTC.from_pretrained("manandey/wav2vec2-large-xlsr-mongolian")
model.to("cuda")

# Raw string so the backslashes reach the regex engine unchanged.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\’\–\(\)]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Strip punctuation from the reference text, lowercase it,
# and read each audio file into an array resampled to 16kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predicted token IDs to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 43.08%
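For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that computation, including the same kind of punctuation stripping and lowercasing applied to the references above. It is an illustration only, not the implementation behind the `wer` metric; the function names are hypothetical, the ignore pattern is a simplified subset of the one in the evaluation script, and the sample sentences are made up.

```python
import re

# Simplified subset of the punctuation pattern used in the evaluation script.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%‘”’–()]'


def normalize(sentence):
    """Strip ignored punctuation and lowercase, as done for the references."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


def word_edit_distance(reference_words, hypothesis_words):
    """Levenshtein distance over words, computed one row at a time."""
    previous_row = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        reference_words = reference.split()
        total_errors += word_edit_distance(reference_words, prediction.split())
        total_words += len(reference_words)
    return total_errors / total_words


# Made-up example sentences, not taken from Common Voice.
references = [normalize("Сайн уу, Дэлхий!"), normalize("Баярлалаа.")]
predictions = ["сайн у дэлхий", "баярлалаа"]
print("WER: {:.2f}".format(100 * corpus_wer(references, predictions)))  # WER: 25.00
```

Here one of four reference words is wrong, giving 25%. On the same scale, the reported 43.08% means roughly 43 word errors per 100 reference words.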

Training

The Common Voice train and validation datasets were used for training.

📄 License

This model is licensed under the Apache-2.0 license.
