The open-source model wav2vec2-large-xlsr-hu - Achieve free automatic speech recognition for Hungarian

Wav2vec2 Large Xlsr Hu

Developed by gchhablani

This is a Hungarian automatic speech recognition (ASR) model fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, trained using the Common Voice dataset.

Speech Recognition | Other | Open Source License: Apache-2.0 | #Hungarian speech recognition #XLSR fine-tuned model #Low-resource language processing

Downloads 25

Release Time : 3/2/2022

Model Overview

This model is specifically designed for Hungarian speech recognition tasks, capable of converting Hungarian speech into text.

Model Features

Hungarian-specific

Speech recognition model optimized specifically for Hungarian

Based on XLSR-53 architecture

Fine-tuned using the powerful wav2vec2-large-xlsr-53 base model

16kHz sampling rate support

Supports processing of speech input at 16kHz sampling rate

Model Capabilities

Hungarian speech recognition

Speech-to-text

Use Cases

Speech transcription

Hungarian speech transcription

Convert Hungarian speech content into text

Word Error Rate 46.75%

Voice assistants

Hungarian voice command recognition

Used for voice command recognition in Hungarian voice assistants or control systems

🚀 Wav2Vec2-Large-XLSR-53-Hungarian

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Hungarian using the Common Voice dataset. It performs Hungarian speech recognition, converting speech audio to text.

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
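
As a minimal sketch of how the 16 kHz requirement could be checked before inference, the helper below reads the sampling rate from a WAV header using Python's standard `wave` module. The function name `needs_resampling` and the constant `TARGET_RATE` are hypothetical, and the sketch assumes uncompressed PCM WAV input; other formats such as MP3 would need a different reader.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def needs_resampling(wav_path, target_rate=TARGET_RATE):
    """Return True if the PCM WAV file at wav_path is not sampled at target_rate.

    Hypothetical helper for illustration; it is not part of the model card.
    """
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

If the helper returns True, the audio should be resampled to 16 kHz before it is passed to the processor, for example with a `torchaudio.transforms.Resample` transform as in the usage examples.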

✨ Features

Fine-Tuned: Based on the large-scale pre-trained model facebook/wav2vec2-large-xlsr-53 and fine-tuned on the Hungarian language.
Data Source: Trained on the Common Voice dataset, which provides a rich source of real-world speech data.

📦 Installation

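The original model card gives no installation steps. The usage examples below import torch, torchaudio, datasets, and transformers, so those packages need to be installed first.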

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "hu", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-hu")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-hu")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read each audio file as an array and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
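
The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that idea in plain Python. The function `ctc_greedy_decode`, the toy vocabulary, and the choice of blank index 0 are assumptions made for illustration; they do not reproduce the model's real vocabulary or the processor's implementation.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated per-frame predictions, then drop CTC blank tokens."""
    # Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Remove blanks and map the remaining ids to characters.
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; the real model's vocabulary differs.
toy_vocab = {0: "", 1: "s", 2: "z", 3: "i", 4: "a"}

# Per-frame argmax ids, as would be produced by torch.argmax(logits, dim=-1).
frames = [1, 1, 0, 2, 3, 3, 0, 4, 4]
print(ctc_greedy_decode(frames, toy_vocab))
```

The blank token is what lets the decoder emit a doubled letter: the frame sequence `[1, 0, 1]` decodes to two characters, while `[1, 1]` decodes to one.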

Advanced Usage

The model can be evaluated as follows on the Hungarian test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "hu", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-hu")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-hu")
model.to("cuda")

# Raw string so the backslash escapes reach the regex engine unchanged.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\–\…]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Strip ignored punctuation, lowercase the reference text, and read each audio file as a 16 kHz array.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run inference on each batch and decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

        pred_ids = torch.argmax(logits, dim=-1)
        batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 46.75 %
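
To make the reported number easier to interpret, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance between each reference and prediction, summed over the test set and divided by the total number of reference words. The functions `edit_distance` and `word_error_rate` and the sample sentences are hypothetical; this is an illustration of the metric's definition, not the implementation behind `load_metric("wer")`.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing from hypothesis
                current[j - 1] + 1,      # extra word in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words."""
    errors = sum(
        edit_distance(ref.split(), hyp.split())
        for ref, hyp in zip(references, predictions)
    )
    total_words = sum(len(ref.split()) for ref in references)
    return errors / total_words


# Made-up sentences for illustration only; they are not Common Voice data.
references = ["jó reggelt kívánok", "köszönöm szépen"]
predictions = ["jó reggel kívánok", "köszönöm"]
print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))
```

Under this definition, a WER of 46.75% means that the number of word-level errors is a little under half the number of words in the reference transcripts.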

📚 Documentation

Training

The Common Voice train and validation datasets were used for training. The code can be found here. The notebook containing the code used for evaluation can be found here.

📄 License

This model is licensed under the Apache-2.0 license.

Additional Information

Property	Details
Model Type	Fine-tuned Wav2Vec2 Large 53 for Hungarian
Training Data	Common Voice (train and validation datasets)
Metrics	Word Error Rate (WER)
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
Model Name	Wav2Vec2 Large 53 Hungarian by Gunjan Chhablani
Task	Speech Recognition
Dataset for Results	Common Voice hu
Test WER	46.75%
