The open-source wav2vec2-large-xlsr-53-hungarian model - Free support for Hungarian Automatic Speech Recognition

Wav2vec2 Large Xlsr 53 Hungarian

Developed by anton-l

This is a Hungarian automatic speech recognition model fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, trained using the Common Voice dataset.

Category: Speech Recognition, Other

License: Apache-2.0 (open source)

Tags: #Hungarian speech recognition, #XLSR fine-tuning, #Low-resource speech processing

Downloads 17

Release Time : 3/2/2022

Model Overview

This model is designed for Hungarian automatic speech recognition: it converts Hungarian speech into text.

Model Features

Hungarian-specific

Speech recognition model optimized specifically for the Hungarian language

Based on XLSR-53

Fine-tuned from the powerful cross-lingual speech representation model wav2vec2-large-xlsr-53

16kHz sampling rate support

Expects speech input sampled at 16 kHz

Model Capabilities

Hungarian speech recognition

Speech-to-text

Use Cases

Speech transcription

Hungarian speech transcription

Convert Hungarian speech content into text

Word Error Rate: 42.26% on the Common Voice Hungarian test set

🚀 Wav2Vec2-Large-XLSR-53-Hungarian

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Hungarian using the Common Voice dataset. Ensure your speech input is sampled at 16 kHz when using it.

📋 Information Table

Property	Details
Language	Hungarian
Datasets	common_voice
Metrics	wer
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0
Model Name	Hungarian XLSR Wav2Vec2 Large 53 by Anton Lozhkov
Task	Speech Recognition (automatic-speech-recognition)
Dataset	Common Voice hu (common_voice, args: hu)
Test WER	42.26%

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Hungarian, trained on the Common Voice dataset. When using this model, make sure your speech input is sampled at 16 kHz.
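
One way to check the 16 kHz requirement before running the model is to read the sampling rate from the file header. The sketch below is a minimal illustration using only Python's standard `wave` module, so it applies to WAV files only (Common Voice clips are MP3 and are loaded with torchaudio in the examples on this page). The helper names `wav_sample_rate`, `needs_resampling`, and `write_silent_wav` are hypothetical and not part of the model's API.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True if the file must be resampled before it is passed to the processor."""
    return wav_sample_rate(path) != target_rate


def write_silent_wav(path, sample_rate, seconds=0.1):
    """Hypothetical helper: write a short mono 16-bit silent clip for the demo."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))


# Demo: create one 48 kHz clip and one 16 kHz clip, then check each.
with tempfile.TemporaryDirectory() as tmp_dir:
    for rate in (48_000, 16_000):
        clip = os.path.join(tmp_dir, f"clip_{rate}.wav")
        write_silent_wav(clip, rate)
        print(rate, "Hz -> resample first:", needs_resampling(clip))
```

If the check reports that resampling is needed, the audio can be converted with `torchaudio.transforms.Resample`, as the usage examples on this page do.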

✨ Features

Fine-tuned on Hungarian with the Common Voice dataset.
Can be used for automatic speech recognition tasks.

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "hu", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-hungarian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-hungarian")

# The model expects 16 kHz input. This resampler assumes the Common Voice
# clips are sampled at 48 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset.
# Read each audio file into an array and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
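
Decoding "without a language model" in the example above is greedy CTC decoding: `torch.argmax` picks the most likely token for each audio frame, and `processor.batch_decode` then collapses repeated tokens and drops the blank token. The following is a minimal pure-Python sketch of that collapse step, not the library's implementation. The toy vocabulary, the token ids, and the function name `ctc_greedy_decode` are assumptions made for illustration; the real model has its own vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Turn per-frame argmax ids into text using the CTC collapse rule."""
    # 1. Merge runs of identical ids: [2, 2, 0, 3] -> [2, 0, 3]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop the blank token, which only separates repeated characters
    chars = [id_to_char[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Map the word delimiter to a space
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "|", 2: "s", 3: "z", 4: "i", 5: "a"}

# One id per audio frame, as torch.argmax(logits, dim=-1) would produce.
frames = [2, 2, 0, 3, 4, 4, 0, 5, 1, 1, 2, 3, 0, 4, 5, 5]
print(ctc_greedy_decode(frames, toy_vocab))  # prints "szia szia"
```

Because repeats are merged before blanks are removed, a doubled letter survives only when a blank separates its two occurrences: `[5, 0, 5]` decodes to "aa", while `[5, 5]` decodes to "a".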

Advanced Usage

The model can be evaluated as follows on the Hungarian test data of Common Voice.

import torch
import torchaudio
import urllib.request
import tarfile
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Download the raw data instead of using HF datasets to save disk space 
data_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/hu.tar.gz"
filestream = urllib.request.urlopen(data_url)
data_file = tarfile.open(fileobj=filestream, mode="r|gz")
data_file.extractall()

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-hungarian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-hungarian")
model.to("cuda")

cv_test = pd.read_csv("cv-corpus-6.1-2020-12-11/hu/test.tsv", sep='\t')
clips_path = "cv-corpus-6.1-2020-12-11/hu/clips/"

def clean_sentence(sent):
    sent = sent.lower()
    # replace non-alpha characters with space
    sent = "".join(ch if ch.isalpha() else " " for ch in sent)
    # remove repeated spaces
    sent = " ".join(sent.split())
    return sent

targets = []
preds = []

for i, row in tqdm(cv_test.iterrows(), total=cv_test.shape[0]):
    row["sentence"] = clean_sentence(row["sentence"])
    speech_array, sampling_rate = torchaudio.load(clips_path + row["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    row["speech"] = resampler(speech_array).squeeze().numpy()

    inputs = processor(row["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)

    targets.append(row["sentence"])
    preds.append(processor.batch_decode(pred_ids)[0])

# Report WER as a percentage with two decimal places
print("WER: {:.2f}".format(100 * wer.compute(predictions=preds, references=targets)))

Test Result: 42.26%
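
For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is a small pure-Python illustration of that definition, pooled over all sentences. It is not the code behind `load_metric("wer")`, and the function names and example sentences are hypothetical.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (row-by-row dynamic programming)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words, over all sentence pairs."""
    total_errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        total_errors += word_edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words


# Hypothetical example: one substituted word out of three reference words.
references = ["jó reggelt kívánok"]
predictions = ["jó reggel kívánok"]
print("WER: {:.2f}".format(100 * corpus_wer(references, predictions)))  # prints "WER: 33.33"
```

Read this way, a test WER of 42.26% means roughly 42 word errors per 100 reference words after the text normalization applied by `clean_sentence`.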

📚 Documentation

Training

The Common Voice train and validation datasets were used for training.

📄 License

This model is released under the Apache-2.0 license.
