wav2vec2-xls-r-parlaspeech-hr-lm: Open-Source Model for Croatian Automatic Speech Recognition

Developed by classla

A Croatian automatic speech recognition model, fine-tuned from facebook/wav2vec2-xls-r-300m on the ParlaSpeech-HR v1.0 dataset.

Speech Recognition

Transformers

Release Time: 4/28/2022

Model Overview

This model is an automatic speech recognition system for Croatian, based on the wav2vec2-xls-r architecture. It converts Croatian speech into text.

Model Features

High-precision recognition

Achieves a 3.63% character error rate (CER) and a 9.85% word error rate (WER) on the ParlaSpeech-HR test set

Parliamentary speech optimization

Specially trained and optimized for Croatian parliamentary speech data

Language model enhancement

Incorporates a language model (LM) for decoding to improve recognition accuracy

Model Capabilities

Croatian speech recognition

Parliamentary speech transcription

Real-time speech-to-text

Use Cases

Government agencies

Parliament meeting minutes

Automatically transcribe Croatian parliamentary meeting content

Improves meeting minute efficiency and reduces manual transcription costs

Speech transcription services

Croatian speech transcription

Provides speech-to-text services for Croatian speakers

Achieves about 90% word accuracy on the ParlaSpeech-HR test set (9.85% WER)

🚀 wav2vec2-xls-r-parlaspeech-hr-lm

This model is designed for Croatian automatic speech recognition (ASR). It is based on the facebook/wav2vec2-xls-r-300m model and fine-tuned on 300 hours of recordings and transcripts from ParlaSpeech-HR v1.0, an ASR dataset of Croatian parliamentary proceedings.

If you use this model, please cite the following paper:

@inproceedings{ljubevsic2022parlaspeech,
  title={ParlaSpeech-HR-a Freely Available ASR Dataset for Croatian Bootstrapped from the ParlaMint Corpus},
  author={Ljube{\v{s}}i{\'c}, Nikola and Kor{\v{z}}inek, Danijel and Rupnik, Peter and Jazbec, Ivo-Pavao},
  booktitle={Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference},
  pages={111--116},
  year={2022},
  url={http://www.lrec-conf.org/proceedings/lrec2022/workshops/ParlaCLARINIII/pdf/2022.parlaclariniii-1.16.pdf}
}

🚀 Quick Start

This model can be used for Croatian ASR tasks. You can refer to the following sections for more details on its metrics, usage, and training hyperparameters.

✨ Features

Based on the well-known facebook/wav2vec2-xls-r-300m model.
Fine-tuned with 300 hours of high-quality Croatian parliament data from ParlaSpeech-HR v1.0.

📚 Documentation

🔍 Metrics

Evaluation is performed on the dev and test portions of the ParlaSpeech-HR v1.0 dataset.

Property	Details
Model Type	Croatian ASR model based on wav2vec2-xls-r-300m
Training Data	300 hours of recordings and transcripts from ParlaSpeech-HR v1.0

Split	CER	WER
dev	0.0448	0.1129
test	0.0363	0.0985
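
To make the two metrics concrete, here is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length in characters or words. The sentence pair is hypothetical and not taken from the dataset, and the source does not state which tool or aggregation scheme produced the reported scores, so treat this as an illustration of the definitions only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical reference/hypothesis pair (assumption, for illustration only).
reference = "velik broj poslovnih subjekata"
hypothesis = "velik broj poslovni subjekata"
print(f"WER: {wer(reference, hypothesis):.4f}")  # one wrong word out of four
print(f"CER: {cer(reference, hypothesis):.4f}")  # one missing character out of thirty
```

For a whole test set, the edits and reference lengths are usually summed over all utterances before dividing, rather than averaging per-utterance rates.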

Several related models are available. In terms of CER and WER, the best-performing one is wav2vec2-large-slavic-parlaspeech-hr-lm.

💻 Usage Examples

Basic Usage

# Tested with `transformers==4.18.0`, `torch==1.11.0`, and `SoundFile==0.10.3.post1`.
# Wav2Vec2ProcessorWithLM also needs the `pyctcdecode` and `kenlm` packages.
from transformers import Wav2Vec2ProcessorWithLM, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# load the processor (feature extractor, tokenizer, LM decoder) and the model
processor = Wav2Vec2ProcessorWithLM.from_pretrained(
    "classla/wav2vec2-xls-r-parlaspeech-hr-lm")
model = Wav2Vec2ForCTC.from_pretrained("classla/wav2vec2-xls-r-parlaspeech-hr-lm").to(device)
# download the example wav file:
os.system("wget https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr/raw/main/00020570a.flac.wav")
# read the wav file
speech, sample_rate = sf.read("00020570a.flac.wav")
# extract features and move them to the same device as the model
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits
# the LM decoder works on NumPy arrays, so move the logits back to the CPU first
transcription = processor.batch_decode(logits.cpu().numpy()).text[0]

# remove the raw wav file
os.remove("00020570a.flac.wav")
transcription

# transcription: 'velik broj poslovnih subjekata posluje sa minusom velik dio'
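
Wav2Vec2 feature extractors in `transformers` are configured for a fixed sampling rate (16 kHz for XLS-R models), so it can help to check your own recordings before transcribing them. The sketch below uses only the standard-library `wave` module and a hypothetical helper named `check_wav`; it assumes a PCM-encoded WAV file (other encodings make `wave` raise an error) and only inspects the header, it does not resample or downmix.

```python
import os
import tempfile
import wave

EXPECTED_SAMPLE_RATE = 16_000  # XLS-R models are pretrained on 16 kHz audio


def check_wav(path, expected_rate=EXPECTED_SAMPLE_RATE):
    """Return a list of problems found in a PCM WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != expected_rate:
            problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
        if channels != 1:
            problems.append(f"{channels} channels found, expected mono")
    return problems


# Demo: write a short silent 8 kHz stereo file and inspect it.
demo_path = os.path.join(tempfile.gettempdir(), "check_wav_demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(8_000)
    wav.writeframes(b"\x00\x00" * 2 * 80)  # 80 frames of stereo silence
demo_problems = check_wav(demo_path)
print(demo_problems)
os.remove(demo_path)
```

A file that fails either check would need to be resampled or downmixed with an audio library of your choice before it is passed to the processor.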

🔧 Technical Details

Training hyperparameters

In fine-tuning, the following arguments were used:

Argument	Value
`per_device_train_batch_size`	16
`gradient_accumulation_steps`	4
`num_train_epochs`	8
`learning_rate`	3e-4
`warmup_steps`	500
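
As a small worked example, the values in the table above can be collected into a dictionary to see how many samples contribute to each optimizer step. The number of devices used for training is not stated in the source, so `num_devices=1` below is an assumption, and `effective_batch_size` is a hypothetical helper.

```python
# Values copied from the hyperparameter table.
hyperparameters = {
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 8,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}


def effective_batch_size(params, num_devices=1):
    """Samples per optimizer step: per-device batch x accumulation steps x devices."""
    return (params["per_device_train_batch_size"]
            * params["gradient_accumulation_steps"]
            * num_devices)


# num_devices=1 is an assumption; the source does not say how many GPUs were used.
print(effective_batch_size(hyperparameters))  # 16 * 4 * 1 = 64
```

The keys match parameter names of the `TrainingArguments` class in `transformers`, so the dictionary could be unpacked into it together with an `output_dir`; any setting not listed in the table would fall back to the library defaults, which may differ from what the authors used.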
