wav2vec2-large-xlsr-53-italian: an open-source model for automatic Italian speech recognition

Wav2vec2 Large Xlsr 53 Italian

Developed by jonatasgrosman

An Italian automatic speech recognition model fine-tuned from the facebook/wav2vec2-large-xlsr-53 model, trained on the Common Voice 6.1 dataset

Speech Recognition · Open Source License: Apache-2.0 · Tags: Italian speech recognition, low word error rate, XLSR fine-tuning

Downloads 1,012

Release Time : 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Italian, fine-tuned from the XLSR-53 model. It expects speech input sampled at 16 kHz.

Model Features

High-performance Italian recognition

Achieves a word error rate (WER) of 9.41% and a character error rate (CER) of 2.29% on the Common Voice Italian test set

Language model enhancement

When combined with a language model, the word error rate can be further reduced to 6.91% and the character error rate to 1.83%

Multi-scenario applicability

Performs well on both the Common Voice test set and the Robust Speech Event development set, which suggests it generalizes beyond its training data

Easy integration

Provides two usage methods: the HuggingSound library and custom scripts, facilitating quick integration into applications

Model Capabilities

Italian speech-to-text

16kHz audio processing

Batch speech recognition

Long audio chunk processing

Use Cases

Speech transcription

Italian speech content transcription

Convert Italian speech content into text format

Highly accurate transcription results, suitable for content archiving and analysis

Voice assistants

Italian voice command recognition

Used for command recognition in Italian voice assistant systems

Low-latency, high-accuracy command recognition

Accessibility applications

Speech-to-text assistance

Provides real-time speech-to-text services for hearing-impaired individuals

Highly accurate real-time conversion

🚀 XLSR Wav2Vec2 Italian by Jonatas Grosman

This is a fine-tuned XLSR-53 large model for Italian speech recognition.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Italian, using the train and validation splits of Common Voice 6.1. When using this model, ensure that your speech input is sampled at 16 kHz.

This model was fine-tuned thanks to the GPU credits generously provided by OVHcloud :)
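Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module, so it covers uncompressed WAV files only. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the model or of HuggingSound.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file must be resampled before it is passed to the model."""
    return wav_sample_rate(path) != expected_rate
```

Files that fail this check can be resampled on load, for example with `librosa.load(path, sr=16_000)` as the custom inference script on this page does.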

The training script can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

✨ Features

Multilingual Base: Built on the multilingual XLSR-53 large model and adapted to Italian speech recognition.
High-Quality Results: Achieves low WER and CER on the test set, with further improvements when a language model is used.
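For readers unfamiliar with the two metrics, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is a plain-Python illustration, not the evaluation code used for this model. It scores one reference/prediction pair taken from the comparison table on this page and counts spaces as characters for CER, which is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one non-empty reference sentence."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one non-empty reference sentence."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "ALAN CLARKE"
prediction = "ALAN CLARK"
print(f"WER: {wer(reference, prediction):.2%}")  # WER: 50.00%
print(f"CER: {cer(reference, prediction):.2%}")  # CER: 9.09%
```

Corpus-level scores are usually obtained by summing the edit distances over all sentences and dividing by the total reference length, rather than by averaging per-sentence rates.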

📦 Installation

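The original model card gives no installation steps. The basic example below requires the huggingsound package (available from PyPI via pip install huggingsound); the custom inference script imports torch, librosa, datasets, and transformers.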

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-italian")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "it"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-italian"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Here is a comparison table of reference and prediction results:

Reference	Prediction
POI LEI MORÌ.	POI LEI MORÌ
IL LIBRO HA SUSCITATO MOLTE POLEMICHE A CAUSA DEI SUOI CONTENUTI.	IL LIBRO HA SUSCITATO MOLTE POLEMICHE A CAUSA DEI SUOI CONTENUTI
"FIN DALL'INIZIO LA SEDE EPISCOPALE È STATA IMMEDIATAMENTE SOGGETTA ALLA SANTA SEDE."	FIN DALL'INIZIO LA SEDE EPISCOPALE È STATA IMMEDIATAMENTE SOGGETTA ALLA SANTA SEDE
IL VUOTO ASSOLUTO?	IL VUOTO ASSOLUTO
DOPO ALCUNI ANNI, EGLI DECISE DI TORNARE IN INDIA PER RACCOGLIERE ALTRI INSEGNAMENTI.	DOPO ALCUNI ANNI EGLI DECISE DI TORNARE IN INDIA PER RACCOGLIERE ALTRI INSEGNAMENTI
SALVATION SUE	SALVATION SOO
IN QUESTO MODO, DECIO OTTENNE IL POTERE IMPERIALE.	IN QUESTO MODO DECHO OTTENNE IL POTERE IMPERIALE
SPARTA NOVARA ACQUISISCE IL TITOLO SPORTIVO PER GIOCARE IN PRIMA CATEGORIA.	PARCANOVARACFILISCE IL TITOLO SPORTIVO PER GIOCARE IN PRIMA CATEGORIA
IN SEGUITO, KYGO E SHEAR HANNO PROPOSTO DI CONTINUARE A LAVORARE SULLA CANZONE.	IN SEGUITO KIGO E SHIAR HANNO PROPOSTO DI CONTINUARE A LAVORARE SULLA CANZONE
ALAN CLARKE	ALAN CLARK
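In the inference script above, `torch.argmax` followed by `processor.batch_decode` amounts to greedy CTC decoding: take the most likely token for each audio frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates this on a toy example. The vocabulary, the blank index, and the `|` word delimiter are assumptions made for illustration; the real model loads its own vocabulary through the processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary; not the model's actual vocabulary.
VOCAB = ["<pad>", "|", "C", "I", "A", "O"]
BLANK_ID = 0          # CTC blank, assumed here to be the padding token
WORD_DELIMITER = "|"  # assumed word-boundary symbol


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids for one utterance, as torch.argmax(logits, dim=-1) would give.
frames = [2, 2, 0, 3, 3, 4, 0, 0, 5, 1, 1, 2, 3, 0, 4, 4, 5]
print(ctc_greedy_decode(frames))  # CIAO CIAO
```

A blank between two identical tokens keeps both of them, which is how CTC represents doubled letters. Greedy decoding uses no language model, which is consistent with the lower error rates this page reports when a language model is added.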

📚 Documentation

Evaluation

To evaluate on the test split of mozilla-foundation/common_voice_6_0:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-italian --dataset mozilla-foundation/common_voice_6_0 --config it --split test

To evaluate on the validation split of speech-recognition-community-v2/dev_data, processing long audio in chunks:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-italian --dataset speech-recognition-community-v2/dev_data --config it --split validation --chunk_length_s 5.0 --stride_length_s 1.0
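The `--chunk_length_s 5.0 --stride_length_s 1.0` options split long recordings into overlapping chunks. The eval.py script is not reproduced on this page, so its exact behavior is an assumption here. The sketch below follows a common scheme in which each chunk overlaps its neighbours by the stride on both sides and the predictions in the overlapping margins are discarded. `chunk_spans` is a hypothetical helper, not part of any library.

```python
def chunk_spans(num_samples, sample_rate=16_000, chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end, left_stride, right_stride) tuples in samples.

    Each chunk covers samples [start, end); the strides mark the overlapping
    margins whose predictions are dropped when chunks are stitched together.
    """
    chunk_len = int(round(chunk_length_s * sample_rate))
    stride = int(round(stride_length_s * sample_rate))
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk length must be more than twice the stride length")

    spans = []
    for start in range(0, num_samples, step):
        end = min(start + chunk_len, num_samples)
        is_last = start + chunk_len >= num_samples
        left = 0 if start == 0 else stride   # first chunk has no left margin
        right = 0 if is_last else stride     # last chunk has no right margin
        spans.append((start, end, left, right))
        if is_last:
            break
    return spans


# A 12-second clip at 16 kHz, with the settings from the command above.
for start, end, left, right in chunk_spans(12 * 16_000):
    print(f"chunk {start / 16_000:.0f}s to {end / 16_000:.0f}s, "
          f"kept {(start + left) / 16_000:.0f}s to {(end - right) / 16_000:.0f}s")
```

With a 5-second chunk and a 1-second stride, consecutive chunks start 3 seconds apart, and the kept regions tile the recording without gaps or overlap.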

Model Information

Property	Details
Model Type	Fine-tuned XLSR-53 large model for Italian speech recognition
Training Data	Common Voice 6.1 (train and validation splits for Italian)
Metrics	WER, CER
Tags	audio, automatic-speech-recognition, hf-asr-leaderboard, it, mozilla-foundation/common_voice_6_0, robust-speech-event, speech, xlsr-fine-tuning-week

📄 License

This model is licensed under the Apache 2.0 license.

📚 Citation

If you want to cite this model, you can use the following BibTeX entry:

@misc{grosman2021xlsr53-large-italian,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {I}talian},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-italian}},
  year={2021}
}
