wav2vec2-large-xlsr-53-italian Open Source Model - Free and Accurate Automatic Italian Speech Recognition

Wav2vec2 Large Xlsr 53 Italian

Developed by Facebook

A large-scale Italian automatic speech recognition model based on the Wav2Vec2 architecture, fine-tuned on the Common Voice dataset and released by Facebook.

Speech Recognition · Other · Open Source License: Apache-2.0 · #Italian Speech Recognition #High-precision ASR #Multi-dialect Adaptation

Downloads 4,013

Release Time : 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) system based on the Wav2Vec2 architecture. It is fine-tuned for Italian and converts Italian audio into text.

Model Features

Large-scale Pretraining

Built on XLSR-53, a large-scale multilingual speech representation model pretrained on 53 languages

Italian Language Optimization

Specifically fine-tuned for Italian to improve recognition accuracy

Efficient Speech Processing

Expects audio input sampled at 16 kHz, a common rate for speech applications; audio recorded at other rates must be resampled first

Model Capabilities

Italian audio-to-text conversion

Speech recognition

Speech transcription

Use Cases

Speech Transcription

Italian Meeting Minutes

Automatically convert Italian meeting recordings into written transcripts

22.1% WER on the Common Voice test set

Voice Assistants

Provide speech recognition capabilities for Italian voice assistants

Accessibility Applications

Real-time Caption Generation

Generate real-time captions for Italian video content

🚀 Speech Recognition Model Evaluation

This project focuses on evaluating a speech recognition model on the Italian Common Voice dataset. It uses the Wav2Vec2 architecture for Automatic Speech Recognition (ASR) and provides a Python script to calculate the Word Error Rate (WER).

🚀 Quick Start

Prerequisites

Make sure you have the necessary libraries installed. You can install them using pip (jiwer is required by the WER metric that the script loads):

pip install torchaudio datasets transformers torch jiwer

Run the Evaluation Script

The following Python script evaluates the model on the Italian Common Voice test dataset and calculates the WER.

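import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_name = "facebook/wav2vec2-large-xlsr-53-italian"
# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Punctuation stripped from the reference sentences before scoring.
chars_to_ignore_regex = r'[,?.!\-;:"]'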

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)

ds = load_dataset("common_voice", "it", split="test", data_dir="./cv-corpus-6.1-2020-12-11")

# The clips are assumed to be 48 kHz; the model expects 16 kHz input.
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def map_to_array(batch):
    # Load the clip, resample it, and normalize the reference sentence.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch
    
ds = ds.map(map_to_array)

def map_to_pred(batch):
    # Pad the clips in the batch to a common length and build the attention mask.
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    # Inference only, so gradients are not needed.
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    # Greedy decoding: take the most likely token at every frame.
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch
    
result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))

wer = load_metric("wer")

print(wer.compute(predictions=result["predicted"], references=result["target"]))
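The text normalization that map_to_array applies to each reference sentence can be checked in isolation. The sketch below reuses the same character class and the same lowercase and apostrophe steps; the function name normalize and the sample sentence are illustrative assumptions, not part of the original script.

```python
import re

# Same character class as chars_to_ignore_regex in the script, written as a raw string.
CHARS_TO_IGNORE = r'[,?.!\-;:"]'


def normalize(sentence: str) -> str:
    """Strip punctuation, lowercase, and map the typographic apostrophe to ASCII."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower().replace("’", "'")


# Hypothetical sample sentence, not taken from Common Voice.
print(normalize("L’università è chiusa, oggi!"))  # l'università è chiusa oggi
```

Because the predictions are compared against these normalized references, punctuation and casing differences do not count as errors in the reported WER.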

Result

The Word Error Rate (WER) of the model on the Italian Common Voice test dataset is 22.1%.
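For readers unfamiliar with the metric, WER is usually defined as the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of words in the reference. The following is a minimal pure-Python sketch of that definition; the helper names and the example sentences are hypothetical, and the evaluation script itself relies on the library metric rather than on this code.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, computed one row at a time."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words across all pairs."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words


# Hypothetical references and predictions: one dropped word out of eight.
refs = ["il gatto dorme sul divano", "buongiorno a tutti"]
preds = ["il gatto dorme sul divano", "buongiorno tutti"]
print(word_error_rate(refs, preds))  # 0.125
```

Under this definition, a WER of 22.1% corresponds to roughly one word-level edit for every four to five reference words.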

✨ Features

Speech Recognition: Utilizes the Wav2Vec2 architecture for automatic speech recognition.
Dataset Integration: Works with the Italian Common Voice dataset.
Evaluation Metric: Calculates the Word Error Rate (WER) to measure the performance of the model.

📦 Installation

To install the required libraries, run the following command:

pip install torchaudio datasets transformers torch jiwer

💻 Usage Examples

Basic Usage

The Python script in Quick Start is a complete example of evaluating the model on the Italian Common Voice test dataset. As written, it loads the corpus from the local directory passed as data_dir, so place the Common Voice files there before running it.

Advanced Usage

You can modify the script to use different models or datasets. For example, you can change the model_name variable to use a different pre-trained model or change the load_dataset parameters to use a different dataset.

# Example of using a different model
model_name = "another_model_name"
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)

# Example of using a different dataset
ds = load_dataset("another_dataset", "another_language", split="test", data_dir="./another_dataset_dir")
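The same processor and model calls can also transcribe a single recording instead of a whole dataset. The sketch below is one way to arrange them, under stated assumptions: the function name transcribe, the CPU default, the mono down-mix, and the file name in the usage comment are all illustrative and do not come from the original script.

```python
def transcribe(path, model_name="facebook/wav2vec2-large-xlsr-53-italian", device="cpu"):
    """Return the greedy CTC transcription of one audio file."""
    # Heavy dependencies are imported lazily so that defining the function stays cheap.
    import torch
    import torchaudio
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)

    # torchaudio.load returns a (channels, samples) tensor and the file's sample rate.
    waveform, sample_rate = torchaudio.load(path)
    mono = waveform.mean(dim=0)  # assumption: averaging channels is acceptable
    if sample_rate != 16_000:
        mono = torchaudio.functional.resample(mono, sample_rate, 16_000)

    features = processor(mono.numpy(), sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(
            features.input_values.to(device),
            attention_mask=features.attention_mask.to(device),
        ).logits
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)[0]


# Usage with a hypothetical file name:
# print(transcribe("recording.wav"))
```

Unlike the evaluation script, this version reads the sample rate from the file rather than assuming 48 kHz, which matters for recordings that do not come from Common Voice. Loading the model on every call is wasteful; for repeated use, load the processor and model once and pass them in.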

📚 Documentation

Wav2Vec2ForCTC: A model for Connectionist Temporal Classification (CTC) based on the Wav2Vec2 architecture.
Wav2Vec2Processor: A processor that can be used to preprocess audio data and decode model outputs.
load_dataset: A function from the datasets library to load a dataset.
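load_metric: A function from the datasets library to load an evaluation metric, here WER. Later datasets releases deprecated it in favor of the separate evaluate library, so the script may need an older datasets version or a switch to evaluate.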
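To make the CTC part concrete: the model emits one token id per audio frame, and the decoder turns that frame sequence into text by collapsing consecutive repeats and then dropping the blank token. The sketch below shows that greedy decoding rule on a toy example. The five-entry vocabulary, the choice of id 0 as blank, and the function name are assumptions for illustration; the real processor uses the model's own vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then remove blanks and join the characters."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return "".join(id_to_char[token] for token in collapsed if token != blank_id)


# Toy vocabulary (hypothetical): id 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "c", 2: "i", 3: "a", 4: "o"}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would produce for one clip.
frames = [1, 1, 0, 2, 2, 3, 0, 0, 4, 4]
print(ctc_greedy_decode(frames, vocab))  # ciao
```

The blank token is what lets the model output a doubled letter: the frames 3, 0, 3 decode to "aa", while 3, 3 collapse to a single "a".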

📄 License

This project is licensed under the Apache 2.0 license.
