The xlsr-en-punctuation open-source speech recognition model enables free English speech recognition and punctuation prediction.

Xlsr En Punctuation

Developed by boris

An automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the English Common Voice dataset, with support for punctuation prediction.

Speech Recognition | English | Open Source License: Apache-2.0 | #English speech recognition #XLSR-53 pretrained #Low word error rate

Downloads 30.28k

Release Time: 3/2/2022

Model Overview

This is a Wav2Vec2 model for English automatic speech recognition (ASR) that can convert speech to text and automatically add punctuation.

Model Features

Multilingual pretraining

Fine-tuned from the XLSR-53 multilingual model with strong cross-lingual representation capabilities

Punctuation prediction

Not only recognizes speech content but also automatically predicts and adds punctuation

High accuracy

Achieves 1.0% word error rate (WER) on the Common Voice English test set

Model Capabilities

English speech recognition

Automatic punctuation prediction

16kHz audio processing

Use Cases

Speech transcription

Automatic meeting minutes generation

Automatically converts meeting recordings into punctuated transcripts

High accuracy reduces manual proofreading workload

Podcast subtitle generation

Automatically generates punctuated subtitle files for English podcasts

Transcripts can be converted to common subtitle formats such as SRT in a post-processing step
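
The usage code on this page returns plain text strings, so subtitle output needs a separate formatting step. The following is a minimal sketch of that step, assuming the start and end time of each segment is obtained elsewhere (for example, by transcribing fixed-length audio chunks). `format_timestamp` and `segments_to_srt` are hypothetical helper names, not part of the model or of the transformers library.

```python
from typing import List, Tuple


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each SRT cue is: running index, time range, then the subtitle text.
        time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text.strip()}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


# Hypothetical segments; in practice the text would come from the model.
example = [(0.0, 2.5, "Hello, world."), (2.5, 4.0, "How are you?")]
print(segments_to_srt(example))
```

Because the model predicts punctuation, the segment text can be used as subtitle text directly; only the timing has to be supplied by the surrounding pipeline.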

Assistive technology

Voice input system

Provides high-accuracy voice input solutions for people with disabilities

Improves input efficiency and accuracy

🚀 Wav2Vec2-Large-XLSR-53-English

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on English using the Common Voice dataset. It is designed for automatic speech recognition and requires speech input sampled at 16kHz.

🚀 Quick Start

The model can be used directly, without a language model, through the Wav2Vec2ForCTC and Wav2Vec2Processor classes shown in the usage examples. Make sure that your speech input is sampled at 16kHz.

✨ Features

Dataset: Utilizes the Common Voice dataset for training.
Metric: Evaluated using Word Error Rate (WER).
Task: Specialized for automatic speech recognition.

Property	Details
Model Type	English XLSR Wav2Vec2 Large 53 with punctuation
Training Data	Common Voice

📦 Installation

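The original model card lists no installation steps. The usage examples import torch, torchaudio, datasets, and transformers, so these packages need to be installed first.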

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# TODO: replace {lang_id} with an ISO language code listed at https://huggingface.co/languages (for English, "en").
test_dataset = load_dataset("common_voice", "{lang_id}", split="test[:2%]")

# TODO: replace {model_id} with the model id, which has the form {your_username}/{your_modelname}, e.g. elgeish/wav2vec2-large-xlsr-53-arabic.
processor = Wav2Vec2Processor.from_pretrained("{model_id}")
model = Wav2Vec2ForCTC.from_pretrained("{model_id}")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: read each audio file into an array.
# The fixed resampler assumes 48 kHz source audio and converts it to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
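
The `torch.argmax` and `processor.batch_decode` steps above amount to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, and drop the blank token. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The toy vocabulary, the blank id of 0, and the use of `|` as a word delimiter are assumptions made for illustration; the real vocabulary is stored with the model's tokenizer.

```python
from typing import Dict, List


def ctc_greedy_decode(ids: List[int], id_to_char: Dict[int, str], blank_id: int = 0) -> str:
    """Collapse repeated ids, drop blanks, then map the remaining ids to characters."""
    chars = []
    previous = None
    for token_id in ids:
        # Emit a character only when the id changes and is not the blank token.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    # "|" is treated as the word delimiter here and rendered as a space.
    return "".join(chars).replace("|", " ").strip()


# Hypothetical toy vocabulary that includes a punctuation token.
toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "i", 4: ",", 5: "y", 6: "o"}
frame_ids = [2, 2, 0, 3, 4, 0, 1, 1, 5, 0, 6, 6]  # one id per audio frame
print(ctc_greedy_decode(frame_ids, toy_vocab))  # hi, yo
```

A blank between two identical ids keeps both characters, which is how CTC represents doubled letters. Punctuation marks are ordinary vocabulary entries in this scheme, so they are decoded the same way as letters.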

Advanced Usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

# TODO: replace {lang_id} with an ISO language code listed at https://huggingface.co/languages (for English, "en").
test_dataset = load_dataset("common_voice", "{lang_id}", split="test")
wer = load_metric("wer")

# TODO: replace {model_id} with the model id, which has the form {your_username}/{your_modelname}, e.g. elgeish/wav2vec2-large-xlsr-53-arabic.
processor = Wav2Vec2Processor.from_pretrained("{model_id}")
model = Wav2Vec2ForCTC.from_pretrained("{model_id}")
model.to("cuda")

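# TODO: adapt this list to the special characters that were removed from the training data.
# Note: the pattern is applied to the reference sentences only, so for a model that predicts
# punctuation, references and predictions may need the same normalization before scoring.
chars_to_ignore_regex = r'[,?.!\-;:"“]'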
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: normalize each reference sentence and read each audio file into an array.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and decode the predicted token ids into strings.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
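
The WER reported by the script above is, by the usual definition, the number of word substitutions, deletions, and insertions needed to turn the predictions into the references, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, assuming whitespace tokenization; `word_error_rate` is a hypothetical helper, not the implementation behind `load_metric("wer")`.

```python
from typing import List


def word_error_rate(references: List[str], predictions: List[str]) -> float:
    """Word-level edit distance summed over all pairs, divided by total reference words."""
    total_errors = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        hyp_words = prediction.split()
        # Single-row dynamic programming for the Levenshtein distance over words.
        row = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            diagonal, row[0] = row[0], i
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution = diagonal + (ref_word != hyp_word)
                diagonal = row[j]  # save the old value before overwriting it
                row[j] = min(substitution, row[j] + 1, row[j - 1] + 1)
        total_errors += row[-1]
        total_words += len(ref_words)
    # Assumes at least one reference word; an empty reference set would divide by zero.
    return total_errors / total_words


print(word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"]))
print(word_error_rate(["hello, world."], ["hello world"]))
```

The second call shows why normalization matters for a punctuation-predicting model: with whitespace tokenization, "hello," and "hello" are different words, so a punctuation mismatch counts as a word error.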

⚠️ Important Note

When using this model, make sure that your speech input is sampled at 16kHz.
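
One way to catch a sampling-rate mismatch early is to read the rate from the file header before running inference. The following is a minimal sketch using only the Python standard library, assuming the input is an uncompressed PCM WAV file; `wav_sample_rate` and `needs_resampling` are hypothetical helper names. Other formats such as MP3 would need a different reader, for example the `torchaudio.load` call used in the usage examples, which also returns the sampling rate.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model requires, as stated on this page


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected_rate: int = EXPECTED_RATE) -> bool:
    """True when the file's sampling rate differs from the rate the model expects."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns True, the audio should be resampled to 16kHz before it is passed to the processor, as the usage examples do with `torchaudio.transforms.Resample`.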

💡 Usage Tip

Remember to replace the placeholders such as {lang_id} and {model_id} with your actual values.

📚 Documentation

Test Result

Test Result: XX.XX %

Training

The Common Voice train, validation, and ... datasets were used for training as well as ... and ...

The script used for training can be found here

📄 License

This model is licensed under the Apache-2.0 license.
