🚀 Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings
Wav2Vec2-Conformer with relative position embeddings, pretrained and fine-tuned on 960 hours of LibriSpeech on 16kHz sampled speech audio.
Key Information
| Property | Details |
|---|---|
| Model Type | Wav2Vec2-Conformer with relative position embeddings |
| Training Data | 960 hours of LibriSpeech on 16kHz sampled speech audio |
| Tags | speech, audio, automatic-speech-recognition, hf-asr-leaderboard |
| License | apache-2.0 |
Model Results
The model's performance on different datasets is as follows:
| Dataset | Task | Metric | Value |
|---|---|---|---|
| LibriSpeech (clean) | Automatic Speech Recognition | Test WER | 1.85 |
| LibriSpeech (other) | Automatic Speech Recognition | Test WER | 3.83 |
Paper and Authors
The results of Wav2Vec2-Conformer can be found in Table 3 and Table 4 of the official paper. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
🚀 Quick Start
When using the model, make sure that your speech input is also sampled at 16kHz.
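If your audio is stored at a different sampling rate, it can be resampled on the fly with the 🤗 Datasets `Audio` feature. The snippet below is a minimal sketch; the dataset name is a placeholder for your own data.

```python
from datasets import load_dataset, Audio

# hypothetical example: load your own dataset (the name below is a placeholder)
ds = load_dataset("your_dataset_name", split="test")

# re-decode the "audio" column at 16kHz so it matches the model's expected input
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
```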
💻 Usage Examples
Basic Usage
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")

# load dummy dataset and read audio files
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# tokenize
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
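For a quick sanity check, the decoded output (a list with one string per input example) can be printed; a minimal follow-up to the snippet above:

```python
# transcription is a list of strings, one per input example
print(transcription[0])
```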
Advanced Usage
This code snippet shows how to evaluate facebook/wav2vec2-conformer-rel-pos-large-960h-ft on LibriSpeech's "clean" and "other" test data.
```python
from datasets import load_dataset
from transformers import Wav2Vec2ConformerForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# load the "clean" test split of LibriSpeech
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")

def map_to_pred(batch):
    inputs = processor(batch["audio"]["array"], return_tensors="pt", padding="longest")
    input_values = inputs.input_values.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
Result (WER):

| "clean" | "other" |
|---|---|
| 1.85 | 3.82 |
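The snippet above evaluates the "clean" test split; to reproduce the "other" column, the same map function can be reused on the other test split. A minimal sketch, assuming `model`, `processor`, and `map_to_pred` from the snippet above are still defined:

```python
# hypothetical follow-up: evaluate on LibriSpeech's "other" test split
librispeech_other = load_dataset("librispeech_asr", "other", split="test")
result_other = librispeech_other.map(map_to_pred, remove_columns=["audio"])
print("WER (other):", wer(result_other["text"], result_other["transcription"]))
```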