Wav2Vec2-Conformer-RoPE-Large-960h-ft: Open-Source Model for English Speech Recognition

Wav2vec2 Conformer Rope Large 960h Ft

Developed by facebook

This model uses rotary position embeddings, is pre-trained and fine-tuned on 960 hours of LibriSpeech audio sampled at 16 kHz, and is suitable for English speech recognition.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#High-precision speech recognition #Rotary position embedding #English speech processing

Downloads 22.02k

Release Time : 4/18/2022

Model Overview

The Wav2Vec2 Conformer model uses rotary position embeddings, targets high-accuracy English speech recognition, and expects audio input sampled at 16 kHz.

Model Features

Rotary Position Embedding Technology

Uses rotary position embeddings (RoPE), which encode each frame's position by rotating query and key features, to improve the model's handling of long speech sequences.
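To make the idea concrete, here is a minimal NumPy sketch of rotary position embeddings in general. It is not the model's actual implementation: the function name `rotary_embed`, the "rotate half" feature pairing, and the base of 10000 are assumptions chosen for illustration.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Hypothetical helper for illustration; dim must be even.
    """
    seq_len, dim = x.shape
    if dim % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = dim // 2
    # One rotation frequency per pair of features.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    # Rotation angle for every (position, frequency) combination.
    angles = np.outer(np.arange(seq_len), inv_freq)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = np.tile(rng.normal(size=8), (6, 1))  # the same query vector at 6 positions
k = np.tile(rng.normal(size=8), (6, 1))  # the same key vector at 6 positions
scores = rotary_embed(q) @ rotary_embed(k).T
# Entries with the same offset between query and key positions match.
same_offset_matches = bool(np.allclose(scores[0, 2], scores[3, 5]))
```

Because every feature pair is only rotated, vector norms are unchanged, and the query-key dot product depends on the offset between positions rather than on their absolute values.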

Large-scale Training Data

Pre-trained and fine-tuned on 960 hours of LibriSpeech audio data.

High-precision Recognition

Achieves a word error rate (WER) of 1.96% on the LibriSpeech "clean" test set and 3.98% on the "other" test set.

Model Capabilities

English speech recognition

16kHz audio processing

Long speech sequence transcription

Use Cases

Speech Transcription

Meeting Transcription

Automatically transcribes meeting recordings into text records

Highly accurate transcription results

Voice Note Conversion

Converts voice notes into editable text

Voice Assistant

Voice Command Recognition

Recognizes and understands user voice commands

🚀 Wav2Vec2-Conformer-Large-960h with Rotary Position Embeddings

Wav2Vec2 Conformer with rotary position embeddings, pretrained and fine-tuned on 960 hours of LibriSpeech, designed for speech audio sampled at 16 kHz.

Key Information

Property	Details
Model Type	Wav2Vec2-Conformer-Large-960h with Rotary Position Embeddings
Training Data	LibriSpeech ASR
Tags	speech, audio, automatic-speech-recognition, hf-asr-leaderboard
License	apache-2.0

Model Performance

The model was evaluated on the LibriSpeech test sets, with the following word error rates (WER):

Dataset	Test WER (%)
LibriSpeech (clean)	1.96
LibriSpeech (other)	3.98
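For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition. The function name `word_error_rate` is hypothetical, and a library such as jiwer should be preferred for real evaluation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{example_wer:.2%}")  # prints 33.33%
```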

Paper and Authors

Paper: fairseq S2T: Fast Speech-to-Text Modeling with fairseq
Authors: Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino

The results of Wav2Vec2-Conformer can be found in Table 3 and Table 4 of the official paper. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

🚀 Quick Start

Prerequisites

When using the model, make sure that your speech input is sampled at 16 kHz.
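If your audio was recorded at a different rate, it needs to be resampled first. The source does not name a resampling tool, so the sketch below assumes SciPy's `resample_poly`. The helper name `to_16khz` and the synthetic 440 Hz tone are illustrative only.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate the model expects

def to_16khz(waveform, source_rate):
    """Resample a 1-D waveform from source_rate to 16 kHz (hypothetical helper)."""
    if source_rate == TARGET_RATE:
        return waveform
    g = gcd(TARGET_RATE, source_rate)
    # resample_poly upsamples by `up`, low-pass filters, then downsamples by `down`.
    return resample_poly(waveform, TARGET_RATE // g, source_rate // g)

# One second of a 440 Hz tone at 44.1 kHz stands in for real audio.
t = np.arange(44_100) / 44_100
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)
resampled = to_16khz(tone, 44_100)
```

One second of input at 44.1 kHz yields 16,000 output samples, which can then be passed to the processor.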

💻 Usage Examples

Basic Usage

from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")
model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

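# extract input features, stating the sampling rate explicitly
sample = ds[0]["audio"]
input_values = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt", padding="longest").input_values

# retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits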

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
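The argmax-and-decode step above is greedy CTC decoding: merge runs of repeated frame predictions, drop the blank token, and join the remaining characters. The sketch below illustrates that logic in plain Python. It is not the tokenizer's actual code, and the toy vocabulary, `blank_id`, and `word_delimiter` values are assumptions for illustration.

```python
from itertools import groupby

def ctc_greedy_decode(ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text (hypothetical helper)."""
    # 1. Merge runs of identical consecutive frame predictions.
    collapsed = [token_id for token_id, _ in groupby(ids)]
    # 2. Drop the CTC blank token.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3. Join characters and turn the word delimiter into a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()

# Hypothetical toy vocabulary; the real one ships with the processor.
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "T", 5: "O"}
frame_ids = [2, 2, 0, 3, 3, 1, 1, 4, 0, 5, 5, 0, 5]
decoded = ctc_greedy_decode(frame_ids, vocab)
print(decoded)  # prints HI TOO
```

The blank between the two runs of `O` frames is what lets a genuinely doubled letter survive the merge step.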

Advanced Usage

This code snippet shows how to evaluate facebook/wav2vec2-conformer-rope-large-960h-ft on LibriSpeech's "clean" test data. To evaluate on the "other" test data, load the dataset with the "other" configuration instead of "clean".

from datasets import load_dataset
from transformers import Wav2Vec2ConformerForCTC, Wav2Vec2Processor
import torch
from jiwer import wer


librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rope-large-960h-ft")

def map_to_pred(batch):
    # map() is called once per example here, so batch holds a single utterance
    audio = batch["audio"]
    inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest")
    input_values = inputs.input_values.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list with one string; store the string itself
    batch["transcription"] = processor.batch_decode(predicted_ids)[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

# jiwer returns WER as a fraction; multiply by 100 to compare with the table
print("WER:", wer(result["text"], result["transcription"]))

Result (WER, %):

"clean"	"other"
1.96	3.98
