Open-source Speech Recognition Model wav2vec2-conformer-rel-pos-large-100h-ft: Accurate Recognition for Efficient Speech Processing

Wav2vec2 Conformer Rel Pos Large 100h Ft

Developed by facebook

A large Wav2Vec2-Conformer speech recognition model with relative position embeddings, fine-tuned on 100 hours of Librispeech data

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Relative Position Embedding #High-precision Speech Recognition #Librispeech Fine-tuning

Downloads 99

Release Time: 4/18/2022

Model Overview

This is an automatic speech recognition (ASR) model based on the Wav2Vec2-Conformer architecture with relative position embeddings. It was fine-tuned on 100 hours of Librispeech data and is intended for English speech recognition on audio sampled at 16 kHz.

Model Features

Relative Position Embedding

Uses relative position embeddings, which may improve recognition on long speech sequences

Conformer Architecture

Combines the advantages of Transformer and CNN, capable of capturing both local and global speech features

Efficient Training

Fine-tuned on 100 hours of Librispeech data rather than the full 960 hours, so fine-tuning needs less labeled data and compute

Model Capabilities

English Speech Recognition

16kHz Sampling Rate Audio Processing

Use Cases

Speech-to-Text

Meeting Minutes

Automatically convert English meeting recordings into text transcripts

Podcast Transcription

Transcribe English podcast content into text
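Meeting recordings and podcasts are usually much longer than the short utterances in Librispeech, so a common approach is to split the audio into overlapping windows and transcribe each window separately. The following is a minimal sketch of that idea, not part of the original model card: the function name `chunk_bounds` and the 30-second window with 2-second overlap are illustrative assumptions, and suitable values depend on available memory.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_seconds=30.0, overlap_seconds=2.0):
    """Return (start, end) sample indices of overlapping windows covering the audio.

    Hypothetical helper: window and overlap lengths are illustrative defaults.
    """
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk length must be positive and larger than the overlap")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the recording
        start += step
    return bounds


# 70 seconds of 16 kHz audio split into 30-second windows with 2 seconds of overlap
windows = chunk_bounds(70 * 16_000)
for start, end in windows:
    print(start, end)
```

Each `(start, end)` pair can be used to slice the waveform before passing it to the processor. Merging the per-window transcripts across the overlap is left out of this sketch.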

🚀 Wav2Vec2-Conformer-Large-100h with Relative Position Embeddings

This project is a Wav2Vec2 Conformer model with relative position embeddings. It is pretrained on 960 hours of Librispeech and fine-tuned on 100 hours of Librispeech for speech audio sampled at 16 kHz.

🚀 Quick Start

The Wav2Vec2 Conformer with relative position embeddings is pretrained on 960 hours of Librispeech and fine-tuned on 100 hours of Librispeech for speech audio sampled at 16 kHz. When using the model, ensure that your speech input is also sampled at 16 kHz.
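If your audio was recorded at a different rate, it has to be resampled first. The following is a minimal sketch using `scipy.signal.resample_poly`; the helper name `to_16k` is hypothetical, and the choice of SciPy is an assumption (the original card does not prescribe a resampling tool). It assumes a mono, 1-D waveform.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate the model expects


def to_16k(waveform, orig_rate, target_rate=TARGET_RATE):
    """Resample a 1-D mono waveform to target_rate (hypothetical helper)."""
    waveform = np.asarray(waveform, dtype=np.float32)
    if orig_rate == target_rate:
        return waveform
    # Reduce the rate ratio to the smallest integer up/down factors.
    g = gcd(orig_rate, target_rate)
    resampled = resample_poly(waveform, target_rate // g, orig_rate // g)
    return resampled.astype(np.float32)


# One second of a 440 Hz tone recorded at 44.1 kHz
t = np.arange(44_100) / 44_100
audio_44k = np.sin(2 * np.pi * 440 * t)
audio_16k = to_16k(audio_44k, 44_100)
print(audio_16k.shape)
```

The resampled array can then be passed to the processor together with `sampling_rate=16_000`.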

Paper: fairseq S2T: Fast Speech-to-Text Modeling with fairseq

Authors: Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Sravya Popuri, Dmytro Okhonko, Juan Pino

The results of Wav2Vec2-Conformer can be found in Table 3 and Table 4 of the official paper. The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

💻 Usage Examples

Basic Usage

 import torch
 from datasets import load_dataset
 from transformers import Wav2Vec2ConformerForCTC, Wav2Vec2Processor
 
 MODEL_ID = "facebook/wav2vec2-conformer-rel-pos-large-100h-ft"
 
 # load model and processor
 processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
 model = Wav2Vec2ConformerForCTC.from_pretrained(MODEL_ID)
 model.eval()
 
 # load dummy dataset and read sound files
 ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
 audio = ds[0]["audio"]
 
 # convert the raw waveform into model inputs (the model expects 16 kHz audio)
 inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest")
 
 # retrieve logits without tracking gradients
 with torch.no_grad():
     logits = model(inputs.input_values).logits
 
 # take argmax and decode
 predicted_ids = torch.argmax(logits, dim=-1)
 transcription = processor.batch_decode(predicted_ids)
 print(transcription)
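The last step of the example (argmax per frame, then `batch_decode`) amounts to greedy CTC decoding: repeated predictions in consecutive frames are collapsed, blank tokens are dropped, and the word delimiter is turned into a space. The following is a simplified sketch of that logic in plain Python. The five-entry vocabulary is a hypothetical toy example; the real vocabulary ships with the checkpoint, and the tokenizer's actual implementation handles more cases.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only.
BLANK = "<pad>"  # Wav2Vec2 CTC tokenizers use the padding token as the CTC blank
VOCAB = [BLANK, "|", "A", "C", "T"]


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank=BLANK, word_delimiter="|"):
    """Turn per-frame argmax ids into text (simplified sketch)."""
    # 1. Collapse runs of identical consecutive ids into one id.
    collapsed = [idx for idx, _ in groupby(frame_ids)]
    # 2. Drop blank tokens; a blank between two equal ids keeps both letters.
    tokens = [vocab[idx] for idx in collapsed if vocab[idx] != blank]
    # 3. Join the characters and map the word delimiter to a space.
    return "".join(tokens).replace(word_delimiter, " ").strip()


frame_ids = [3, 3, 0, 2, 2, 2, 0, 0, 4, 1, 1, 2, 0, 4]
print(greedy_ctc_decode(frame_ids))  # prints "CAT AT"
```

Note that `[2, 2]` decodes to a single `A`, while `[2, 0, 2]` decodes to `AA`: the blank is what lets CTC emit doubled letters.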

📄 License

The project is licensed under the Apache-2.0 license.
