Sew-D-Mid-K127-400K-FT-LS100H Open Source Speech Recognition Model - Efficient Recognition Far Outperforms Wav2Vec 2.0

sew-d-mid-k127-400k-ft-ls100h

Developed by asapp

SEW-D-mid-k127 is an efficient pre-trained speech recognition model developed by ASAPP Research that improves on wav2vec 2.0 in both performance and efficiency.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Efficient Speech Recognition #Low Word Error Rate #16kHz Audio Processing

Downloads 16

Release Time : 3/2/2022

Model Overview

This is a pre-trained model for Automatic Speech Recognition (ASR) based on the SEW (Squeezed and Efficient Wav2vec) architecture. The base model is pre-trained on speech audio sampled at 16 kHz and should be fine-tuned on a downstream task before use.

Model Features

Efficient Architecture Design

Achieves a 1.9x inference speedup over wav2vec 2.0, with a 13.5% relative reduction in word error rate, under the 100h-960h semi-supervised setup on LibriSpeech.

Performance Optimization

With a similar inference time to wav2vec 2.0, reduces word error rate by 25-50% across different model sizes.

Multi-Task Applicability

Can be fine-tuned for downstream tasks such as automatic speech recognition, speaker recognition, intent classification, and emotion recognition.

Model Capabilities

English Speech Recognition

Speech Feature Extraction

Audio Content Transcription

Use Cases

Speech Transcription

Meeting Minutes

Automatically transcribe meeting recordings into text records.

WER 4.99 on LibriSpeech clean test set

Speech-to-Text Service

Provide speech-to-text conversion functionality for applications.

WER 10.95 on LibriSpeech other test set

Speech Analysis

Speaker Recognition

Identify and analyze speech features of different speakers.

🚀 SEW-D-mid-k127

SEW-D by ASAPP Research is a base model pretrained on 16 kHz sampled speech audio, suitable for various speech-related downstream tasks.

When using the model, ensure that your speech input is also sampled at 16 kHz. Note that the base model should be fine-tuned on a downstream task, such as Automatic Speech Recognition, Speaker Identification, Intent Classification, or Emotion Recognition.
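
To make the 16 kHz requirement concrete, here is a minimal sketch of bringing a mono signal to 16 kHz with plain linear interpolation. The function name `resample_linear` is made up for this example and is not part of any library. Linear interpolation applies no anti-aliasing filter, so treat this only as an illustration of the idea; a real pipeline would normally rely on a dedicated, band-limited resampler.

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Naively resample a mono signal (a sequence of floats) by linear interpolation.

    Illustrative only: no low-pass filtering is applied, so downsampling can alias.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)

    # Number of output samples that covers the same duration as the input.
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample

    out = []
    last = len(samples) - 1
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# One second of (silent) audio recorded at 8 kHz becomes 16,000 samples.
one_second_8k = [0.0] * 8_000
print(len(resample_linear(one_second_8k, src_rate=8_000)))
```

The resulting list can then be passed to the processor together with `sampling_rate=16000`.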

Paper: Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

Abstract: This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
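
For readers unfamiliar with the term, a "relative reduction in word error rate" compares the drop in WER to the baseline WER rather than reporting the absolute difference. The sketch below spells out that arithmetic. The helper name `relative_wer_reduction` and the two WER values are invented purely for illustration; they are not figures from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, chosen only to show the calculation.
baseline = 10.0   # baseline system WER, in percent
improved = 8.65   # improved system WER, in percent
print(f"{relative_wer_reduction(baseline, improved):.1%} relative reduction")
```

Note that the absolute difference in this made-up case is 1.35 percentage points, while the relative reduction is that difference divided by the baseline.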

The original model can be found under https://github.com/asappresearch/sew#model-checkpoints.

🚀 Quick Start

The model can be used as a standalone acoustic model for transcribing audio files.

💻 Usage Examples

Basic Usage

from transformers import Wav2Vec2Processor, SEWDForCTC
from datasets import load_dataset
import torch

# load the model and preprocessor
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-mid-k127-400k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-mid-k127-400k-ft-ls100h")

# load the dummy dataset with speech samples
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

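# preprocess (the audio must be sampled at 16 kHz)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values  # Batch size 1

# retrieve logits (gradients are not needed for inference)
with torch.no_grad():
    logits = model(input_values).logits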

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
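
The last two lines perform greedy CTC decoding: pick the most likely token for each frame, then turn the frame-level ids into text. As a rough illustration of what that decoding step involves, the sketch below collapses repeated ids and drops the blank token. It assumes the convention of wav2vec 2.0-style CTC tokenizers, where the padding token doubles as the blank and `|` marks word boundaries. The function `ctc_greedy_decode`, the toy vocabulary, and the frame ids are all made up for this example; the real vocabulary ships with the processor.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse a sequence of per-frame token ids into a string, CTC style."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank_id:
            chars.append(id_to_char[idx])
        prev = idx
    return "".join(chars)


# Toy vocabulary (hypothetical): id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}

# Per-frame argmax ids; the blank between the two 4s keeps both L's.
frames = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 0]
text = ctc_greedy_decode(frames, vocab).replace("|", " ")
print(text)
```

The blank between the repeated `L` frames is what lets CTC emit a genuine double letter instead of merging it into one.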

Advanced Usage

from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = SEWDForCTC.from_pretrained("asapp/sew-d-mid-k127-400k-ft-ls100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-mid-k127-400k-ft-ls100h")

def map_to_pred(batch):
    # batched=True with batch_size=1 yields a one-element list, so take its only item
    input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
                             return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    # greedy decoding: most likely token per frame, then decode to text
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
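
The `wer` call above comes from jiwer. As a rough illustration of what the metric measures, the sketch below computes word error rate for a single reference/hypothesis pair as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` is made up for this example. It skips the text normalization and corpus-level aggregation a real evaluation would use, and it returns a fraction, so multiply by 100 to compare with percentage figures.

```python
def word_error_rate(reference, hypothesis):
    """WER for one sentence pair: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```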

📚 Documentation

Evaluation Results

The Advanced Usage snippet shows how to evaluate asapp/sew-d-mid-k127-400k-ft-ls100h on LibriSpeech's "clean" test data; the "other" test data can be evaluated the same way by loading the "other" configuration instead.

Property	Details
Model Type	SEW-D-mid-k127
Training Data	LibriSpeech (clean and other)

Result (WER):

"clean"	"other"
4.99	10.95

📄 License

This project is licensed under the Apache-2.0 license.
