Sew-d-mid-400k-ft-ls100h Open-Source Speech Pretrained Model - Efficiently Complete Automatic Speech Recognition Tasks

Sew D Mid 400k Ft Ls100h

Developed by asapp

SEW-D-mid is a speech pre-training model developed by ASAPP Research, focusing on automatic speech recognition tasks, achieving a good balance between performance and efficiency.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Efficient Speech Recognition #Low Word Error Rate #16kHz Audio Processing

Downloads 20

Release Time: 3/2/2022

Model Overview

This model is a speech pre-training model based on the SEW architecture, pre-trained on speech audio sampled at 16 kHz. It is suitable for downstream tasks such as automatic speech recognition, speaker recognition, and intent classification.

Model Features

Efficiency-Performance Balance

Under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup over wav2vec 2.0 together with a 13.5% relative reduction in word error rate

Multi-task Applicability

Can be fine-tuned for various speech-related downstream tasks, including ASR, speaker recognition, intent classification, etc.

Optimized Architecture Design

Adopts the SEW (Squeezed and Efficient Wav2vec) architecture, which combines several architecture design changes to wav2vec 2.0 that affect both performance and efficiency

Model Capabilities

Speech Recognition

Speech Feature Extraction

Audio Content Understanding

Use Cases

Speech Transcription

Meeting Minutes Transcription

Automatically transcribe meeting recordings into text records

WER of 4.94 on the LibriSpeech clean test set

Voice Command Recognition

Recognize and understand voice commands

Speech Analysis

Speaker Recognition

Identify speaker characteristics in speech

🚀 SEW-D-mid

SEW-D-mid is a base model pretrained on speech audio sampled at 16 kHz, suitable for fine-tuning on various downstream speech tasks.

📚 Documentation

SEW-D by ASAPP Research
The base model is pretrained on speech audio sampled at 16 kHz. When using the model, ensure that your speech input is also sampled at 16 kHz. Note that this model should be fine-tuned on a downstream task, such as Automatic Speech Recognition, Speaker Identification, Intent Classification, or Emotion Recognition.
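The 16 kHz input requirement can be illustrated with a minimal resampling sketch. The helper name `resample_linear` is hypothetical and not part of the model card. It uses plain linear interpolation with no anti-aliasing filter, so it only shows the idea; in practice a dedicated audio library's resampler would be the better choice.

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a mono waveform by linear interpolation (naive sketch).

    Assumption: `samples` is a sequence of numbers for a single channel.
    No low-pass filtering is applied, so downsampling may alias.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)

    # Number of output samples covering the same duration.
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1

    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # left neighbour in the source signal
        hi = min(lo + 1, last)     # right neighbour, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Example: a short 32 kHz ramp reduced to 16 kHz keeps every second sample.
ramp_32k = [0, 1, 2, 3, 4, 5, 6, 7]
ramp_16k = resample_linear(ramp_32k, src_rate=32000)
print(len(ramp_32k), "->", len(ramp_16k), "samples")
```

The resampled list can then be passed to the processor with `sampling_rate=16000`, as in the usage examples in this document.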
Paper: Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition
Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

Abstract: This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
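The abstract reports a relative reduction in word error rate, which is easy to confuse with an absolute one. The sketch below shows the arithmetic. The function name and the baseline and improved WER values are purely illustrative assumptions and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen only to illustrate the formula:
# a baseline WER of 10.0% improved to 8.65% is a 1.35-point absolute
# reduction, which corresponds to a relative reduction of about 13.5%.
baseline, improved = 10.0, 8.65
reduction = relative_wer_reduction(baseline, improved)
print(f"absolute: {baseline - improved:.2f} points, relative: {reduction:.1%}")
```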

The original model can be found under https://github.com/asappresearch/sew#model-checkpoints.

📦 Installation

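The original model card provides no installation steps. The usage examples in this document import the Python packages transformers, datasets, soundfile, torch, and jiwer, so these need to be available in your environment.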

💻 Usage Examples

Basic Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

from transformers import Wav2Vec2Processor, SEWDForCTC
from datasets import load_dataset
import soundfile as sf
import torch

# load the model and preprocessor
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-mid-400k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-mid-400k-ft-ls100h")

# load the dummy dataset with speech samples
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# preprocess; the audio must be sampled at 16 kHz
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values  # Batch size 1

# retrieve logits without tracking gradients (inference only)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax over the vocabulary and decode to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
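The final "argmax and decode" step can be sketched in plain Python to show what greedy CTC decoding does conceptually: collapse consecutive repeated ids, drop the blank token, and join the remaining characters. The function name, the toy vocabulary, and the choice of id 0 as blank and "|" as word delimiter are assumptions made for illustration; the real mapping comes from the checkpoint's tokenizer, and `processor.batch_decode` remains the call to use in practice.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Greedy CTC decoding sketch: collapse repeats, then remove blanks."""
    collapsed = []
    previous = None
    for idx in frame_ids:
        # Keep an id only when it differs from the previous frame.
        if idx != previous:
            collapsed.append(idx)
        previous = idx

    # Blanks separate genuine repeated characters and carry no text.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (hypothetical, not the model's real vocabulary).
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "I"}

# Per-frame argmax ids, as would come from torch.argmax(logits, dim=-1).
frames = [2, 2, 0, 3, 3, 1, 1, 0, 2, 0, 3]
print(ctc_greedy_decode(frames, toy_vocab))
```

Note that a blank between two identical ids (for example `[2, 0, 2]`) preserves both characters, which is how CTC represents doubled letters.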

Advanced Usage

This code snippet shows how to evaluate asapp/sew-d-mid-400k-ft-ls100h on LibriSpeech's "clean" and "other" test data.

from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# load the "clean" test set; pass "other" instead of "clean" to evaluate on the "other" test set
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = SEWDForCTC.from_pretrained("asapp/sew-d-mid-400k-ft-ls100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-mid-400k-ft-ls100h")

def map_to_pred(batch):
    # with batched=True and batch_size=1, each batch holds a single audio example
    input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
                             return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    # greedy decoding: pick the most likely token per frame, then decode to text
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

# jiwer returns WER as a fraction; multiply by 100 to compare with the percentages reported in the results
print("WER:", wer(result["text"], result["transcription"]))
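For readers unfamiliar with the metric, the sketch below computes word error rate for a single reference and hypothesis pair using word-level edit distance. It is an illustrative stand-in, not the `jiwer` implementation: the function name is hypothetical, and it does no text normalization and no aggregation over a whole corpus.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six
# reference words gives 2 / 6.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.4f} ({score * 100:.2f}%)")
```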

📊 Results

Property	Details
Model Type	SEW-D-mid
Training Data	LibriSpeech (librispeech_asr)

Result (WER):

"clean"	"other"
4.94	11.51

📄 License

The model is licensed under the Apache-2.0 license.
