SEW-D-base+
A pre-trained speech model with high performance and efficiency for various speech tasks.
SEW-D-base+ is a base model pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. Note that this model should be fine-tuned on a downstream task, such as Automatic Speech Recognition, Speaker Identification, Intent Classification, Emotion Recognition, etc.
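If your recordings are stored at a different sampling rate, resample them to 16kHz before passing them to the processor. Below is a minimal sketch using torchaudio; the file path is a placeholder, not a file shipped with this model.
import torchaudio

# Placeholder path; replace with your own recording.
waveform, sample_rate = torchaudio.load("your_audio.wav")

# SEW-D-base+ expects 16 kHz input, so resample if necessary.
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)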
The original SEW-D model was developed by ASAPP Research and can be found at SEW-D by ASAPP Research.
Paper: Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition
Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi
Abstract
This paper studies the performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). It focuses on wav2vec 2.0 and formalizes several architecture designs that influence both the model performance and its efficiency. By integrating all the observations, it introduces SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements in both performance and efficiency across various training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces the word error rate by 25-50% across different model sizes.
The original model checkpoints can be found at https://github.com/asappresearch/sew#model-checkpoints.
Quick Start
To transcribe audio files, the model can be used as a standalone acoustic model.
Usage Examples
Basic Usage
from transformers import Wav2Vec2Processor, SEWDForCTC
from datasets import load_dataset
import torch

# Load the processor (feature extractor + tokenizer) and the fine-tuned model.
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")

# Load a small LibriSpeech validation set for a quick test.
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Tokenize the 16 kHz waveform of the first sample (batch size 1).
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values

# Greedy CTC decoding: take the argmax over the vocabulary at every frame.
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
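Transcribing a local recording works the same way; a minimal sketch with soundfile is shown below, assuming a 16 kHz mono WAV file at a placeholder path.
import soundfile as sf
import torch
from transformers import Wav2Vec2Processor, SEWDForCTC

processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")

# Placeholder path; the recording is assumed to be 16 kHz mono.
speech, sample_rate = sf.read("your_audio.wav")

input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])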
Advanced Usage
from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# Evaluate on the full LibriSpeech test-clean split.
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = SEWDForCTC.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")

def map_to_pred(batch):
    # Each batch contains a single example (batch_size=1 below).
    input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
                             return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

# Compare predictions against the reference transcripts.
print("WER:", wer(result["text"], result["transcription"]))
Documentation
Evaluation Results
| Property | Details |
|----------|---------|
| Model Type | SEW-D-base+ |
| Training Data | librispeech_asr |
| Test WER (LibriSpeech "clean") | 4.34 |
| Test WER (LibriSpeech "other") | 9.45 |
Model Index
- Name: sew-d-base-plus-400k-ft-ls100h
- Results:
  - Task: Automatic Speech Recognition
    - Dataset: LibriSpeech (clean)
    - Metrics: Test WER = 4.34
  - Task: Automatic Speech Recognition
    - Dataset: LibriSpeech (other)
    - Metrics: Test WER = 9.45
License
This project is licensed under the Apache-2.0 license.