SEW-D-tiny-100k-ft-ls100h Open-Source Speech Recognition Model - Precise Speech Recognition Balancing Performance and Efficiency

Sew D Tiny 100k Ft Ls100h

Developed by asapp

SEW-D-tiny is an efficient speech recognition pre-trained model developed by ASAPP Research, focusing on the balance between performance and efficiency.

Speech Recognition

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Efficient Speech Recognition #Lightweight Model #Low-resource Fine-tuning

Downloads 24.55k

Release Time: 3/2/2022

Model Overview

This model is pre-trained on 16kHz sampled speech audio and is suitable for downstream tasks such as automatic speech recognition, speaker recognition, and intent classification.

Model Features

Efficient Inference

Achieves 1.9x inference speedup compared to wav2vec 2.0.

Performance Improvement

Reduces word error rate by 13.5% relative to wav2vec 2.0 in the 100h-960h semi-supervised setup on LibriSpeech.

Lightweight

The model design emphasizes efficiency, making it suitable for resource-constrained environments.

Model Capabilities

Speech Recognition

Speaker Recognition

Intent Classification

Emotion Recognition

Use Cases

Speech-to-Text

LibriSpeech Transcription

Convert speech in the LibriSpeech dataset to text.

Achieves a WER of 10.47 on the LibriSpeech clean test set and 22.73 on the other test set.

🚀 SEW-D-tiny

SEW-D-tiny is a base model pretrained on 16kHz sampled speech audio, suitable for downstream tasks like Automatic Speech Recognition, Speaker Identification, etc.

🚀 Quick Start

The base model is pretrained on 16 kHz sampled speech audio. When using the model, ensure that your speech input is also sampled at 16 kHz. Note that this model should be fine-tuned on a downstream task, such as Automatic Speech Recognition, Speaker Identification, Intent Classification, or Emotion Recognition.
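Audio recorded at another rate has to be resampled to 16 kHz before it reaches the processor. The sketch below shows one minimal way to plan that step using only the standard library. `resample_plan` and `resampled_length` are hypothetical helpers, not part of transformers; they only compute the integer up/down factors that a polyphase resampler such as `scipy.signal.resample_poly(audio, up, down)` would take.

```python
from math import gcd

TARGET_RATE = 16_000  # sampling rate the model expects, per the model card


def resample_plan(source_rate: int, target_rate: int = TARGET_RATE) -> tuple[int, int]:
    """Return (up, down) integer factors that convert source_rate to target_rate.

    Hypothetical helper for illustration only.
    """
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    common = gcd(source_rate, target_rate)
    return target_rate // common, source_rate // common


def resampled_length(num_samples: int, source_rate: int,
                     target_rate: int = TARGET_RATE) -> int:
    """Number of samples after resampling, rounded up to a whole sample."""
    up, down = resample_plan(source_rate, target_rate)
    return -(-num_samples * up // down)  # ceiling division


# One second of 44.1 kHz audio becomes one second of 16 kHz audio.
print(resample_plan(44_100))
print(resampled_length(44_100, 44_100))
```

If `resample_plan` returns `(1, 1)`, the audio is already at the expected rate and can be passed to the processor unchanged.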

The original model can be found under SEW-D by ASAPP Research.

Paper: Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

Abstract: This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
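To make the two headline numbers in the abstract concrete, the sketch below works through the arithmetic of a relative WER reduction and an inference speedup. The baseline WER and baseline latency are hypothetical placeholders chosen as round numbers, not measurements from the paper; only the 13.5% and 1.9x figures come from the abstract.

```python
def apply_relative_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """WER after a relative reduction (0.135 means 13.5% relative)."""
    return baseline_wer * (1.0 - relative_reduction)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def time_after_speedup(baseline_seconds: float, speedup: float) -> float:
    """Inference time after an N-times speedup."""
    return baseline_seconds / speedup


# Hypothetical baseline values, for illustration only.
baseline_wer = 10.0      # percent
baseline_seconds = 1.9   # seconds per batch

new_wer = apply_relative_reduction(baseline_wer, 0.135)
print(f"WER: {baseline_wer:.2f} -> {new_wer:.2f}")
print(f"Time: {baseline_seconds:.2f}s -> {time_after_speedup(baseline_seconds, 1.9):.2f}s")
```

A relative reduction scales the baseline error rate, so it is not the same as subtracting 13.5 percentage points from it.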

💻 Usage Examples

Basic Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor
import torch

# load the model and preprocessor
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h")

# load the dummy dataset with speech samples
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# preprocess (the audio must be sampled at 16 kHz)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
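
The "take argmax and decode" step is greedy CTC decoding: pick the most likely token per frame, collapse consecutive repeats, drop the blank token, and join what remains. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id, and the `|` word delimiter are assumptions for illustration; the real vocabulary ships with the processor, and `greedy_ctc_decode` is a hypothetical helper, not the library's implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one is loaded with the processor.
VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
BLANK_ID = 0          # assumed CTC blank (the padding token)
WORD_DELIMITER = "|"  # assumed word-boundary token


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated ids, drop blanks, and map ids to text."""
    # 1. Merge runs of identical ids: [2, 2, 0, 3] -> [2, 0, 3]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove the blank token, which only separates repeated characters
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Join characters and turn word delimiters into spaces
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one utterance.
frames = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1]
print(greedy_ctc_decode(frames))  # prints "HELLO"
```

The blank between the two runs of `4` is what lets the double "L" survive the collapse step; without it the repeated frames would merge into a single character.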

Advanced Usage

This code snippet shows how to evaluate asapp/sew-d-tiny-100k-ft-ls100h on LibriSpeech's "clean" test data. To evaluate on the "other" test data, pass "other" instead of "clean" as the dataset configuration.

from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# load the LibriSpeech test split ("clean" configuration)
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

# load the model onto the GPU and the matching preprocessor
model = SEWDForCTC.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-tiny-100k-ft-ls100h")

def map_to_pred(batch):
    # batches have size 1, so take the single audio sample
    input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
                             return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    # greedy decoding: most likely token per frame, then CTC decode
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))

Result (WER):

"clean"	"other"
10.47	22.73
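
Word error rate is the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal single-sentence illustration of that definition, not the jiwer implementation; `word_error_rate` is a hypothetical helper. It returns a fraction, whereas the values in the table above appear to be percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors while the denominator is fixed to the reference length, WER can exceed 1.0 for very poor hypotheses.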

📄 License

This project is licensed under the Apache 2.0 license.

📚 Documentation

Model Information

Property	Details
Model Type	SEW-D-tiny
Training Data	librispeech_asr
Tags	audio, speech, automatic-speech-recognition, hf-asr-leaderboard

Widget Examples

Librispeech sample 1: [Audio](https://cdn-media.huggingface.co/speech_samples/sample1.flac)
Librispeech sample 2: [Audio](https://cdn-media.huggingface.co/speech_samples/sample2.flac)

Model Index

Model Name: sew-d-tiny-100k-ft-ls100h
- Task: Automatic Speech Recognition
- Dataset:
  - LibriSpeech (clean): Test WER = 10.47
  - LibriSpeech (other): Test WER = 22.73
