Sew-tiny-100k-ft-ls100h Open-source Speech Recognition Model - More Efficient and Accurate than Wav2vec 2.0

Sew Tiny 100k Ft Ls100h

Developed by asapp

SEW (Squeezed and Efficient Wav2vec) is a pre-trained speech recognition model architecture developed by ASAPP Research that improves on wav2vec 2.0 in both accuracy and inference efficiency.

Speech Recognition

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Efficient Speech Recognition #Low-resource Fine-tuning #16kHz Audio Processing

Downloads 736

Release Time: 3/2/2022

Model Overview

A speech recognition model pre-trained on 16kHz sampled audio, requiring fine-tuning for downstream tasks.

Model Features

Efficient Performance

Under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0 with a 13.5% relative reduction in word error rate.

Compressed Architecture

Optimized model architecture reduces computational resource requirements while maintaining performance.

Multi-task Adaptation

Can be fine-tuned for various speech tasks such as ASR, speaker recognition, and intent classification.

Model Capabilities

Speech Recognition

Speech-to-Text

Audio Feature Extraction

Use Cases

Speech Transcription

LibriSpeech Transcription

Transcribing English audiobook content into text.

Achieves a WER of 10.61 on the LibriSpeech "clean" test set and 23.74 on the "other" test set.

Speech Application Development

Voice Assistant

Serving as the speech recognition component for voice assistants.

🚀 SEW-tiny

The SEW-tiny model is a pre-trained speech model by ASAPP Research, suitable for various speech-related downstream tasks.

🚀 Quick Start

The base model is pretrained on speech audio sampled at 16 kHz. When using the model, make sure your speech input is also sampled at 16 kHz. Note that this model should be fine-tuned on a downstream task such as Automatic Speech Recognition, Speaker Identification, Intent Classification, or Emotion Recognition.
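
Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch using only Python's standard `wave` module; the helper names `wav_sample_rate` and `check_sample_rate` are hypothetical and not part of the model's tooling, and the check covers WAV files only.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate stated in the model card


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def check_sample_rate(path, expected=EXPECTED_RATE):
    """Raise ValueError if the WAV file is not sampled at the expected rate."""
    rate = wav_sample_rate(path)
    if rate != expected:
        raise ValueError(
            f"{path} is sampled at {rate} Hz; resample it to {expected} Hz first"
        )
    return rate
```

This sketch only detects a mismatch. Resampling itself is left to an audio library such as librosa or torchaudio.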

Paper: Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi

Abstract: This paper is a study of performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25-50% across different model sizes.
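
The abstract reports a relative reduction in word error rate, which differs from an absolute drop in percentage points. The sketch below shows the arithmetic; the baseline value of 10.0 is a hypothetical number chosen for illustration and is not a result from the paper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical baseline WER (in percent), for illustration only.
baseline = 10.0
# A 13.5% relative reduction scales the baseline by (1 - 0.135).
improved = baseline * (1 - 0.135)

print(f"improved WER: {improved:.2f}")
print(f"relative reduction: {relative_reduction(baseline, improved):.3f}")
```

With these illustrative numbers the absolute drop is 1.35 percentage points, while the relative reduction is 13.5%.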

The original model can be found under https://github.com/asappresearch/sew#model-checkpoints.

✨ Features

Audio-related Tasks: Suitable for multiple audio tasks such as Automatic Speech Recognition, Speaker Identification, Intent Classification, and Emotion Recognition.
High Efficiency: Based on the SEW architecture, it offers significant improvements in both performance and efficiency compared to wav2vec 2.0.

📦 Installation

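The original model card lists no installation steps. The usage examples below import transformers, datasets, soundfile, torch, and jiwer, so those packages need to be available.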

💻 Usage Examples

Basic Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

from transformers import Wav2Vec2Processor, SEWForCTC
from datasets import load_dataset
import soundfile as sf
import torch
 
# load the model and preprocessor
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-tiny-100k-ft-ls100h")
model = SEWForCTC.from_pretrained("asapp/sew-tiny-100k-ft-ls100h")

# load the dummy dataset with speech samples
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
 
# preprocess
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits
 
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
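
The last step above takes the most likely token per audio frame and decodes the result into text. As a rough illustration of what greedy CTC decoding involves, here is a minimal pure-Python sketch. It assumes the usual CTC conventions (a blank symbol and a word-delimiter token); the function name `ctc_greedy_decode` and the toy vocabulary are hypothetical, and the real checkpoint ships its own vocabulary and tokenizer.

```python
def ctc_greedy_decode(ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join the tokens into text."""
    tokens = []
    previous = None
    for current in ids:
        # Emit a token only when the id changes and is not the blank symbol.
        if current != previous and current != blank_id:
            tokens.append(id_to_token[current])
        previous = current
    # The word-delimiter token marks the spaces between words.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}

# Per-frame argmax ids: the blank (0) between the two runs of 4 keeps "LL".
frame_ids = [2, 2, 3, 0, 4, 4, 0, 4, 5, 5]
print(ctc_greedy_decode(frame_ids, TOY_VOCAB))  # HELLO
```

The blank symbol is what allows a genuinely doubled letter to survive the collapse of repeated frames.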

Advanced Usage

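This code snippet shows how to evaluate asapp/sew-tiny-100k-ft-ls100h on LibriSpeech's "clean" and "other" test data. As written, it loads the "clean" configuration; pass "other" to load_dataset instead to evaluate the second test set.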

from datasets import load_dataset
from transformers import SEWForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = SEWForCTC.from_pretrained("asapp/sew-tiny-100k-ft-ls100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-tiny-100k-ft-ls100h")

def map_to_pred(batch):
    input_values = processor(batch["audio"][0]["array"], sampling_rate=16000, 
                             return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))

Result (WER):

"clean"	"other"
10.61	23.74
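
For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below illustrates that definition for a single sentence pair; the function name `word_error_rate` is hypothetical, and this is not the jiwer implementation, which also aggregates over whole lists of sentences.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

This function returns a fraction rather than a percentage. The figures in the table above appear to be percentages, which would correspond to fractions of about 0.1061 and 0.2374.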

📚 Documentation

Model Information

Property	Details
Model Type	SEW-tiny
Training Data	librispeech_asr

Widget Examples

Librispeech sample 1: [Audio Link](https://cdn-media.huggingface.co/speech_samples/sample1.flac)
Librispeech sample 2: [Audio Link](https://cdn-media.huggingface.co/speech_samples/sample2.flac)

Model Index

Name: sew-tiny-100k-ft-ls100h
Results:
- Task: Automatic Speech Recognition
  - Dataset: LibriSpeech (clean)
    - Metrics: Test WER = 10.61
  - Dataset: LibriSpeech (other)
    - Metrics: Test WER = 23.74

📄 License

This project is licensed under the Apache-2.0 license.
