SEW-D-base+
A pre-trained speech model with high performance and efficiency for various speech tasks.
SEW-D-base+ is a base model pretrained on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz. Note that this model should be fine-tuned on a downstream task, such as Automatic Speech Recognition, Speaker Identification, Intent Classification, Emotion Recognition, etc.
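If your recordings are stored at a different sampling rate, resample them to 16kHz before passing them to the processor. Below is a minimal sketch using torchaudio; the file path is a placeholder, not a file shipped with this model.
import torchaudio

# Placeholder path; replace with your own recording.
waveform, sample_rate = torchaudio.load("your_audio.wav")

# SEW-D-base+ expects 16 kHz input, so resample if necessary.
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)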
The original SEW-D model was developed by ASAPP Research and can be found at SEW-D by ASAPP Research.
Paper: Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition
Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi
Abstract
This paper studies the performance-efficiency trade-offs in pre-trained models for automatic speech recognition (ASR). It focuses on wav2vec 2.0 and formalizes several architecture designs that influence both the model performance and its efficiency. By integrating all the observations, it introduces SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements in both performance and efficiency across various training setups. For example, under the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces the word error rate by 25-50% across different model sizes.
The original model checkpoints can be found at https://github.com/asappresearch/sew#model-checkpoints.
Quick Start
To transcribe audio files, the model can be used as a standalone acoustic model.
Usage Examples
Basic Usage
from transformers import Wav2Vec2Processor, SEWDForCTC
from datasets import load_dataset
import torch

# Load the processor (feature extractor + tokenizer) and the fine-tuned model.
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")

# Load a small LibriSpeech validation set for a quick test.
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# Tokenize the 16 kHz waveform of the first sample (batch size 1).
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt").input_values

# Greedy CTC decoding: take the argmax over the vocabulary at every frame.
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
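Transcribing a local recording works the same way; a minimal sketch with soundfile is shown below, assuming a 16 kHz mono WAV file at a placeholder path.
import soundfile as sf
import torch
from transformers import Wav2Vec2Processor, SEWDForCTC

processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")

# Placeholder path; the recording is assumed to be 16 kHz mono.
speech, sample_rate = sf.read("your_audio.wav")

input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])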
Advanced Usage
from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# Evaluate on the full LibriSpeech test-clean split.
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = SEWDForCTC.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-base-plus-400k-ft-ls100h")

def map_to_pred(batch):
    # Each batch contains a single example (batch_size=1 below).
    input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
                             return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

# Compare predictions against the reference transcripts.
print("WER:", wer(result["text"], result["transcription"]))
Documentation
Evaluation Results
| Property | Details |
|----------|---------|
| Model Type | SEW-D-base+ |
| Training Data | librispeech_asr |
| Test WER (LibriSpeech "clean") | 4.34 |
| Test WER (LibriSpeech "other") | 9.45 |
Model Index
- Name: sew-d-base-plus-400k-ft-ls100h
- Results:
  - Task: Automatic Speech Recognition
    - Dataset: LibriSpeech (clean)
    - Metrics: Test WER = 4.34
  - Task: Automatic Speech Recognition
    - Dataset: LibriSpeech (other)
    - Metrics: Test WER = 9.45
License
This project is licensed under the Apache-2.0 license.