đ SEW-D-mid-k127
SEW-D by ASAPP Research is a base model pretrained on 16kHz sampled speech audio, suitable for various speech-related downstream tasks.
The base model is pretrained on 16kHz sampled speech audio. When using the model, ensure that your speech input is also sampled at 16Khz. Note that this model should be fine - tuned on a downstream task, like Automatic Speech Recognition, Speaker Identification, Intent Classification, Emotion Recognition, etc...
Paper: Performance - Efficiency Trade - offs in Unsupervised Pre - training for Speech Recognition
Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav Artzi
Abstract
This paper is a study of performance - efficiency trade - offs in pre - trained models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency. Putting together all our observations, we introduce SEW (Squeezed and Efficient Wav2vec), a pre - trained model architecture with significant improvements along both performance and efficiency dimensions across a variety of training setups. For example, under the 100h - 960h semi - supervised setup on LibriSpeech, SEW achieves a 1.9x inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in word error rate. With a similar inference time, SEW reduces word error rate by 25 - 50% across different model sizes.
The original model can be found under https://github.com/asappresearch/sew#model - checkpoints.
đ Quick Start
The model can be used as a standalone acoustic model for transcribing audio files.
đģ Usage Examples
Basic Usage
from transformers import Wav2Vec2Processor, SEWDForCTC
from datasets import load_dataset
import soundfile as sf
import torch
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-mid-k127-400k-ft-ls100h")
model = SEWDForCTC.from_pretrained("asapp/sew-d-mid-k127-400k-ft-ls100h")
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt").input_values
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
Advanced Usage
from datasets import load_dataset
from transformers import SEWDForCTC, Wav2Vec2Processor
import torch
from jiwer import wer
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
model = SEWDForCTC.from_pretrained("asapp/sew-d-mid-k127-400k-ft-ls100h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("asapp/sew-d-mid-k127-400k-ft-ls100h")
def map_to_pred(batch):
input_values = processor(batch["audio"][0]["array"], sampling_rate=16000,
return_tensors="pt", padding="longest").input_values
with torch.no_grad():
logits = model(input_values.to("cuda")).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
batch["transcription"] = transcription
return batch
result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
đ Documentation
Evaluation Results
This code snippet shows how to evaluate asapp/sew - d - mid - k127 - 400k - ft - ls100hh on LibriSpeech's "clean" and "other" test data.
Property |
Details |
Model Type |
SEW-D-mid-k127 |
Training Data |
LibriSpeech (clean and other) |
Result (WER):
"clean" |
"other" |
4.99 |
10.95 |
đ License
This project is licensed under the Apache - 2.0 license.