# OWSM-CTC Speech Foundation Model
OWSM-CTC is an encoder-only speech foundation model for multilingual speech recognition, any-to-any speech translation, and language identification.
## Quick Start
OWSM-CTC (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
This model is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the Open Whisper-style Speech Model (OWSM).
Due to time constraints, the model used in the paper was trained for 40 "epochs". A new model trained for 45 "epochs" (approximately three full passes over the entire data) has also been added to this repo to match the setup of the encoder-decoder OWSM; it performs better than the old one on many test sets.
## Installation
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are listed below; a typical installation command follows the list.

```
librosa
torch
espnet
espnet_model_zoo
```
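Assuming a standard pip-based environment (a platform-specific PyTorch install may be preferable; see the PyTorch documentation), the dependencies can typically be installed with:

```bash
pip install torch librosa espnet espnet_model_zoo
```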
The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
## Usage Examples
### Basic Usage
#### Example script for batched inference
`Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapping segments (same as the "long-form ASR/ST" method below).
```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attention is installed
    lang_sym='<eng>',
    task_sym='<asr>',
)

# Decode a single audio file (short-form or long-form)
res = s2t.batch_decode(
    "audio.wav",
    batch_size=16,
    context_len_in_secs=4,
)

# Decode a list of audio files in a batch
res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],
    batch_size=16,
    context_len_in_secs=4,
)
```
#### Example script for short-form ASR/ST/LID
Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.
```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# Load at 16kHz and pad/truncate to exactly 30s, as required by the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
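The same interface can be used for speech translation by changing the task symbol. Below is a hedged sketch: the exact task token (e.g. `<st_deu>` for translation into German) is an assumption here and should be verified against the model's token list.

```python
# Hypothetical ST configuration: English speech -> German text.
# `<st_deu>` is an assumed task token, not a verified value.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',      # language spoken in the input audio
    task_sym='<st_deu>',   # assumed token: translate into German
)

# `speech` is the 30s, 16kHz array prepared above
res = s2t_st(speech)[0]
print(res)
```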
#### Example script for long-form ASR/ST
```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4   # left and right context when combining outputs of overlapping chunks
batch_size = 32           # adjust according to GPU memory

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read("xxx.wav")

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
```
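Note that `soundfile` reads the audio at its native sample rate, while the model expects 16kHz input. A minimal resampling safeguard (assuming a mono waveform) could be added before decoding:

```python
import librosa

# Resample to the 16kHz expected by the model if necessary
if rate != 16000:
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000
```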
#### Example of CTC forced alignment using `ctc-segmentation`
CTC segmentation can be efficiently applied to audio of an arbitrary length.
```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download the model and unpack it to a local directory
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.1_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,          # batched parallel decoding; reduce this if GPU memory is limited
    kaldi_style_text=True,  # each line of `text` starts with an utterance name
    time_stamps="auto",
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read("./test_utils/ctc_align_test.wav")
print(f"speech duration: {len(speech) / rate : .2f} seconds")

text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
```
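The printed `segments` should be in a Kaldi-style format, with each line giving the utterance name, the estimated start and end times in seconds, a confidence score, and the aligned text; the exact layout is best confirmed against the `ctc-segmentation` documentation.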
## License
This project is licensed under the CC-BY-4.0 license.
## Documentation
### Information Table

| Property | Details |
|----------|---------|
| Tags | espnet, audio, automatic-speech-recognition, speech-translation, language-identification |
| Language | multilingual |
| Datasets | owsm_v3.1_ctc |
| Metrics | cer, bleu, accuracy |
| Library Name | espnet |
### Citations
#### OWSM-CTC

```bibtex
@inproceedings{owsm-ctc,
  title     = {{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification},
  author    = {Peng, Yifan and Sudo, Yui and Shakeel, Muhammad and Watanabe, Shinji},
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2024},
  month     = {8},
  url       = {https://aclanthology.org/2024.acl-long.549},
}
```
#### OWSM v3.1 and v3.2

```bibtex
@inproceedings{owsm-v32,
  title     = {On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author    = {Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year      = {2024},
  month     = {9},
  pdf       = {https://arxiv.org/pdf/2406.09282},
}

@inproceedings{owsm-v31,
  title     = {{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author    = {Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year      = {2024},
  month     = {9},
  pdf       = {https://arxiv.org/pdf/2401.16658},
}
```
#### Initial OWSM (v1, v2, v3)

```bibtex
@inproceedings{owsm,
  title     = {Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author    = {Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle = {Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year      = {2023},
  month     = {12},
  pdf       = {https://arxiv.org/pdf/2309.13876},
}
```