# OWSM-CTC Speech Foundation Model
OWSM-CTC is an encoder-only speech foundation model for multilingual speech recognition, any-to-any speech translation, and language identification.
## Quick Start
OWSM-CTC (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
This model is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the Open Whisper-style Speech Model (OWSM).
Due to time constraints, the model used in the paper was trained for 40 "epochs". A new model trained for 45 "epochs" (approximately three full passes over the entire data) has also been added to this repo to match the setup of the encoder-decoder OWSM; it performs better than the old one on many test sets.
## Installation
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are listed below; a typical installation command follows the list.

```
librosa
torch
espnet
espnet_model_zoo
```
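Assuming a standard pip-based environment (a platform-specific PyTorch install may be preferable; see the PyTorch documentation), the dependencies can typically be installed with:

```bash
pip install torch librosa espnet espnet_model_zoo
```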
The recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
## Usage Examples
### Basic Usage
#### Example script for batched inference
`Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapping segments (same as the "long-form ASR/ST" method below).
```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attention is installed
    lang_sym='<eng>',
    task_sym='<asr>',
)

# Decode a single audio file (short-form or long-form)
res = s2t.batch_decode(
    "audio.wav",
    batch_size=16,
    context_len_in_secs=4,
)

# Decode a list of audio files in a batch
res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],
    batch_size=16,
    context_len_in_secs=4,
)
```
#### Example script for short-form ASR/ST/LID
Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.
```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# Load at 16kHz and pad/truncate to exactly 30s, as required by the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
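The same interface can be used for speech translation by changing the task symbol. Below is a hedged sketch: the exact task token (e.g. `<st_deu>` for translation into German) is an assumption here and should be verified against the model's token list.

```python
# Hypothetical ST configuration: English speech -> German text.
# `<st_deu>` is an assumed task token, not a verified value.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',      # language spoken in the input audio
    task_sym='<st_deu>',   # assumed token: translate into German
)

# `speech` is the 30s, 16kHz array prepared above
res = s2t_st(speech)[0]
print(res)
```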
#### Example script for long-form ASR/ST
```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4   # left and right context when combining outputs of overlapping chunks
batch_size = 32           # adjust according to GPU memory

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read("xxx.wav")

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
```
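Note that `soundfile` reads the audio at its native sample rate, while the model expects 16kHz input. A minimal resampling safeguard (assuming a mono waveform) could be added before decoding:

```python
import librosa

# Resample to the 16kHz expected by the model if necessary
if rate != 16000:
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000
```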
#### Example of CTC forced alignment using `ctc-segmentation`
CTC segmentation can be efficiently applied to audio of an arbitrary length.
```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download the model and unpack it to a local directory
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.1_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,          # batched parallel decoding; reduce this if GPU memory is limited
    kaldi_style_text=True,  # each line of `text` starts with an utterance name
    time_stamps="auto",
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read("./test_utils/ctc_align_test.wav")
print(f"speech duration: {len(speech) / rate : .2f} seconds")

text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
```
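The printed `segments` should be in a Kaldi-style format, with each line giving the utterance name, the estimated start and end times in seconds, a confidence score, and the aligned text; the exact layout is best confirmed against the `ctc-segmentation` documentation.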
## License
This project is licensed under the CC-BY-4.0 license.
## Documentation
### Information Table

| Property | Details |
|----------|---------|
| Tags | espnet, audio, automatic-speech-recognition, speech-translation, language-identification |
| Language | multilingual |
| Datasets | owsm_v3.1_ctc |
| Metrics | cer, bleu, accuracy |
| Library Name | espnet |
### Citations
#### OWSM-CTC

```bibtex
@inproceedings{owsm-ctc,
  title     = {{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification},
  author    = {Peng, Yifan and Sudo, Yui and Shakeel, Muhammad and Watanabe, Shinji},
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2024},
  month     = {8},
  url       = {https://aclanthology.org/2024.acl-long.549},
}
```
#### OWSM v3.1 and v3.2

```bibtex
@inproceedings{owsm-v32,
  title     = {On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author    = {Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year      = {2024},
  month     = {9},
  pdf       = {https://arxiv.org/pdf/2406.09282},
}

@inproceedings{owsm-v31,
  title     = {{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author    = {Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year      = {2024},
  month     = {9},
  pdf       = {https://arxiv.org/pdf/2401.16658},
}
```
#### Initial OWSM (v1, v2, v3)

```bibtex
@inproceedings{owsm,
  title     = {Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author    = {Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle = {Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year      = {2023},
  month     = {12},
  pdf       = {https://arxiv.org/pdf/2309.13876},
}
```