owsm_v3.1_ebf Open-source Speech Model - Free Support for Multilingual Speech Recognition and Translation

OWSM v3.1 EBF

Developed by espnet

OWSM is an open-source Whisper-style speech model developed based on publicly available data and the ESPnet toolkit, supporting multilingual speech recognition, translation, and other tasks.

Speech Recognition Other #Multilingual Speech-to-Text #Open-Source Speech Foundation Model #E-Branchformer Encoder

Downloads 291

Release Date: 12/22/2023

Model Overview

OWSM aims to develop fully open speech foundation models using publicly available data and open-source toolkits. It supports speech recognition, any-to-any-language speech translation, utterance-level alignment, long-form transcription, and language identification.

Model Features

Open-Source Speech Foundation Model

Developed entirely using publicly available data and open-source toolkits, ensuring transparency and reproducibility.

Improved Speech Encoder

Uses an E-Branchformer speech encoder and significantly outperforms OWSM v3 on almost all evaluation benchmarks.

Multi-Task Support

A single model handles speech recognition, speech translation, utterance-level alignment, long-form transcription, and language identification.

Large-Scale Training Data

Trained on 180,000 hours of publicly available speech data, covering multiple languages and scenarios.

Model Capabilities

Speech Recognition

Cross-Language Speech Translation

Utterance-Level Alignment

Long-Form Transcription

Language Identification

Use Cases

Speech-to-Text

Multilingual Speech Recognition

Convert speech in multiple languages into corresponding text

Supports high-quality multilingual transcription

Speech Translation

Directly translate speech from one language into text in another language

Enables real-time cross-language translation

Speech Analysis

Language Identification

Automatically identify the language spoken in the audio

Accurately identifies multiple languages

Speech Alignment

Align speech with text temporally

Generates precise speech-text alignment information

🚀 OWSM: Open Whisper-style Speech Model

OWSM aims to develop fully open speech foundation models using publicly available data and open-source toolkits, including ESPnet. It provides solutions for various speech-related tasks, offering high-performance and accessible speech processing capabilities.

Inference examples can be found on our project page. Our demo is available here.

OWSM v3.1 is an improved version of OWSM v3. It significantly outperforms OWSM v3 on almost all evaluation benchmarks. We do not include any new training data. Instead, we use a state-of-the-art speech encoder, E-Branchformer.

The model in this repo has 1.02B parameters in total and is trained on 180k hours of public speech data. Specifically, it supports the following speech-to-text tasks:

Speech recognition
Any-to-any-language speech translation
Utterance-level alignment
Long-form transcription
Language identification
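
As a rough illustration of how a single model can serve several of these tasks, the sketch below picks the task through Whisper-style language and task symbols and passes them to ESPnet's `Speech2Text` interface. The model tag `espnet/owsm_v3.1_ebf`, the symbol formats (`<eng>`, `<asr>`, `<st_deu>`), the decoding settings, and the helper names `make_request`, `load_owsm`, and `transcribe` are assumptions made for illustration. The inference examples on the project page remain the authoritative reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecodeRequest:
    """Symbols that tell the model which task to run (hypothetical helper)."""
    lang_sym: str               # language of the input speech, e.g. "<eng>"
    task_sym: str               # "<asr>" or "<st_xxx>" (assumed symbol format)
    predict_time: bool = False  # whether to request timestamp prediction


def make_request(source_lang: str,
                 target_lang: Optional[str] = None,
                 timestamps: bool = False) -> DecodeRequest:
    """Map a (source, target) language pair to language and task symbols."""
    if target_lang is None or target_lang == source_lang:
        task_sym = "<asr>"                 # same language: recognition
    else:
        task_sym = f"<st_{target_lang}>"   # different language: translation
    return DecodeRequest(f"<{source_lang}>", task_sym, timestamps)


def load_owsm(model_tag: str = "espnet/owsm_v3.1_ebf", device: str = "cpu"):
    """Load the model through ESPnet (model tag and settings are assumptions)."""
    # Imported lazily so make_request() works without ESPnet installed.
    from espnet2.bin.s2t_inference import Speech2Text
    return Speech2Text.from_pretrained(
        model_tag=model_tag,
        device=device,
        beam_size=5,
        ctc_weight=0.0,
        maxlenratio=0.0,
    )


def transcribe(s2t, speech, request: DecodeRequest) -> str:
    """Run one utterance; `speech` is assumed to be a 1-D 16 kHz waveform."""
    results = s2t(
        speech,
        lang_sym=request.lang_sym,
        task_sym=request.task_sym,
        predict_time=request.predict_time,
    )
    return results[0][0]  # text of the best hypothesis
```

With these helpers, `make_request("eng")` selects English recognition and `make_request("eng", "deu")` selects English-to-German translation. The same loaded model is reused for both, which is the point of the multi-task design.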

📚 Documentation

Model Information

Property	Details
Model Type	Open Whisper-style Speech Model
Training Data	180k hours of public speech data from owsm_v3.1
License	cc-by-4.0
Tags	espnet, audio, automatic-speech-recognition, speech-translation
Language	multilingual
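
For capacity planning, the 1.02B parameter count quoted above can be turned into a back-of-the-envelope estimate of weight memory. This is only a sketch: it assumes every parameter is stored at a single precision and ignores activations, beam-search state, and framework overhead, so actual usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
NUM_PARAMS = 1.02e9  # total parameter count stated for this model

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: about {weight_memory_gb(NUM_PARAMS, dtype):.2f} GB")
```

Under these assumptions the weights take roughly 4.08 GB in float32 and 2.04 GB in float16.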

📄 License

The model is released under the cc-by-4.0 license.

📚 Citations

OWSM-CTC

@inproceedings{owsm-ctc,
    title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
    author = "Peng, Yifan  and
      Sudo, Yui  and
      Shakeel, Muhammad  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
    year = "2024",
    month= {8},
    url = "https://aclanthology.org/2024.acl-long.549",
}

OWSM v3.1 and v3.2

@inproceedings{owsm-v32,
  title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2406.09282"
}
@inproceedings{owsm-v31,
  title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2401.16658",
}

Initial OWSM (v1, v2, v3)

@inproceedings{owsm,
  title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2023},
  month={12},
  pdf="https://arxiv.org/pdf/2309.13876",
}
