owsm_ctc_v3.1_1B: an open-source speech model supporting multilingual speech recognition, translation, and language identification

Owsm Ctc V3.1 1B

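Developed by espnet

OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It supports multilingual speech recognition, speech translation, and language identification.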

Category: Speech recognition, Other. Tags: #multilingual speech recognition #any-to-any speech translation #encoder-only architecture

Downloads: 116

Release date: 2/23/2024

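Model overview

The model is trained on 180k hours of public audio data and follows the design of the Open Whisper-style Speech Model (OWSM) project. It supports multilingual speech recognition, any-to-any speech translation, and language identification.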

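Model features

Multi-task learning

Supports three tasks: speech recognition, speech translation, and language identification

Large-scale training

Trained on 180k hours of public audio data

Efficient inference

Provides batched inference and long-form audio processing

CTC forced alignment

Supports efficient timestamp alignment using ctc-segmentation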

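Model capabilities

Multilingual speech recognition

Any-to-any speech translation

Language identification

Batched audio processing

Segmented processing of long-form audio

CTC timestamp alignment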

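Use cases

Speech-to-text

Meeting transcription

Converts meeting recordings into written records

High-accuracy transcripts

Speech translation

Real-time speech translation

Translates speech in one language into text in another language in real time

Smooth cross-language communication

Speech analysis

Language identification

Identifies the language spoken in an audio clip

Accurate language classification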

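🚀 OWSM-CTC model

OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification.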

🚀 Quick start

To use this model, install espnet and espnet_model_zoo. The required libraries are:

librosa
torch
espnet
espnet_model_zoo

The recipe is available in the ESPnet repository: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1

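✨ Key features

Supports multilingual speech recognition, any-to-any speech translation, and language identification.
Provides batched inference for both short-form and long-form audio.
Includes CTC forced alignment.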

📦 Installation

The following libraries must be installed to use this model:

librosa
torch
espnet
espnet_model_zoo
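
Before loading the model, it can help to confirm that these libraries are importable. The following is a minimal sketch rather than part of the official instructions: the helper name missing_packages is hypothetical, and it assumes the espnet distribution exposes the espnet2 import name used by the examples in this document.

```python
from importlib.util import find_spec

# Import names for the libraries listed above.
# Assumption: the espnet distribution is imported as "espnet2" in the examples.
REQUIRED = ["librosa", "torch", "espnet2", "espnet_model_zoo"]


def missing_packages(names=REQUIRED):
    """Return the top-level import names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Anything reported as missing can usually be installed from PyPI under the library names listed above.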

💻 Usage examples

Basic usage

Batch inference example

Speech2TextGreedySearch provides a batched inference method, batch_decode, which performs CTC greedy decoding on a batch of short-form or long-form audio.

from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",    # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"], # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a list of str

# Please check the code of `batch_decode` for all supported inputs
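
For readers unfamiliar with CTC greedy decoding, the core idea can be sketched in a few lines: take the most likely token per frame, merge consecutive repeats, then drop the blank token. This is only an illustration of the general technique, not ESPnet's implementation. The function name ctc_greedy_collapse and the blank ID of 0 are assumptions made for the example.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax IDs into a token sequence.

    Consecutive repeats are merged first, then blank tokens are dropped.
    blank_id=0 is an assumption for this sketch.
    """
    return [token for token, _ in groupby(frame_ids) if token != blank_id]


# Per-frame argmax IDs for a toy utterance (0 is the blank).
frames = [0, 3, 3, 0, 0, 5, 5, 5, 0, 3, 0]
print(ctc_greedy_collapse(frames))  # [3, 5, 3]
```

Note that the blank between the two runs of 3 is what keeps them as separate tokens; without it they would be merged into one.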

Short-form ASR/ST/LID example

The model is trained on 16 kHz audio with a fixed duration of 30 seconds. Input audio must be sampled at 16 kHz and padded or truncated to 30 seconds.

import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
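
The pad-or-truncate step can also be written directly with NumPy, which makes the behavior explicit: shorter clips are zero-padded at the end and longer clips are cut at 30 seconds. This is a sketch under the assumption that the waveform is a 1-D array already sampled at 16 kHz. The helper name pad_or_trim is hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000   # the model expects 16 kHz input
TARGET_SECONDS = 30   # fixed training duration stated above


def pad_or_trim(speech, target_len=SAMPLE_RATE * TARGET_SECONDS):
    """Zero-pad at the end, or truncate, a 1-D waveform to target_len samples."""
    speech = np.asarray(speech, dtype=np.float32)
    if len(speech) >= target_len:
        return speech[:target_len]
    return np.pad(speech, (0, target_len - len(speech)))


short_clip = pad_or_trim(np.ones(SAMPLE_RATE * 1))    # 1 s of audio
long_clip = pad_or_trim(np.ones(SAMPLE_RATE * 45))    # 45 s of audio
print(short_clip.shape, long_clip.shape)
```

Truncation discards everything after 30 seconds, so clips longer than that are better handled by the long-form decoding path.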

Long-form ASR/ST example

import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4   # left and right context when doing buffered inference
batch_size = 32   # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

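speech, rate = sf.read("xxx.wav")
# NOTE: soundfile does not resample; the model expects 16 kHz input
assert rate == 16000, f"expected 16 kHz audio, got {rate} Hz"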

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
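
To give an intuition for what context_len_in_secs controls, the sketch below shows one way buffered decoding could split a long recording into overlapping windows: each window carries left and right context, and only its central part contributes new output. This is an illustration of the general idea and not ESPnet's implementation. The function name buffered_windows is hypothetical, and the 30-second buffer is an assumption taken from the model's training duration.

```python
def buffered_windows(total_secs, buffer_secs=30.0, context_secs=4.0):
    """Yield (window_start, window_end, keep_start, keep_end) in seconds.

    Each window spans at most buffer_secs. Only [keep_start, keep_end)
    contributes new output; the rest of the window is context.
    """
    hop = buffer_secs - 2 * context_secs
    if hop <= 0:
        raise ValueError("context_secs must be less than half of buffer_secs")
    keep_start = 0.0
    while keep_start < total_secs:
        keep_end = min(keep_start + hop, total_secs)
        window_start = max(keep_start - context_secs, 0.0)
        window_end = min(keep_end + context_secs, total_secs)
        yield window_start, window_end, keep_start, keep_end
        keep_start = keep_end


# A 60-second recording with 4 seconds of context on each side.
for window in buffered_windows(60.0):
    print(window)
```

Under these assumptions, a larger context leaves less new audio per window, so more windows are needed to cover the same recording.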

CTC forced alignment example

import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.1_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,    # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",     # "auto" can be more accurate than "fixed" when converting token index to timestamp
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read(
    "./test_utils/ctc_align_test.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
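
With kaldi_style_text=True, the example above passes text in which each line starts with an utterance ID followed by the transcript. When transcripts come from another source, that string can be assembled programmatically. The helper below is a small sketch; the name to_kaldi_style_text is hypothetical and not part of ESPnet.

```python
def to_kaldi_style_text(utterances):
    """Join (utterance_id, transcript) pairs into one 'id text' line each."""
    lines = []
    for utt_id, transcript in utterances:
        if not utt_id or any(ch.isspace() for ch in utt_id):
            raise ValueError(f"utterance ID must be non-empty without whitespace: {utt_id!r}")
        # Normalize internal whitespace so each utterance stays on one line.
        lines.append(f"{utt_id} {' '.join(transcript.split())}")
    return "\n".join(lines) + "\n"


text = to_kaldi_style_text([
    ("utt1", "THE SALE OF THE HOTELS"),
    ("utt2", "IS PART OF HOLIDAY'S STRATEGY"),
])
print(text)
```

The resulting string has the same shape as the literal used in the alignment example and could be passed to the aligner in its place.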

📚 Documentation

For details on the model, see the paper OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification.
The recipe is available in the ESPnet repository: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1

📄 License

This model is released under the CC BY 4.0 license.

📚 Citation

OWSM-CTC

@inproceedings{owsm-ctc,
    title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
    author = "Peng, Yifan  and
      Sudo, Yui  and
      Shakeel, Muhammad  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
    year = "2024",
    month= {8},
    url = "https://aclanthology.org/2024.acl-long.549",
}

OWSM v3.1 and v3.2

@inproceedings{owsm-v32,
  title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2406.09282"
}
@inproceedings{owsm-v31,
  title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2401.16658",
}

Initial OWSM (v1, v2, v3)

@inproceedings{owsm,
  title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2023},
  month={12},
  pdf="https://arxiv.org/pdf/2309.13876",
}