owsm_ctc_v3.2_ft_1B: a free, open-source speech model for multilingual speech recognition, translation, and language identification

Owsm Ctc V3.2 Ft 1B

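Developed by espnet

OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It supports multilingual speech recognition, speech translation, and language identification.

Speech recognition, Other #Multitask speech processing #Multilingual support #Long-form audio chunked decoding

Downloads: 110

Release date: 9/24/2024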

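Model Overview

This model was trained on 180k hours of public audio data and supports multilingual speech recognition, any-to-any speech translation, and language identification. It is part of the Open Whisper-style Speech Model (OWSM) project.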

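Model Features

Multi-task support

Handles three tasks with a single model: speech recognition, speech translation, and language identification

Large-scale training

Trained on 180k hours of public audio data

Efficient inference

Provides batch inference and long-form audio processing

CTC forced alignment

Supports aligning audio with text using ctc-segmentation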

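Model Capabilities

Multilingual speech recognition

Any-to-any speech translation

Language identification

Long-form audio processing

Batch inference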

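Use Cases

Speech transcription

Automatic transcription of meeting minutes

Automatically converts meeting recordings into written records

Supports accurate transcription in multiple languages

Speech translation

Real-time speech translation

Translates speech in one language into text in another language in real time

Supports translation between arbitrary language pairs

Speech analysis

Language identification

Identifies the language spoken in the audio

Can identify multiple languages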

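🚀 OWSM-CTC Model

OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It is used for multilingual speech recognition, any-to-any speech translation, and language identification.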
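
Because the model is CTC-based and the inference class used throughout this card is named Speech2TextGreedySearch, it may help to see what greedy CTC decoding does in general. The following is a minimal, library-free sketch of the standard technique (take the best token per frame, merge consecutive repeats, drop blanks). It is not ESPnet's implementation: the function name `ctc_greedy_decode`, the toy scores, and `blank_id=0` are assumptions made for illustration, and the self-conditioning on intermediate layers is not modeled here.

```python
from typing import List, Optional, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    blank_id: int = 0,  # assumption: index 0 is the CTC blank in this toy vocabulary
) -> List[int]:
    """Collapse per-frame best tokens into an output token sequence."""
    tokens: List[int] = []
    previous: Optional[int] = None
    for scores in frame_scores:
        # Pick the highest-scoring token id for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Keep it only if it is not a repeat of the previous frame and not blank.
        if best != previous and best != blank_id:
            tokens.append(best)
        previous = best
    return tokens


# Toy scores for 5 frames over a 3-symbol vocabulary (0 = blank).
frames = [
    [0.1, 0.8, 0.1],    # token 1
    [0.1, 0.7, 0.2],    # token 1 again: merged with the previous frame
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.7, 0.1],    # token 1 after a blank: kept as a new token
    [0.1, 0.1, 0.8],    # token 2
]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

Since every frame is decoded independently, this kind of decoding has no autoregressive loop, which is one reason encoder-only CTC models lend themselves to batched inference.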

🚀 Quick Start

To use this pre-trained model, you need to install espnet and espnet_model_zoo. The required libraries are:

librosa
torch
espnet
espnet_model_zoo

The recipe can be found in ESPnet: link

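✨ Key Features

Supports multilingual speech recognition, any-to-any speech translation, and language identification
Trained on 180k hours of public audio data
Provides batch inference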

📦 Installation

Install the following libraries before using the pre-trained model:

librosa
torch
espnet
espnet_model_zoo

💻 Usage Examples

Basic usage

Example: batch inference

from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",    # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"], # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a list of str

# Please check the code of `batch_decode` for all supported inputs

Example: short-form ASR/ST/LID

import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
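
The note in the example above says the model expects 16 kHz audio with a fixed 30 s duration. As a rough illustration of what that preparation step amounts to, here is a NumPy-only sketch that zero-pads short clips and trims long ones to exactly 30 s. The helper name `pad_or_trim`, the constants, and the mono-only check are assumptions for this sketch; in the example above the same job is done by `librosa.util.fix_length`.

```python
import numpy as np

TARGET_RATE = 16000   # sample rate stated in the note above
TARGET_SECONDS = 30   # fixed input duration stated in the note above


def pad_or_trim(speech: np.ndarray, rate: int) -> np.ndarray:
    """Return a 1-D waveform of exactly 30 s at 16 kHz (hypothetical helper)."""
    if rate != TARGET_RATE:
        raise ValueError(f"expected {TARGET_RATE} Hz audio, got {rate} Hz; resample first")
    if speech.ndim != 1:
        # Assumption for this sketch: only mono, 1-D input is handled.
        raise ValueError("expected a mono 1-D waveform")
    target_len = TARGET_RATE * TARGET_SECONDS
    if len(speech) >= target_len:
        return speech[:target_len]  # anything after 30 s is discarded
    # Zero-pad at the end up to the fixed length.
    return np.pad(speech, (0, target_len - len(speech)))


short_clip = pad_or_trim(np.ones(16000, dtype=np.float32), 16000)       # 1 s input
long_clip = pad_or_trim(np.ones(16000 * 45, dtype=np.float32), 16000)   # 45 s input
print(short_clip.shape, long_clip.shape)  # (480000,) (480000,)
```

Note that trimming silently drops everything past 30 s, so recordings longer than that are better handled with the long-form decoding path.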

Example: long-form ASR/ST

import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4   # left and right context when doing buffered inference
batch_size = 32   # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

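# NOTE: as in the short-form example, the model expects 16 kHz audio; resample first if `rate` is not 16000
speech, rate = sf.read(
    "xxx.wav"
)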

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
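
The `context_len_in_secs` argument above is described as the left and right context used in buffered inference. The sketch below shows one plausible way such windowing can be planned; it is not ESPnet's code and the real `decode_long_batched_buffered` may segment audio differently. It assumes 30 s windows (the fixed input length mentioned in the short-form example), where each window's output is kept only for its central region and the context on each side is discarded. The function name `plan_windows` and the tuple layout are hypothetical.

```python
from typing import List, Tuple


def plan_windows(
    total_secs: float,
    window_secs: float = 30.0,   # assumption: matches the model's fixed 30 s input
    context_secs: float = 4.0,   # same role as context_len_in_secs above
) -> List[Tuple[float, float, float, float]]:
    """Return (win_start, win_end, keep_start, keep_end) tuples in seconds."""
    if window_secs <= 2 * context_secs:
        raise ValueError("window must be longer than the two context regions")
    hop = window_secs - 2 * context_secs  # length of the region kept per window
    plans: List[Tuple[float, float, float, float]] = []
    keep_start = 0.0
    while keep_start < total_secs:
        keep_end = min(keep_start + hop, total_secs)
        # Extend the kept region by the context on both sides, clipped to the audio.
        win_start = max(keep_start - context_secs, 0.0)
        win_end = min(keep_end + context_secs, total_secs)
        plans.append((win_start, win_end, keep_start, keep_end))
        keep_start = keep_end
    return plans


for plan in plan_windows(60.0):
    print(plan)
# (0.0, 26.0, 0.0, 22.0)
# (18.0, 48.0, 22.0, 44.0)
# (40.0, 60.0, 44.0, 60.0)
```

Under these assumptions a larger context leaves fewer kept seconds per window, so more windows are needed for the same recording; the windows are independent of each other, which is what makes batching them straightforward.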

Example: CTC forced alignment

import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,    # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",     # "auto" can be more accurate than "fixed" when converting token index to timestamp
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read(
    "./test_utils/ctc_align_test.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
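
With `kaldi_style_text=True`, the transcript in the example above follows the Kaldi `text` convention: one utterance per line, starting with an utterance ID followed by the transcript. When the transcript lives in a Python data structure rather than a hand-written string, a small helper can build and check that format. This is a sketch with hypothetical helper names (`to_kaldi_text`, `parse_kaldi_text`); neither is part of ESPnet.

```python
from typing import Iterable, List, Tuple


def to_kaldi_text(utterances: Iterable[Tuple[str, str]]) -> str:
    """Format (utt_id, transcript) pairs as 'utt_id transcript' lines."""
    lines: List[str] = []
    for utt_id, transcript in utterances:
        # The ID is the first whitespace-delimited field, so it cannot contain spaces.
        if not utt_id or any(ch.isspace() for ch in utt_id):
            raise ValueError(f"invalid utterance id: {utt_id!r}")
        normalized = " ".join(transcript.split())  # collapse newlines and extra spaces
        lines.append(f"{utt_id} {normalized}")
    return "\n".join(lines) + "\n"


def parse_kaldi_text(text: str) -> List[Tuple[str, str]]:
    """Inverse of to_kaldi_text; blank lines are skipped."""
    pairs: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        utt_id, _, transcript = line.partition(" ")
        pairs.append((utt_id, transcript.strip()))
    return pairs


utterances = [
    ("utt1", "THE SALE OF THE HOTELS"),
    ("utt2", "IS PART OF HOLIDAY'S STRATEGY"),
    ("utt3", "TO SELL OFF ASSETS"),
    ("utt4", "AND CONCENTRATE ON PROPERTY MANAGEMENT"),
]
text = to_kaldi_text(utterances)
print(text)
```

The resulting `text` string can be passed to the aligner in place of the triple-quoted literal used above.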

📚 Documentation

For details about the model, see OWSM-CTC.
For the dataset, see owsm_v3.2_ctc.

🔧 Technical Details

This model is initialized from OWSM-CTC v3.1 and then fine-tuned on the v3.2 data for 225k steps.

📄 License

This model is released under the CC-BY-4.0 license.

Citations

OWSM-CTC

@inproceedings{owsm-ctc,
    title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
    author = "Peng, Yifan  and
      Sudo, Yui  and
      Shakeel, Muhammad  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
    year = "2024",
    month= {8},
    url = "https://aclanthology.org/2024.acl-long.549",
}

OWSM v3.1 and v3.2

@inproceedings{owsm-v32,
  title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2406.09282"
}
@inproceedings{owsm-v31,
  title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2401.16658",
}

Initial OWSM (v1, v2, v3)

@inproceedings{owsm,
  title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2023},
  month={12},
  pdf="https://arxiv.org/pdf/2309.13876",
}