owsm_ctc_v3.2_ft_1B open-source speech model: free multilingual speech recognition, translation, and language identification

Owsm Ctc V3.2 Ft 1B

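Developed by espnet

OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It supports multilingual speech recognition, speech translation, and language identification.

Speech recognition, Other #Multitask speech processing #Multilingual support #Long-form audio segmented decoding

Downloads: 110

Released: 9/24/2024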

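Model Overview

The model is trained on 180k hours of public audio data and supports multilingual speech recognition, any-to-any speech translation, and language identification. It is part of the Open Whisper-style Speech Model (OWSM) project.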

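Model Features

Multi-task support

Supports three tasks at once: speech recognition, speech translation, and language identification

Large-scale training

Trained on 180k hours of public audio data

Efficient inference

Provides batch inference and long-form audio processing

CTC forced alignment

Supports aligning audio with text using ctc-segmentation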

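Model Capabilities

Multilingual speech recognition

Any-to-any speech translation

Language identification

Long-form audio processing

Batch inference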

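Use Cases

Speech transcription

Automatic transcription of meeting recordings

Converts meeting recordings into written records automatically

Supports accurate transcription in multiple languages

Speech translation

Real-time speech translation

Translates speech in one language into text in another language in real time

Supports translation between any pair of languages

Audio analysis

Language identification

Identifies the language spoken in the audio

Can identify multiple languages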

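🚀 OWSM-CTC Speech Foundation Model

OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It can be used for multilingual speech recognition, any-to-any speech translation, and language identification. It is trained on 180k hours of public audio data and follows the design of the Open Whisper-style Speech Model (OWSM) project.

🚀 Quick Start

This model is initialized from OWSM-CTC v3.1 and then fine-tuned on the v3.2 data for 225k steps. To use the pre-trained model, install espnet and espnet_model_zoo. The required dependencies are: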

librosa
torch
espnet
espnet_model_zoo

The related recipe can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1

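✨ Key Features

Supports multilingual speech recognition, any-to-any speech translation, and language identification.
Provides a unified batch inference method that handles both short and long audio.
Supports CTC forced alignment.

📦 Installation Guide

To use the pre-trained model, install espnet and espnet_model_zoo. The required dependencies are: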

librosa
torch
espnet
espnet_model_zoo
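
Before running the scripts in this document, it can help to confirm that the dependencies listed above are importable. The following is a minimal sketch, not part of ESPnet: `missing_packages` is a hypothetical helper, and it assumes the import name `espnet2` for the espnet package, as used in the example scripts.

```python
from importlib.util import find_spec

# Import names for the dependencies listed above.
# Assumption: the espnet package is imported as "espnet2" in the examples.
REQUIRED = ["librosa", "torch", "espnet2", "espnet_model_zoo"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as missing, install it with your usual package manager before loading the model.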

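💻 Usage Examples

Basic Usage

Batch inference example script

Speech2TextGreedySearch now provides a unified batch inference method, batch_decode, which performs CTC greedy decoding on a batch of short or long audio inputs. Audio shorter than 30 seconds is padded to 30 seconds; longer audio is split into overlapping segments.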

from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",    # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"], # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a list of str

# Please check the code of `batch_decode` for all supported inputs
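
The splitting of long audio into overlapping segments can be pictured roughly as follows. This is an illustrative sketch only, not ESPnet's implementation: `plan_windows` is a hypothetical helper, and it assumes that each 30-second window reserves `context_secs` on both sides as context, so consecutive windows advance by `window_secs - 2 * context_secs`.

```python
def plan_windows(num_samples, sample_rate=16000, window_secs=30, context_secs=4):
    """Return (start, end) sample indices of overlapping windows.

    Hypothetical helper: neighbouring windows overlap by 2 * context_secs.
    """
    window = window_secs * sample_rate
    hop = (window_secs - 2 * context_secs) * sample_rate
    if hop <= 0:
        raise ValueError("context_secs must be less than half of window_secs")
    if num_samples <= window:
        # Short audio fits in a single window (it would be padded afterwards).
        return [(0, num_samples)]
    windows = []
    start = 0
    while True:
        end = min(start + window, num_samples)
        windows.append((start, end))
        if end == num_samples:
            break
        start += hop
    return windows


if __name__ == "__main__":
    # A 70-second recording at 16 kHz with 4 seconds of context on each side.
    for start, end in plan_windows(70 * 16000):
        print(f"{start / 16000:.0f}s to {end / 16000:.0f}s")
```

Under these assumptions, a larger `context_len_in_secs` gives each segment more surrounding audio but produces more windows to decode for the same recording.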

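Short-form ASR/ST/LID example script

Our model is trained on 16 kHz audio with a fixed duration of 30 seconds. When using the pre-trained model, make sure the input speech is sampled at 16 kHz and pad or truncate it to 30 seconds.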

import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
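
For readers who load audio without librosa, the padding and truncation step can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: `fix_to_30s` is a hypothetical helper, the waveform is mono, and it is already sampled at 16 kHz.

```python
import numpy as np


def fix_to_30s(speech, sample_rate=16000, target_secs=30):
    """Zero-pad at the end or truncate a 1-D waveform to target_secs seconds."""
    target_len = sample_rate * target_secs
    speech = np.asarray(speech, dtype=np.float32)
    if speech.ndim != 1:
        raise ValueError("expected a 1-D (mono) waveform")
    if len(speech) >= target_len:
        return speech[:target_len]
    # Pad with trailing zeros (silence) up to the target length.
    return np.pad(speech, (0, target_len - len(speech)))


if __name__ == "__main__":
    one_second = np.ones(16000, dtype=np.float32)
    print(fix_to_30s(one_second).shape)
```

The result has the same length that the `librosa.util.fix_length` call in the script above requests, `16000 * 30` samples.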

Long-form ASR/ST example script

import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4   # left and right context when doing buffered inference
batch_size = 32   # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

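# NOTE: sf.read does not resample; the model expects 16 kHz input
speech, rate = sf.read(
    "xxx.wav"
)
assert rate == 16000, f"expected 16 kHz audio, got {rate} Hz"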

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)

CTC forced alignment example using `ctc-segmentation`

CTC segmentation can be applied efficiently to audio of any length.

import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,    # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",     # "auto" can be more accurate than "fixed" when converting token index to timestamp
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read(
    "./test_utils/ctc_align_test.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
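
In the script above, each line of `text` starts with an utterance ID followed by its transcript, which is the layout used with `kaldi_style_text=True`. The sketch below shows one way to split such text into pairs, for example to check the input before alignment. `parse_kaldi_text` is a hypothetical helper and is not part of ESPnet or ctc-segmentation.

```python
def parse_kaldi_text(text):
    """Split Kaldi-style text into (utterance_id, transcript) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The first whitespace-separated token is taken as the utterance ID.
        utt_id, _, transcript = line.partition(" ")
        pairs.append((utt_id, transcript.strip()))
    return pairs


if __name__ == "__main__":
    example = """
    utt1 THE SALE OF THE HOTELS
    utt2 IS PART OF HOLIDAY'S STRATEGY
    """
    for utt_id, transcript in parse_kaldi_text(example):
        print(utt_id, "|", transcript)
```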

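📚 Detailed Documentation

Model tags: espnet, audio, automatic-speech-recognition, speech-translation, language-identification
Supported languages: multilingual
Training dataset: owsm_v3.2_ctc
Base model: espnet/owsm_ctc_v3.2_ft_1B
License: cc-by-4.0

📄 License

This project is released under the cc-by-4.0 license.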

📚 Citation

OWSM-CTC

@inproceedings{owsm-ctc,
    title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
    author = "Peng, Yifan  and
      Sudo, Yui  and
      Shakeel, Muhammad  and
      Watanabe, Shinji",
    booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
    year = "2024",
    month= {8},
    url = "https://aclanthology.org/2024.acl-long.549",
}

OWSM v3.1 and v3.2

@inproceedings{owsm-v32,
  title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2406.09282"
}
@inproceedings{owsm-v31,
  title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2401.16658",
}

Initial OWSM (v1, v2, v3)

@inproceedings{owsm,
  title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2023},
  month={12},
  pdf="https://arxiv.org/pdf/2309.13876",
}