OWSM-CTC v3.2 ft 1B
Model Introduction
This model is trained on 180k hours of public audio data and supports multilingual speech recognition, any-to-any speech translation, and language identification. It is part of the Open Whisper-style Speech Model (OWSM) project.
Model Features
Multi-task support: a single model covers speech recognition, speech translation, and language identification.
Large-scale training: trained on 180k hours of public audio data.
Efficient inference: batched inference and long-form audio processing.
CTC forced alignment: audio-to-text alignment with ctc-segmentation.
Model Capabilities
Multilingual speech recognition
Any-to-any speech translation
Language identification
Long-form audio processing
Batched inference
Use Cases
Speech transcription: automatic meeting transcription, converting meeting recordings into accurate text in multiple languages.
Speech translation: real-time translation of speech in one language into text in another, for arbitrary language pairs.
Audio analysis: language identification, i.e. detecting which of the supported languages is spoken in an audio clip.
🚀 OWSM-CTC Speech Foundation Model
OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It can be used for multilingual speech recognition, any-to-any speech translation, and language identification. It is trained on 180k hours of public audio data, following the design of the Open Whisper-style Speech Model (OWSM) project.
🚀 Quick Start
This model is initialized with OWSM-CTC v3.1 and then fine-tuned on the v3.2 data for 225k steps. To use the pre-trained model, install espnet and espnet_model_zoo as described in the installation guide below.
The recipe is available in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
✨ Key Features
- Supports multilingual speech recognition, any-to-any speech translation, and language identification.
- Provides a unified batched inference method that handles both short-form and long-form audio.
- Supports CTC forced alignment.
📦 Installation
To use the pre-trained model, install espnet and espnet_model_zoo. The required dependencies are:
- librosa
- torch
- espnet
- espnet_model_zoo
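A typical pip-based setup would be something along these lines (this is a convenience assumption, not an officially pinned environment; install PyTorch according to your platform/CUDA setup if needed):

pip install torch librosa espnet espnet_model_zoo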
💻 Usage Examples
Basic Usage
Example script for batched inference
Speech2TextGreedySearch now provides a unified batched inference method batch_decode, which performs CTC greedy decoding on a batch of short-form or long-form audio. If an audio clip is shorter than 30 s, it is padded to 30 s; otherwise it is split into overlapping chunks.
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",    # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],  # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a list of str

# Please check the code of `batch_decode` for all supported inputs
Example script for short-form ASR/ST/LID
Our model is trained on 16 kHz audio with a fixed duration of 30 s. When using the pre-trained model, make sure the input speech is 16 kHz and pad or truncate it to 30 s.
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
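The snippet above performs ASR. Speech translation and language identification use the same interface with different language/task tokens; the sketch below is an assumption based on the OWSM-style token format (e.g. a task token such as <st_eng> for translation into English), so please check the vocabulary shipped with this checkpoint for the exact symbols.

# A minimal ST sketch, assuming OWSM-style task tokens; verify the exact
# token set of this checkpoint before relying on it.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',     # language of the input speech
    task_sym='<st_deu>',  # assumed token: translate into German
)
print(s2t_st(speech)[0])  # reuses the 30 s, 16 kHz `speech` array from above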
Example script for long-form ASR/ST
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4  # left and right context when doing buffered inference
batch_size = 32          # depends on the GPU memory

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read(
    "xxx.wav"
)
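# NOTE: as in the short-form example above, the model expects 16 kHz input;
# resample the audio first if your file uses a different sampling rate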
text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
Example of CTC forced alignment using ctc-segmentation
CTC segmentation can be applied efficiently to audio of arbitrary length.
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.2_ft_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,          # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",     # "auto" can be more accurate than "fixed" when converting token index to timestamp
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read(
    "./test_utils/ctc_align_test.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")

text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
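The printed segments typically follow ESPnet's CTC segmentation output convention, with one line per utterance containing the utterance name, estimated start and end times in seconds, a confidence score, and the text, so the result can be written out as a Kaldi-style segments file. (This describes the usual behaviour of the ESPnet aligner interface; inspect the returned object if your version differs.)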
📚 Detailed Documentation
- Model tags: espnet, audio, automatic-speech-recognition, speech-translation, language-identification
- Supported languages: multilingual
- Training dataset: owsm_v3.2_ctc
- Base model: espnet/owsm_ctc_v3.2_ft_1B
- License: cc-by-4.0
📄 License
This project is released under the CC-BY-4.0 license.
📚 Citations
OWSM-CTC
@inproceedings{owsm-ctc,
title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
author = "Peng, Yifan and
Sudo, Yui and
Shakeel, Muhammad and
Watanabe, Shinji",
booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
year = "2024",
month= {8},
url = "https://aclanthology.org/2024.acl-long.549",
}
OWSM v3.1 and v3.2
@inproceedings{owsm-v32,
title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
year={2024},
month={9},
pdf="https://arxiv.org/pdf/2406.09282"
}
@inproceedings{owsm-v31,
title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
year={2024},
month={9},
pdf="https://arxiv.org/pdf/2401.16658",
}
Initial OWSM (v1, v2, v3)
@inproceedings{owsm,
title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
year={2023},
month={12},
pdf="https://arxiv.org/pdf/2309.13876",
}