# OWSM-CTC v3.1 1B

## Model Overview

This model is trained on 180k hours of publicly available audio data, following the design of the Open Whisper-style Speech Model (OWSM) project. It supports multilingual speech recognition, any-to-any speech translation, and language identification.

## Model Highlights

- **Multi-task learning**: supports speech recognition, speech translation, and language identification.
- **Large-scale training**: trained on 180k hours of publicly available audio data.
- **Efficient inference**: provides batched inference and long-form audio processing.
- **CTC forced alignment**: supports efficient timestamp alignment via `ctc-segmentation`.

## Capabilities

- Multilingual speech recognition
- Any-to-any speech translation
- Language identification
- Batched audio processing
- Long-form audio segmentation
- CTC timestamp alignment

## Use Cases

- **Speech transcription**: converting meeting recordings into accurate written records.
- **Speech translation**: translating speech in one language into text in another for smooth cross-lingual communication.
- **Audio analysis**: identifying which language is spoken in an audio clip.
## 🚀 OWSM-CTC Speech Foundation Model

OWSM-CTC is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It can be used for multilingual speech recognition, any-to-any speech translation, and language identification. It is trained on 180k hours of public audio data, following the design of the Open Whisper-style Speech Model (OWSM) project.
## 🚀 Quick Start

To use the pretrained model, please install `espnet` and `espnet_model_zoo`. The required dependencies are:

```
librosa
torch
espnet
espnet_model_zoo
```

The recipe for this model can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
## ✨ Key Features

- Encoder-only architecture based on hierarchical multi-task self-conditioned CTC.
- Trained on 180k hours of public audio data; supports multilingual speech recognition, any-to-any speech translation, and language identification.
- Provides a unified batched inference method that handles both short-form and long-form audio.
## 📦 Installation

Install the required libraries:

```
librosa
torch
espnet
espnet_model_zoo
```
## 💻 Usage Examples

### Basic Usage

#### Batched inference example script

`Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`, which performs CTC greedy decoding on a batch of short-form or long-form audio. Audio shorter than 30 s is padded to 30 s; longer audio is split into overlapping segments.
```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    use_flash_attn=False,   # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",    # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],  # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)   # res is a list of str

# Please check the code of `batch_decode` for all supported inputs
```
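The pad-or-split behavior described above can be sketched as follows. This is a simplified illustration of the idea, not the actual `batch_decode` implementation; the hop/overlap arithmetic here is an assumption:

```python
import numpy as np

def chunk_audio(speech, sr=16000, chunk_secs=30, context_secs=4):
    """Pad short audio to one fixed-length chunk, or split long audio
    into overlapping chunks that share `context_secs` of context."""
    chunk = chunk_secs * sr
    if len(speech) <= chunk:
        # short-form: pad with zeros to exactly 30 s
        return [np.pad(speech, (0, chunk - len(speech)))]
    # long-form: hop by (chunk - 2 * context) so neighboring chunks overlap
    hop = chunk - 2 * context_secs * sr
    chunks = []
    for start in range(0, len(speech), hop):
        seg = speech[start:start + chunk]
        chunks.append(np.pad(seg, (0, chunk - len(seg))))
        if start + chunk >= len(speech):
            break
    return chunks

# a 70 s waveform becomes three 30 s chunks, the last one zero-padded
chunks = chunk_audio(np.zeros(70 * 16000))
print(len(chunks), len(chunks[0]))  # → 3 480000
```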
#### Short-form ASR/ST/LID example script

Our model is trained on 16 kHz audio with a fixed duration of 30 s. When using the pretrained model, please make sure the input speech is 16 kHz and pad or truncate it to 30 s.
```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
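`librosa.load(..., sr=16000)` already resamples for you. If your audio instead lives in a NumPy array at a different rate, it must be resampled to 16 kHz first. A crude linear-interpolation sketch of what resampling does (for real use, prefer `librosa.resample`):

```python
import numpy as np

def resample_linear(speech, orig_sr, target_sr=16000):
    """Toy linear-interpolation resampler; use librosa.resample in practice."""
    n_out = int(round(len(speech) * target_sr / orig_sr))
    # positions of the output samples on the original sample axis
    x_old = np.arange(len(speech))
    x_new = np.linspace(0, len(speech) - 1, n_out)
    return np.interp(x_new, x_old, speech)

# a 1 s signal at 44.1 kHz becomes 16000 samples
y = resample_linear(np.zeros(44100), orig_sr=44100)
print(len(y))  # → 16000
```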
#### Long-form ASR/ST example script
```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

context_len_in_secs = 4   # left and right context when doing buffered inference
batch_size = 32           # depends on the GPU memory

s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.1_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read("xxx.wav")

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
```
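Conceptually, buffered long-form decoding runs the model on overlapping chunks and keeps only each chunk's central region, discarding the left/right context margins. A toy sketch of that stitching idea on lists of frame indices (an illustration of the concept, not espnet's actual code):

```python
def stitch(chunks, context):
    """Keep each chunk's center, dropping `context` items of overlap on each
    side; the first chunk keeps its head and the last keeps its tail."""
    out = []
    for i, ch in enumerate(chunks):
        start = 0 if i == 0 else context
        end = len(ch) if i == len(chunks) - 1 else len(ch) - context
        out.extend(ch[start:end])
    return out

# three overlapping frame windows stitched back into one sequence
print(stitch([[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]], context=1))
# → [0, 1, 2, 3, 4, 5, 6, 7]
```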
#### CTC forced alignment example using `ctc-segmentation`

CTC segmentation can be applied efficiently to audio of arbitrary length.
```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader

# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v3.1_1B")

aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,    # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",     # "auto" can be more accurate than "fixed" when converting token index to timestamp
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read("./test_utils/ctc_align_test.wav")
print(f"speech duration: {len(speech) / rate : .2f} seconds")

text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""

segments = aligner(speech, text)
print(segments)
```
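With `kaldi_style_text=True`, each line of `text` pairs an utterance ID with its transcript, as in the example above. A small helper for building such a string from a plain list of transcripts (the `utt{i}` naming is just an illustrative convention; any unique IDs work):

```python
def to_kaldi_style(transcripts):
    """Build a kaldi-style text block: one 'utt-id TRANSCRIPT' pair per line."""
    lines = [f"utt{i + 1} {t.upper()}" for i, t in enumerate(transcripts)]
    return "\n" + "\n".join(lines) + "\n"

text = to_kaldi_style([
    "the sale of the hotels",
    "is part of holiday's strategy",
])
print(text)
```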
## 📚 Documentation

### Model Information

| Property | Details |
|---|---|
| Model type | Encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC |
| Training data | 180k hours of public audio data |
| Evaluation metrics | CER, BLEU, accuracy |
| License | CC BY 4.0 |
### Citations

#### OWSM-CTC

```bibtex
@inproceedings{owsm-ctc,
  title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
  author = "Peng, Yifan and Sudo, Yui and Shakeel, Muhammad and Watanabe, Shinji",
  booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
  year = "2024",
  month = {8},
  url = "https://aclanthology.org/2024.acl-long.549",
}
```
#### OWSM v3.1 and v3.2

```bibtex
@inproceedings{owsm-v32,
  title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2406.09282",
}

@inproceedings{owsm-v31,
  title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf="https://arxiv.org/pdf/2401.16658",
}
```
#### Initial OWSM (v1, v2, v3)

```bibtex
@inproceedings{owsm,
  title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2023},
  month={12},
  pdf="https://arxiv.org/pdf/2309.13876",
}
```
## 📄 License

This project is released under the CC BY 4.0 license. For details, see: https://creativecommons.org/licenses/by/4.0/