dasheng-1.2B open-source audio encoder - capturing audio information across speech, music, and environmental sound

Dasheng 1.2B

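Developed by mispeech

Dasheng is a general-purpose audio encoder trained with large-scale self-supervised learning. It captures rich audio information across domains such as speech, music, and environmental sound.

Audio Classification

Transformers

License: Apache-2.0 #Large-scale audio encoding #Multi-domain audio classification #Self-supervised learning

Downloads: 135

Released: June 6, 2024

Model Overview

Dasheng is a general-purpose audio encoder with 1.2 billion parameters. It was trained on 272,356 hours of diverse audio and performs well on speech, music, and environmental sound classification tasks.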

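Model Features

Large-scale training: trained on 272,356 hours of diverse audio data.
Multi-domain coverage: handles speech, music, and environmental sound.
Strong performance: surpasses earlier results on the HEAR benchmark, with strong scores on several tasks.
General-purpose encoder: produces audio embeddings that can be used across many downstream tasks.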

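Model Capabilities

Audio feature extraction
Speech classification
Music classification
Environmental sound classification
Audio embedding generation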

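Use Cases

Speech processing

Speech command recognition: recognizes short spoken commands; performs strongly on the Speech Commands task.
Speaker counting: estimates the number of speakers in a recording; achieves good results on the LibriCount task.

Music analysis

Music classification: classifies music clips; performs strongly on music classification tasks.

Environmental sound analysis

Environmental sound recognition: identifies sounds in the surrounding environment; performs well on environmental sound classification tasks.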

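🚀 Dasheng: A Large-Scale General-Purpose Audio Encoder

Dasheng (Deep Audio-Signal Holistic Embeddings), or "大声" (Chinese for "loud sound"), is a general-purpose audio encoder trained on a large-scale self-supervised learning task. It is designed to capture rich audio information across domains including speech, music, and environmental sound. The model has 1.2 billion parameters, was trained on 272,356 hours of diverse audio data, and shows significant gains on the HEAR benchmark. It outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and also performs well on music and environmental sound classification.

Original repository: https://github.com/XiaoMi/dasheng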

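🚀 Quick Start

✨ Key Features

General-purpose audio encoding: handles many kinds of audio, including speech, music, and environmental sound.
Large-scale training: trained on 272,356 hours of diverse audio data, with 1.2 billion parameters.
Strong performance: performs well on the HEAR benchmark and on multiple audio classification tasks.

📦 Installation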

pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_dasheng.git

💻 Usage Examples

Basic Usage

>>> model_name = "mispeech/dasheng-1.2B"

>>> from dasheng_model.feature_extraction_dasheng import DashengFeatureExtractor
>>> from dasheng_model.modeling_dasheng import DashengModel

>>> feature_extractor = DashengFeatureExtractor.from_pretrained(model_name)
>>> model = DashengModel.from_pretrained(model_name, outputdim=None)  # no linear output layer if `outputdim` is `None`

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> audio.shape
torch.Size([1, 16000])   # mono audio of 1 second

>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
>>> inputs.input_values.shape
torch.Size([1, 64, 101])   # 64 mel-filterbanks, 101 frames

>>> import torch
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> outputs.hidden_states.shape
torch.Size([1, 25, 768])   # 25 T-F patches (patch size 64x4, no overlap), before mean-pooling

>>> outputs.logits.shape
torch.Size([1, 768])   # mean-pooled embedding (would be logits from a linear layer if `outputdim` was set)
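
The shapes in the transcript above can be checked by hand. The sketch below reproduces the 101 frames and 25 patches for a 1-second clip. It assumes a 10 ms hop (160 samples at 16 kHz) with a centered STFT, which is not stated in the source but is consistent with the printed shapes. The patch width of 4 comes from the comment in the transcript. `num_frames` and `num_patches` are illustrative helper names, not part of the `dasheng_model` package.

```python
def num_frames(num_samples: int, hop_length: int = 160) -> int:
    """Frame count for a centered STFT: one frame per hop, plus one for the padded edge.

    hop_length=160 (10 ms at 16 kHz) is an assumption, not taken from the source.
    """
    return 1 + num_samples // hop_length


def num_patches(frames: int, patch_width: int = 4) -> int:
    """Non-overlapping patches along the time axis; a trailing partial patch is dropped."""
    return frames // patch_width


frames = num_frames(16000)      # 1 second of 16 kHz audio
patches = num_patches(frames)   # patches of 4 frames each (64x4, no overlap)
print(frames, patches)          # prints: 101 25
```

The last dimension of `outputs.hidden_states` and `outputs.logits` is the encoder's embedding width. Downstream code is probably safer reading it from `outputs.hidden_states.shape[-1]` than hard-coding the value printed above, since the width depends on the checkpoint.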

Advanced Usage

Fine-tuning on ESC-50: the notebook example_finetune_esc50.ipynb, which can be opened in Colab, shows how to train a linear head on the ESC-50 dataset while keeping the Dasheng encoder frozen.
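
As a rough illustration of the frozen-encoder setup described above, the sketch below trains a linear head on top of an encoder whose parameters are not updated. It is not the notebook's code. `StubEncoder` is a hypothetical stand-in so the example runs without downloading the checkpoint, the batch is random data rather than ESC-50 audio, and `EMBED_DIM = 768` is simply the width printed in the basic-usage transcript. In practice the stub would be replaced by the pretrained model and its pooled output.

```python
import torch
from torch import nn

torch.manual_seed(0)

EMBED_DIM = 768    # width printed in the basic-usage transcript; read the real value from the model
NUM_CLASSES = 50   # ESC-50 has 50 classes


class StubEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained encoder: (batch, 64, frames) -> (batch, EMBED_DIM)."""

    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(64, EMBED_DIM)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # Project each frame's 64 mel bins, then mean-pool over time.
        return self.proj(input_values.transpose(1, 2)).mean(dim=1)


encoder = StubEncoder()
for param in encoder.parameters():
    param.requires_grad = False   # freeze the encoder
encoder.eval()

head = nn.Linear(EMBED_DIM, NUM_CLASSES)   # the only trainable part
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random stand-in batch: 8 clips, 64 mel bins, 101 frames each.
mels = torch.randn(8, 64, 101)
labels = torch.randint(0, NUM_CLASSES, (8,))

encoder_before = encoder.proj.weight.clone()
for step in range(5):
    with torch.no_grad():          # no gradients flow into the encoder
        embeddings = encoder(mels)
    logits = head(embeddings)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Only `head.parameters()` is passed to the optimizer, so the encoder weights stay identical across steps. Computing the embeddings under `torch.no_grad()` also avoids storing activations for the frozen part.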

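📄 License

This project is released under the Apache-2.0 license.

📚 Detailed Documentation

If you find Dasheng useful in your research, please cite the following paper: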

@inproceedings{dinkel2023scaling,
  title={Scaling up masked audio encoder learning for general audio classification},
  author={Dinkel, Heinrich and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun and Wang, Bin},
  booktitle={Interspeech 2024},
  year={2024}
}