MERT-v1-330M: A Free, Open-Source Music Understanding Model for Music Information Retrieval Tasks


MERT V1 330M

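Developed by m-a-p

MERT-v1-330M is an advanced music understanding model trained with the MLM (masked language modeling) paradigm. It has 330M parameters, takes audio sampled at 24 kHz, outputs features at 75 Hz, and is suited to a range of music information retrieval tasks.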

Audio classification

Transformers

#MusicUnderstanding #HighSamplingRate #LargeScalePretraining

Downloads: 83.72k

Released: March 17, 2023

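Model overview

MERT-v1-330M is a pre-trained model for music audio, trained with the MLM paradigm. It generalizes better across tasks and uses a higher audio sampling rate than earlier models in the family, and it is suited to tasks such as music classification and music generation.

Model features

High audio sampling rate

Trained on 24 kHz audio, which preserves more high-frequency content than the 16 kHz audio used for MERT-v0.

Large-scale training data

Trained on 160K hours of music, which improves generalization.

Multi-codebook pseudo-labels

Uses 8 codebooks from EnCodec as pseudo-labels, which potentially improves label quality and lets the model support music generation tasks.

In-batch noise mixup

Adds in-batch noise mixup to MLM prediction to make the model more robust.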

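Model capabilities

Music classification

Music information retrieval

Music generation

Use cases

Music analysis

Music genre classification

Classifies music clips by genre, such as pop, classical, or jazz.

Outperforms earlier models on several downstream tasks.

Music emotion recognition

Identifies emotional characteristics in music, such as happiness, sadness, or anger.

Music generation

Music clip generation

Generates new music clips from features of the input audio.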

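🚀 Introduction to the Music Audio Pre-training (m-a-p) model family

This project develops the Music Audio Pre-training (m-a-p) family of models to support music-related tasks. Successive iterations and optimizations aim to improve performance on music understanding, generation, and related tasks.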

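🚀 Quick start

Model development log

June 2, 2023: arXiv preprint and training code released.
March 17, 2023: Released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new pre-training paradigm and new datasets. They outperform the previous models and generalize better to more tasks.
March 14, 2023: Retrained MERT-v0 using only open-source music datasets, yielding [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
December 29, 2022: Released [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), a music understanding model trained with the MLM paradigm that performs better on downstream tasks.
October 29, 2022: Released [music2vec](https://huggingface.co/m-a-p/music2vec-v1), a pre-trained MIR model trained with the BYOL paradigm.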

Model quick-selection table

Attribute	Details
Model types	MERT-v1-330M, MERT-v1-95M, MERT-v0-public, MERT-v0, and music2vec-v1
Training data	Audio datasets of different sizes: 160K, 20K, 900, and 1000 hours

Name	Pre-training paradigm	Training data (hours)	Pre-training context (seconds)	Model size	Transformer layers - dimension	Feature rate	Sample rate	Release date
[MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M)	MLM	160K	5	330M	24 - 1024	75 Hz	24 kHz	March 17, 2023
[MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M)	MLM	20K	5	95M	12 - 768	75 Hz	24 kHz	March 17, 2023
[MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public)	MLM	900	5	95M	12 - 768	50 Hz	16 kHz	March 14, 2023
[MERT-v0](https://huggingface.co/m-a-p/MERT-v0)	MLM	1000	5	95M	12 - 768	50 Hz	16 kHz	December 29, 2022
[music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1)	BYOL	1000	30	95M	12 - 768	50 Hz	16 kHz	October 30, 2022
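
When choosing a model size for your hardware, a rough lower bound on memory is the size of the weights alone. The sketch below uses the parameter counts from the table and assumes 4 bytes per parameter (float32); the helper name `weight_memory_gb` is hypothetical, and the estimate ignores activations, buffers, and framework overhead.

```python
# Parameter counts (in millions) copied from the model table.
PARAMS_MILLIONS = {
    "MERT-v1-330M": 330,
    "MERT-v1-95M": 95,
    "MERT-v0-public": 95,
    "MERT-v0": 95,
    "music2vec-v1": 95,
}


def weight_memory_gb(params_millions: float, bytes_per_param: int = 4) -> float:
    """Approximate memory for the weights only, in GB (1 GB = 1e9 bytes).

    bytes_per_param=4 assumes float32 storage; this is an assumption,
    not a measured figure.
    """
    return params_millions * 1e6 * bytes_per_param / 1e9


for name, size in PARAMS_MILLIONS.items():
    print(f"{name}: ~{weight_memory_gb(size):.2f} GB of float32 weights")
```

Passing `bytes_per_param=2` gives the corresponding estimate for half-precision weights.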

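✨ Key features

About the models

The m-a-p models share a similar architecture; the most notable difference among them is the paradigm used in pre-training. A few other technical settings are worth knowing before use:

Model size: the number of parameters loaded into memory. Choose a size that suits your hardware.
Transformer layers - dimension: the number of Transformer layers and the feature dimension they output. Pay particular attention here, because features extracted from different layers can perform differently on different tasks.
Feature rate: the number of feature frames the model outputs per second of input audio.
Sample rate: the sampling rate of the audio the model was trained on.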
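
The feature rate and sample rate together determine how many feature frames to expect and how many audio samples each frame covers. The following is a minimal arithmetic sketch; the function name `frames_and_hop` is hypothetical, and the real model may return a frame count that differs slightly because of convolution edge effects.

```python
def frames_and_hop(duration_s: float, sample_rate_hz: int, feature_rate_hz: int):
    """Return (approximate number of feature frames, audio samples per frame)."""
    samples_per_frame = sample_rate_hz // feature_rate_hz
    n_frames = int(duration_s * feature_rate_hz)
    return n_frames, samples_per_frame


# MERT-v1 settings from the table: 24 kHz audio, 75 Hz features, 5 s context.
print(frames_and_hop(5, 24_000, 75))
# MERT-v0 settings from the table: 16 kHz audio, 50 Hz features, 5 s context.
print(frames_and_hop(5, 16_000, 50))
```

Both configurations work out to 320 audio samples per feature frame, so the higher feature rate of MERT-v1 follows directly from its higher sample rate.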

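Introduction to MERT-v1

Compared with MERT-v0, MERT-v1 introduces several new features in pre-training:

Pseudo-labels are changed to 8 codebooks from EnCodec, which potentially have higher quality and enable the model to support music generation.
In-batch noise mixup is applied to MLM prediction.
Training uses a higher audio sampling rate (24 kHz).
Training uses more audio data (up to 160,000 hours).
More model sizes are available: 95M and 330M.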
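
To give an intuition for in-batch noise mixup, the sketch below adds a scaled copy of another clip from the same batch to some of the clips. It illustrates the general idea only and is not the authors' implementation: the function name `in_batch_mixup`, the mixing probability, the gain, and the pairing rule are all assumptions made for this example.

```python
import torch


def in_batch_mixup(batch, prob=0.5, noise_gain=0.1, generator=None):
    """Mix each waveform with another waveform from the same batch.

    batch: tensor of shape [batch_size, num_samples].
    prob, noise_gain: illustrative values, not taken from the MERT paper.
    """
    if batch.dim() != 2:
        raise ValueError("expected a [batch_size, num_samples] tensor")
    batch_size = batch.shape[0]
    if batch_size < 2:
        return batch.clone()  # no other clip to mix with

    # Rolling by a non-zero offset pairs every clip with a different clip.
    shift = int(torch.randint(1, batch_size, (1,), generator=generator))
    partners = batch.roll(shifts=shift, dims=0)

    # Decide per clip whether to add its partner as background "noise".
    apply = torch.rand(batch_size, generator=generator) < prob
    apply = apply.to(device=batch.device, dtype=batch.dtype)
    return batch + apply.unsqueeze(1) * noise_gain * partners


waves = torch.randn(4, 24_000)  # four 1-second stand-in clips at 24 kHz
mixed = in_batch_mixup(waves, prob=0.5, noise_gain=0.1)
print(mixed.shape)
```

In a setup like this, the training targets would still come from the clean clip, so the model has to predict them despite the added interference.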

More details are described in the arXiv preprint (see the citation below).

💻 Usage examples

Basic usage

# from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset


# loading our model weights
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
# loading the corresponding preprocessor config
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-95M",trust_remote_code=True)

# load demo audio and set processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

resample_rate = processor.sampling_rate
# make sure the sample_rate aligned
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

# audio file is decoded on the fly
input_audio = dataset[0]["audio"]["array"]
if resampler is not None:
    # cast to float32 so the waveform dtype matches the resampler's kernel
    input_audio = resampler(torch.from_numpy(input_audio).float())
  
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape, there are 13 layers of representation
# each layer performs differently in different downstream tasks, you should choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [13 layer, Time steps, 768 feature_dim]

# for utterance level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [13, 768]

# you can even use a learnable weighted average representation
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape) # [768]
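
The hard-coded 13 and 768 in the example above match MERT-v1-95M (12 Transformer layers plus the embedding output). Per the model table, MERT-v1-330M has 24 layers with 1024 dimensions, so the same code should yield 25 layers of 1024-dimensional features. The sketch below reads the sizes from the tensor instead of hard-coding them. It swaps the `nn.Conv1d` aggregator for a softmax-weighted average, and the class name `LayerWeightedAverage` is hypothetical. Random tensors stand in for real hidden states so that nothing needs to be downloaded.

```python
import torch
from torch import nn


class LayerWeightedAverage(nn.Module):
    """Learnable softmax-weighted average over the layer axis."""

    def __init__(self, num_layers: int):
        super().__init__()
        # Zero logits give uniform weights at initialization.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_layers, time_steps, feature_dim]
        clip_level = hidden_states.mean(dim=1)              # [num_layers, feature_dim]
        weights = torch.softmax(self.layer_logits, dim=0)   # [num_layers]
        return (weights.unsqueeze(1) * clip_level).sum(dim=0)  # [feature_dim]


# Stand-in shapes: (13, 768) as in the 95M example, (25, 1024) expected for 330M.
for num_layers, feature_dim in [(13, 768), (25, 1024)]:
    fake_hidden_states = torch.randn(num_layers, 375, feature_dim)
    aggregator = LayerWeightedAverage(fake_hidden_states.shape[0])
    print(aggregator(fake_hidden_states).shape)
```

With real outputs, `fake_hidden_states` would be replaced by `torch.stack(outputs.hidden_states).squeeze()` from the example above, and the layer weights would be trained together with the downstream classifier.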

📄 License

This project is released under the CC-BY-NC-4.0 license.

📚 Citation

@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training}, 
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}