MERT-v1-330M: Open-Source Music Understanding Model Supporting Multiple Music Information Retrieval Tasks, Free to Deploy

MERT V1 330M

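Developed by m-a-p

MERT-v1-330M is an advanced music understanding model trained under the MLM paradigm. It has 330M parameters, accepts audio sampled at 24 kHz, and is suited to a range of music information retrieval tasks.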

Audio Classification

Transformers

#MusicUnderstanding #HighSampleRateAudio #MultiCodebookPseudoLabels

Downloads: 16.92k

Release date: 3/17/2023

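Model overview

The model is pre-trained with the masked language modeling (MLM) paradigm on a large-scale music dataset (160,000 hours). It extracts and interprets musical features well and is suited to downstream tasks such as music classification and music generation.

Model features

Large-scale pre-training

Trained on 160,000 hours of music data covering a wide range of styles and genres

High-quality audio processing

Accepts 24 kHz audio input, which captures richer musical detail

Improved MLM paradigm

Uses 8-codebook pseudo labels from EnCodec and in-batch noise mixing to improve pre-training

Multi-task generalization

Generalizes well across downstream music understanding tasks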

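Model capabilities

Music feature extraction

Music style classification

Music emotion recognition

Music generation support

Use cases

Music recommendation systems

Music style classification

Automatically identifies and classifies the stylistic characteristics of musical works

Can serve as a front-end processing step for personalized music recommendation systems

Music content analysis

Music emotion analysis

Analyzes the emotional characteristics expressed by musical works

Applicable to scenarios such as music therapy and emotion recognition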

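🚀 Introduction to the Music Audio Pre-training (m-a-p) Model Family

The music audio pre-training (m-a-p) model family in this project targets music understanding and processing. Its models differ in pre-training paradigm and technical configuration, and together they support a wide range of tasks in the music domain.

🚀 Quick Start

Model development log

June 2, 2023: The arXiv preprint and training code were released.
March 17, 2023: Released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the earlier models and generalize better to more tasks.
March 14, 2023: Retrained the MERT-v0 model using only open-source music datasets, yielding [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
December 29, 2022: Released [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), a music understanding model trained with the MLM paradigm that performs better on downstream tasks.
October 29, 2022: Released [music2vec](https://huggingface.co/m-a-p/music2vec-v1), a pre-trained MIR model trained with the BYOL paradigm.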

Model quick-selection table

Property	Details
Models	[MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M), [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public), [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), [music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1)
Pre-training paradigm	MLM or BYOL
Training data (hours)	900 to 160,000
Pre-training context (seconds)	5 for most models; 30 for music2vec-v1
Model size	Several sizes, including 95M and 330M
Transformer layers - dimension	e.g. 24-1024 or 12-768
Feature rate	50 Hz or 75 Hz
Sample rate	16 kHz or 24 kHz
Release date	October 30, 2022 to June 2, 2023
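
The table lists a 5-second pre-training context for most models. One way to keep inference inputs near that length is to split a long recording into fixed windows before feature extraction. The sketch below only computes sample-index ranges; the helper name `split_into_windows` and its defaults (24 kHz, 5 seconds) are illustrative assumptions, and whether windowing helps a given task should be checked empirically.

```python
def split_into_windows(num_samples, sample_rate=24_000, window_seconds=5.0,
                       keep_remainder=True):
    """Return (start, end) sample-index pairs that cover a recording
    in consecutive fixed-length windows (hypothetical helper)."""
    window = int(round(sample_rate * window_seconds))
    if window <= 0:
        raise ValueError("window must contain at least one sample")

    spans = []
    for start in range(0, num_samples, window):
        end = min(start + window, num_samples)
        # Optionally drop a final window that is shorter than the others.
        if end - start < window and not keep_remainder:
            break
        spans.append((start, end))
    return spans


# A 12-second clip at 24 kHz: two full 5-second windows plus a 2-second tail.
spans = split_into_windows(12 * 24_000)
print(spans)
```

Each pair can then be used to slice the waveform array, for example `waveform[start:end]`, before passing the slice to the processor.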

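✨ Key Features

Similarities and differences among the models

The m-a-p models share a similar architecture; the most notable difference among them is the paradigm used in pre-training. Beyond that, a few finer technical configurations are worth knowing before use:

Model size: the number of parameters loaded into memory. Choose a size that fits your hardware.
Transformer layers - dimension: the number of Transformer layers the model can output and the corresponding feature dimension. This matters because features extracted from different layers may perform differently on different tasks.
Feature rate: the number of feature frames the model outputs for 1 second of audio input.
Sample rate: the audio sampling frequency the model was trained on.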
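
To make the feature rate and sample rate definitions concrete, the following sketch estimates how many feature frames a clip produces. The function name `approx_num_frames` is hypothetical, and the result is only an approximation: the model's convolutional front end may return a frame or so fewer at the edges. Pairing 75 Hz with 24 kHz and 50 Hz with 16 kHz in the example is an assumption; check the card of the checkpoint you use.

```python
import math


def approx_num_frames(num_samples, sample_rate, feature_rate):
    """Estimate the number of output feature frames for a clip.

    num_samples:  length of the waveform in samples
    sample_rate:  samples per second of that waveform (Hz)
    feature_rate: feature frames the model emits per second of audio (Hz)
    """
    duration_seconds = num_samples / sample_rate
    return math.floor(duration_seconds * feature_rate)


# 5 seconds of 24 kHz audio at an assumed 75 Hz feature rate.
print(approx_num_frames(5 * 24_000, sample_rate=24_000, feature_rate=75))
# 5 seconds of 16 kHz audio at an assumed 50 Hz feature rate.
print(approx_num_frames(5 * 16_000, sample_rate=16_000, feature_rate=50))
```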

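What's new in MERT-v1

Compared with MERT-v0, MERT-v1 introduces several new features in pre-training:

Changed the pseudo labels to 8 codebooks from EnCodec, which are potentially of higher quality and enable the model to support music generation.
Added in-batch noise mixing for MLM prediction.
Trained with a higher audio sampling frequency (24 kHz).
Trained with more audio data (up to 160,000 hours).
More model sizes available, such as 95M and 330M.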
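
The in-batch noise mixing idea from the list above can be illustrated with a simplified sketch: some clips in a batch get another clip from the same batch added to them at reduced volume. This is not the authors' training code. The function name `mix_in_batch_noise`, the parameters `mix_prob` and `noise_scale`, and their default values are all assumptions chosen for illustration.

```python
import torch


def mix_in_batch_noise(waveforms, mix_prob=0.5, noise_scale=0.3, generator=None):
    """Add a scaled copy of another clip from the same batch to some clips.

    waveforms: float tensor of shape [batch, samples], assumed to be on the CPU.
    Returns a new tensor; the input is left unchanged.
    """
    batch_size = waveforms.shape[0]
    if batch_size < 2:
        # Nothing to mix with when the batch holds a single clip.
        return waveforms.clone()

    # Shift indices by a random non-zero offset so no clip is paired with itself.
    offset = torch.randint(1, batch_size, (1,), generator=generator).item()
    partner = (torch.arange(batch_size) + offset) % batch_size

    # Decide independently for each clip whether mixing is applied.
    apply = torch.rand(batch_size, generator=generator) < mix_prob

    noise = waveforms[partner] * noise_scale
    return waveforms + noise * apply.unsqueeze(1).to(waveforms.dtype)


batch = torch.randn(4, 24_000)  # four 1-second clips at 24 kHz (random stand-ins)
mixed = mix_in_batch_noise(batch)
print(mixed.shape)
```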

💻 Usage Examples

Basic usage

# from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# loading our model weights
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
# loading the corresponding preprocessor config
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M",trust_remote_code=True)

# load demo audio and set processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

resample_rate = processor.sampling_rate
# make sure the sample_rate aligned
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

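# audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    # cast to float32 first: the resampling kernel is float32 by default,
    # so a float64 array would otherwise raise a dtype mismatch
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).float())
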
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape, there are 25 layers of representation
# each layer performs differently in different downstream tasks, you should choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [25 layer, Time steps, 1024 feature_dim]

# for utterance level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [25, 1024]

# you can even use a learnable weighted average representation
aggregator = nn.Conv1d(in_channels=25, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape) # [1024]
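
The example above calls `squeeze()`, which is convenient for a single clip but drops the batch dimension. A batch-preserving variant of the same time-average plus learnable layer weighting could look like the sketch below. The class name `LayerWeightedPooling` is hypothetical, random tensors stand in for `outputs.hidden_states`, and the time average here treats every frame equally, so padded frames in a batch of unequal-length clips would be included.

```python
import torch
from torch import nn


class LayerWeightedPooling(nn.Module):
    """Average over time, then combine layers with learnable weights."""

    def __init__(self, num_layers=25):
        super().__init__()
        # A kernel-size-1 convolution over the layer axis is a learnable
        # weighted sum of the per-layer representations (plus a bias).
        self.aggregator = nn.Conv1d(in_channels=num_layers, out_channels=1,
                                    kernel_size=1)

    def forward(self, hidden_states):
        # hidden_states: sequence of num_layers tensors, each [batch, time, dim]
        stacked = torch.stack(tuple(hidden_states), dim=1)  # [batch, layers, time, dim]
        time_avg = stacked.mean(dim=2)                       # [batch, layers, dim]
        return self.aggregator(time_avg).squeeze(1)          # [batch, dim]


# Stand-in for outputs.hidden_states: 25 layers, batch of 2, 10 frames, 1024 dims.
fake_hidden_states = tuple(torch.randn(2, 10, 1024) for _ in range(25))
pooled = LayerWeightedPooling(num_layers=25)(fake_hidden_states)
print(pooled.shape)
```

With the real model, `outputs.hidden_states` from a batched forward pass can be passed in place of `fake_hidden_states`.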

📚 Documentation

Further details will be covered in our forthcoming paper.

📄 License

This project is released under the CC-BY-NC-4.0 license.

📖 Citation

@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training}, 
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}