MERT-v1-330M: an open-source music understanding model with free support for music information retrieval tasks

MERT V1 95M

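Developed by m-a-p

MERT-v1-330M is an advanced music understanding model trained with the MLM paradigm. It has 330M parameters, takes 24 kHz audio as input, and outputs features at 75 Hz, which makes it suitable for a range of music information retrieval tasks.

Audio classification

Transformers

#music-understanding #high-sampling-rate #large-scale-pretraining

Downloads: 83.72k

Released: March 17, 2023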

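Model overview

MERT-v1-330M is a pretrained model for music audio. It is trained with the MLM paradigm, generalizes better across tasks, uses a higher audio sampling rate than earlier models, and is suited to tasks such as music classification and music generation.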

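Model features

High audio sampling rate

Supports 24 kHz audio input, compared with the 16 kHz used by the earlier MERT-v0 models.

Large-scale training data

Trained on 160K hours of music, which improves generalization.

Multi-codebook pseudo-labels

Uses pseudo-labels from 8 EnCodec codebooks, which may be of higher quality and lets the model support music generation.

In-batch noise mixing

Adds in-batch noise mixing to MLM prediction, which makes the model more robust.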

Model capabilities

Music classification

Music information retrieval

Music generation

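Use cases

Music analysis

Music style classification

Classify music clips by style, such as pop, classical, or jazz.

Outperforms previous-generation models on several downstream tasks.

Music emotion recognition

Identify emotional characteristics in music, such as happiness, sadness, or anger.

Music generation

Music clip generation

Generate new music clips from input audio features.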

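🚀 Introduction to the Music Audio Pre-training (m-a-p) model family

This project develops the Music Audio Pre-training (m-a-p) model family, which aims to provide strong support for music-related tasks. Successive model iterations have improved performance in music understanding and generation.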

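🚀 Quick start

Model development log

June 2, 2023: arXiv preprint and training code released.
March 17, 2023: released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new pretraining paradigm and new datasets. They outperform the previous models and generalize better to more tasks.
March 14, 2023: retrained the MERT-v0 model on open-source music datasets only, producing [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
December 29, 2022: released [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), a music understanding model trained with the MLM paradigm that performs better on downstream tasks.
October 29, 2022: released [music2vec](https://huggingface.co/m-a-p/music2vec-v1), a pretrained MIR model trained with the BYOL paradigm.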

Model quick selection table

Attribute	Details
Model types	MERT-v1-330M, MERT-v1-95M, MERT-v0-public, MERT-v0, music2vec-v1
Training data	Audio datasets of varying size: 160K, 20K, 900, and 1000 hours

Name	Pretraining paradigm	Training data (hours)	Pretraining context (seconds)	Model size	Transformer layers-dimension	Feature rate	Sampling rate	Release date
[MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M)	MLM	160K	5	330M	24-1024	75 Hz	24 kHz	March 17, 2023
[MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M)	MLM	20K	5	95M	12-768	75 Hz	24 kHz	March 17, 2023
[MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public)	MLM	900	5	95M	12-768	50 Hz	16 kHz	March 14, 2023
[MERT-v0](https://huggingface.co/m-a-p/MERT-v0)	MLM	1000	5	95M	12-768	50 Hz	16 kHz	December 29, 2022
[music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1)	BYOL	1000	30	95M	12-768	50 Hz	16 kHz	October 30, 2022
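
The rows of the selection table can also be held as data and filtered by hardware or input constraints. The following is a minimal sketch, not part of any m-a-p tooling: the `MODELS` list copies values from the table, while `pick_models` and its `max_params_m` and `sampling_rate_hz` parameters are hypothetical names introduced here for illustration.

```python
# Rows copied from the model selection table (sizes in millions of parameters).
MODELS = [
    {"name": "m-a-p/MERT-v1-330M", "paradigm": "MLM", "params_m": 330,
     "layers": 24, "dim": 1024, "feature_rate_hz": 75, "sampling_rate_hz": 24_000},
    {"name": "m-a-p/MERT-v1-95M", "paradigm": "MLM", "params_m": 95,
     "layers": 12, "dim": 768, "feature_rate_hz": 75, "sampling_rate_hz": 24_000},
    {"name": "m-a-p/MERT-v0-public", "paradigm": "MLM", "params_m": 95,
     "layers": 12, "dim": 768, "feature_rate_hz": 50, "sampling_rate_hz": 16_000},
    {"name": "m-a-p/MERT-v0", "paradigm": "MLM", "params_m": 95,
     "layers": 12, "dim": 768, "feature_rate_hz": 50, "sampling_rate_hz": 16_000},
    {"name": "m-a-p/music2vec-v1", "paradigm": "BYOL", "params_m": 95,
     "layers": 12, "dim": 768, "feature_rate_hz": 50, "sampling_rate_hz": 16_000},
]


def pick_models(max_params_m=None, sampling_rate_hz=None):
    """Return the names of models that satisfy the given constraints.

    Hypothetical helper: a constraint left as None is ignored.
    """
    names = []
    for row in MODELS:
        if max_params_m is not None and row["params_m"] > max_params_m:
            continue
        if sampling_rate_hz is not None and row["sampling_rate_hz"] != sampling_rate_hz:
            continue
        names.append(row["name"])
    return names


# Example: models under 100M parameters that take 24 kHz audio.
print(pick_models(max_params_m=100, sampling_rate_hz=24_000))
```

The returned names are Hugging Face repository identifiers, so they can be passed directly to `AutoModel.from_pretrained`.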

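✨ Main features

Model explanation

The m-a-p models share a similar architecture; the most notable difference between them is the pretraining paradigm. A few technical settings are also worth understanding before use:

Model size: the number of parameters loaded into memory. Choose a size that fits your hardware.
Transformer layers-dimension: the number of Transformer layers and the dimension of the features each layer outputs. This deserves particular attention because features extracted from different layers can perform differently on different tasks.
Feature rate: the number of feature vectors the model outputs for 1 second of input audio.
Sampling rate: the sampling rate of the audio the model was trained on. Input audio should be resampled to this rate.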
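
The settings above translate into simple arithmetic on input and output sizes. The sketch below is illustrative only: the function names are hypothetical, the frame count is an approximation (the convolutional front end may yield slightly fewer frames at clip boundaries), and the extra hidden-state entry assumes the embedding output is returned alongside each Transformer layer, as the 13 layers reported for the 12-layer model in the usage example suggest.

```python
def approx_num_frames(duration_s, feature_rate_hz):
    """Approximate number of feature vectors for a clip of the given length."""
    return int(duration_s * feature_rate_hz)


def samples_per_frame(sampling_rate_hz, feature_rate_hz):
    """Number of raw audio samples summarized by one feature vector."""
    return sampling_rate_hz // feature_rate_hz


def hidden_states_shape(num_layers, dim, duration_s, feature_rate_hz):
    """Approximate shape of the stacked hidden states for one clip.

    Assumes one extra entry for the embedding output before the first layer.
    """
    return (num_layers + 1, approx_num_frames(duration_s, feature_rate_hz), dim)


# A 5-second clip at the 75 Hz feature rate of MERT-v1.
print(approx_num_frames(5, 75))
# Both model generations cover the same number of samples per feature vector.
print(samples_per_frame(24_000, 75), samples_per_frame(16_000, 50))
# Stacked hidden states for MERT-v1-95M (12 layers, 768 dimensions).
print(hidden_states_shape(12, 768, 5, 75))
```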

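Introduction to MERT-v1

Compared with MERT-v0, MERT-v1 introduces several new features in pretraining:

Pseudo-labels now come from 8 EnCodec codebooks, which may be of higher quality and lets the model support music generation.
MLM prediction uses in-batch noise mixing.
Training uses a higher audio sampling rate (24 kHz).
Training uses more audio data (up to 160,000 hours).
More model sizes are available: 95M and 330M.

More details will be given in the forthcoming paper.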
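
To make the idea of in-batch noise mixing more concrete, here is a minimal, generic sketch of one plausible reading: each clip is, with some probability, overlaid with another clip from the same batch at reduced gain. This is not the authors' implementation. The source does not specify the procedure, and `mix_in_batch_noise`, `mix_prob`, `noise_gain`, and `seed` are hypothetical names and knobs chosen for illustration.

```python
import random


def mix_in_batch_noise(batch, mix_prob=0.5, noise_gain=0.3, seed=0):
    """Return a copy of `batch` in which some clips have another clip added as noise.

    batch: list of equal-length lists of float samples.
    mix_prob, noise_gain, seed: illustrative values, not taken from MERT.
    """
    rng = random.Random(seed)
    mixed = []
    for i, clip in enumerate(batch):
        if len(batch) > 1 and rng.random() < mix_prob:
            # Pick a different clip from the same batch to act as noise.
            j = rng.choice([k for k in range(len(batch)) if k != i])
            clip = [s + noise_gain * n for s, n in zip(clip, batch[j])]
        else:
            clip = list(clip)
        mixed.append(clip)
    return mixed


# Two tiny "clips"; with mix_prob=1.0 every clip receives noise from the other.
print(mix_in_batch_noise([[1.0, 1.0], [0.0, 0.0]], mix_prob=1.0, noise_gain=0.5))
```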

💻 Usage example

Basic usage

# from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset


# loading our model weights
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
# loading the corresponding preprocessor config
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-95M",trust_remote_code=True)

# load demo audio and set processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

resample_rate = processor.sampling_rate
# make sure the sample_rate aligned
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

# audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    # cast to float32 so the dtype matches the resampler's default float32 kernel
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).float())
  
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape, there are 13 layers of representation
# each layer performs differently in different downstream tasks, you should choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [13 layer, Time steps, 768 feature_dim]

# for utterance level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [13, 768]

# you can even use a learnable weighted average representation
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape) # [768]
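
The example above embeds a single short clip. For longer tracks, one option is to split the waveform into windows that match the 5-second pretraining context listed in the model table and embed each window separately. This is a sketch under that assumption: the source does not prescribe a windowing strategy, and `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(num_samples, sampling_rate_hz=24_000, window_s=5.0):
    """Return (start, end) sample indices for consecutive non-overlapping windows.

    The last window may be shorter than window_s if the clip length
    is not a multiple of the window length.
    """
    window = int(window_s * sampling_rate_hz)
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 12-second clip at 24 kHz yields two full windows and one 2-second remainder.
print(chunk_bounds(12 * 24_000))
```

Each `(start, end)` pair can be used to slice the resampled waveform before it is passed to the processor, and the per-window embeddings can then be averaged for a track-level representation.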

📄 License

This project is licensed under CC-BY-NC-4.0.

📚 Citation

@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training}, 
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}