MERT-v1-330M open-source music understanding model - supports multiple music information retrieval tasks, free to deploy

MERT V1 330M

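Developed by m-a-p

MERT-v1-330M is an advanced music understanding model trained with the MLM paradigm. It has 330M parameters, takes audio sampled at 24 kHz, and is suited to a range of music information retrieval tasks.

Audio classification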

Transformers

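#music-understanding #high-sample-rate-audio #multi-codebook-pseudo-labels

Downloads: 16.92k

Released: 3/17/2023

Model overview

The model is pre-trained with masked language modeling (MLM) on a large music dataset (160,000 hours). It extracts musical features that transfer to downstream tasks such as music classification and music generation.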

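Model features

Large-scale pre-training

Trained on 160,000 hours of music covering a wide range of styles and genres

High-quality audio processing

Accepts 24 kHz audio input, which captures finer musical detail

Improved MLM paradigm

Uses pseudo-labels from 8 EnCodec codebooks together with in-batch noise mixing to improve pre-training

Multi-task generalization

Generalizes well across downstream music understanding tasks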

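Model capabilities

Music feature extraction

Music style classification

Music emotion recognition

Music generation support

Use cases

Music recommendation systems

Music style classification

Automatically identifies and classifies the stylistic characteristics of musical works

Can serve as a front-end processing step for personalized music recommendation systems

Music content analysis

Music emotion analysis

Analyzes the emotional characteristics expressed by musical works

Suitable for applications such as music therapy and emotion recognition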

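🚀 Introduction to the Music Audio Pre-training (m-a-p) model family

The Music Audio Pre-training (m-a-p) model family targets music understanding and processing. Its models differ in pre-training paradigm and technical configuration, and they support a variety of music-related tasks.

🚀 Quick start

Model development log

June 2, 2023: arXiv preprint and training code released.
March 17, 2023: released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
March 14, 2023: retrained the MERT-v0 model using only open-source music datasets, producing [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
December 29, 2022: released [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), a music understanding model trained with the MLM paradigm that performs better on downstream tasks.
October 29, 2022: released [music2vec](https://huggingface.co/m-a-p/music2vec-v1), a pre-trained MIR model trained with the BYOL paradigm.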

Quick model selection table

Attribute	Details
Model type	[MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M), [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public), [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), [music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1)
Pre-training paradigm	MLM or BYOL
Training data (hours)	From 900 to 160,000
Pre-training context (seconds)	Mostly 5; 30 for music2vec-v1
Model size	Various sizes, such as 95M and 330M
Transformer layers - dimension	For example, 24-1024 or 12-768
Feature rate	50 Hz or 75 Hz
Sampling rate	16 kHz or 24 kHz
Release date	From October 30, 2022 to June 2, 2023
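
The table lists a 5-second pre-training context for most models. One possible way to handle longer tracks, offered here as an assumption rather than a documented recommendation, is to split the waveform into consecutive 5-second windows before feature extraction. The helper name `window_bounds` and its parameters are hypothetical and serve only as an illustration.

```python
def window_bounds(num_samples, sampling_rate, window_seconds=5.0):
    """Return (start, end) sample indices of consecutive non-overlapping windows.

    The last window may be shorter than the others.
    """
    if num_samples < 0 or sampling_rate <= 0 or window_seconds <= 0:
        raise ValueError(
            "num_samples must be non-negative; "
            "sampling_rate and window_seconds must be positive"
        )
    # Window length in samples, at least one sample long.
    window = max(1, int(round(sampling_rate * window_seconds)))
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 12-second clip at 24 kHz: two full windows plus a 2-second remainder.
bounds = window_bounds(num_samples=12 * 24000, sampling_rate=24000)
print(bounds)
```

Each `(start, end)` pair can then be used to slice the waveform, and the per-window features can be averaged or concatenated, depending on the task.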

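✨ Key features

Similarities and differences between models

The m-a-p models share a similar architecture; the most significant difference is the paradigm used in pre-training. A few technical settings are also worth knowing before use:

Model size: the number of parameters loaded into memory. Choose a size that fits your hardware.
Transformer layers - dimension: the number of Transformer layers whose outputs the model can return, and the feature dimension of each. This matters because features from different layers can perform differently on different tasks.
Feature rate: the number of feature frames the model outputs per second of input audio.
Sampling rate: the sampling rate of the audio the model was trained on.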
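
As a rough illustration of how feature rate and sampling rate relate, the sketch below estimates how many feature frames a clip produces and how many input samples each frame covers. Pairing 24 kHz with 75 Hz and 16 kHz with 50 Hz is an assumption, since the selection table lists these values without pairing them. The function names are hypothetical, and the real frame count may differ by a frame or so because of how the model frames its input.

```python
def approx_num_features(duration_seconds, feature_rate_hz):
    """Approximate number of feature frames for a clip of the given length."""
    return int(duration_seconds * feature_rate_hz)


def samples_per_feature(sampling_rate_hz, feature_rate_hz):
    """Number of input samples covered by one output frame (the hop size)."""
    return sampling_rate_hz / feature_rate_hz


print(approx_num_features(5.0, 75))    # 375
print(samples_per_feature(24000, 75))  # 320.0
print(samples_per_feature(16000, 50))  # 320.0
```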

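New features in MERT-v1

Compared with MERT-v0, MERT-v1 introduces several changes in pre-training:

Pseudo-labels now come from 8 EnCodec codebooks, which are potentially of higher quality and enable the model to support music generation.
In-batch noise mixing is applied for MLM prediction.
Training uses a higher audio sampling rate (24 kHz).
Training uses more audio data (up to 160,000 hours).
More model sizes are available, such as 95M and 330M.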
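
To give a feel for what in-batch noise mixing could look like, here is a minimal sketch that adds a scaled copy of another clip from the same batch to each training clip. This is not the authors' implementation: the function name `in_batch_noise_mix`, the mixing probability, and the noise scale are all hypothetical values chosen for illustration.

```python
import torch


def in_batch_noise_mix(batch, mix_prob=0.5, noise_scale=0.1, generator=None):
    """Additively mix each clip with a clip drawn from the same batch.

    batch: float tensor of shape [batch_size, num_samples].
    mix_prob and noise_scale are illustrative, not values from the paper.
    """
    batch_size = batch.shape[0]
    # Pick a partner clip for every item by shuffling the batch indices
    # (a clip may occasionally be paired with itself).
    partner = torch.randperm(batch_size, generator=generator)
    # Decide per clip whether mixing is applied.
    apply = torch.rand(batch_size, generator=generator) < mix_prob
    gain = apply.to(batch.dtype).unsqueeze(1) * noise_scale
    return batch + gain * batch[partner]


generator = torch.Generator().manual_seed(0)
clips = torch.randn(4, 24000, generator=generator)  # four 1-second stand-in clips
mixed = in_batch_noise_mix(clips, generator=generator)
print(mixed.shape)  # torch.Size([4, 24000])
```

The model is then asked to predict the pseudo-labels of the clean clip from the mixed input, which is the general idea behind this kind of augmentation.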

💻 Usage example

Basic usage

from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# loading our model weights
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
# loading the corresponding preprocessor config
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)

# load demo audio (a LibriSpeech speech clip, used here only to exercise the pipeline)
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

resample_rate = processor.sampling_rate
# make sure the audio sampling rate matches the rate the model expects
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

# audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    # cast to float32, the dtype the default resampling kernel uses
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).float())

inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape, there are 25 layers of representation
# each layer performs differently in different downstream tasks, you should choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [25 layer, Time steps, 1024 feature_dim]

# for utterance level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [25, 1024]

# you can even use a learnable weighted average representation
aggregator = nn.Conv1d(in_channels=25, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape) # [1024]
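
As an alternative to the `Conv1d` aggregator above, a softmax-weighted sum over layers keeps the layer weights positive and summing to one, which makes them easier to read as "how much each layer contributes". The sketch below is one possible design, not part of the original model card. The class name `WeightedLayerPooling` is hypothetical, and random tensors stand in for the time-averaged hidden states so the block runs without downloading the model.

```python
import torch
from torch import nn


class WeightedLayerPooling(nn.Module):
    """Softmax-weighted sum over layers; weights start out uniform."""

    def __init__(self, num_layers=25):
        super().__init__()
        # One learnable logit per layer, initialized to zero (uniform weights).
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_features):
        # layer_features: [batch, num_layers, feature_dim]
        weights = torch.softmax(self.layer_logits, dim=0)
        return torch.einsum("l,bld->bd", weights, layer_features)


# Random stand-in for the time-averaged hidden states of two clips.
features = torch.randn(2, 25, 1024)
pooled = WeightedLayerPooling(num_layers=25)(features)
print(pooled.shape)  # torch.Size([2, 1024])
```

Before any training the module simply averages the layers; once trained with a downstream classifier, the learned weights indicate which layers are most useful for that task.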

📚 Detailed documentation

More details are given in the paper (arXiv:2306.00107), cited below.

📄 License

This project is released under the CC-BY-NC-4.0 license.

📖 Citation

@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training}, 
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}