Introduction to the Music Audio Pre-training (m-a-p) Model Family
This README provides an overview of the Music Audio Pre-training (m-a-p) model series, including their development log, key features, and usage examples. The models are designed for audio classification tasks in the music domain and offer a range of options to suit different needs.
Quick Start
To get started quickly, refer to the following code example:
```python
from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# Load the MERT-v1-330M model and its feature extractor
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)

# Load a demo dataset and check its sampling rate
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

# Resample the audio to the rate expected by the model (24 kHz for MERT-v1)
resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    # cast to float32 so the waveform dtype matches the resampler kernel
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).float())

# Run the model and keep the hidden states of every layer
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# 25 layers of representation: [25 layers, time steps, 1024 dimensions] for MERT-v1-330M
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)

# For utterance-level tasks, reduce over the time dimension: [25, 1024]
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)

# Optionally learn a weighted average over the 25 layers: [1024]
aggregator = nn.Conv1d(in_channels=25, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)
```
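The models were pre-trained on relatively short audio contexts (3 to 30 seconds, per the comparison table below), so long recordings are usually split into fixed-length chunks before feature extraction and the per-chunk representations pooled afterwards. The snippet below is a minimal sketch of that chunking step; the 5-second window and the `chunk_audio` helper are illustrative choices, not part of the released code.

```python
import torch

def chunk_audio(waveform: torch.Tensor, sample_rate: int, chunk_seconds: float = 5.0):
    """Split a 1-D waveform into non-overlapping chunks of `chunk_seconds` each.

    The 5-second default is an illustrative choice; any length within the
    model's pre-training context works, subject to your memory budget.
    """
    chunk_len = int(sample_rate * chunk_seconds)
    n_chunks = waveform.shape[-1] // chunk_len
    # A trailing remainder shorter than one chunk is dropped here; pad it instead
    # if you need full coverage of the recording.
    return [waveform[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# Each chunk can then be passed through `processor` and `model` exactly as above,
# and the resulting per-chunk features averaged (or otherwise pooled).
```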
Features
Development Log
- 02/06/2023: arXiv pre-print and training code released.
- 17/03/2023: Released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
- 14/03/2023: Retrained the MERT-v0 model with an open-source-only music dataset: [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
- 29/12/2022: Released [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), a music understanding model trained with the MLM paradigm that performs better on downstream tasks.
- 29/10/2022: Released [music2vec](https://huggingface.co/m-a-p/music2vec-v1), a pre-trained MIR model trained with the BYOL paradigm.
Model Comparison
| Property | Details |
|----------|---------|
| Model Type | Multiple models, including MERT-v1-330M, MERT-v1-95M, MERT-v0-public, MERT-v0, and music2vec-v1 |
| Pre-train Paradigm | MLM, BYOL |
| Training Data | Ranging from 900 hours to 160K hours |
| Pre-train Context | 3 to 30 seconds |
| Model Size | 95M, 330M |
| Transformer Layer-Dimension | 12-768, 24-1024 |
| Feature Rate | 50 Hz, 75 Hz |
| Sample Rate | 16 kHz, 24 kHz |
| Release Date | 29/10/2022 to 02/06/2023 |
Key Improvements in MERT-v1
- Changed the pseudo labels to 8 codebooks from EnCodec, potentially offering higher quality and enabling music generation.
- MLM prediction with in-batch noise mixture (see the sketch after this list).
- Trained at a higher audio sample rate (24 kHz).
- Trained with more audio data (up to 160K hours).
- More available model sizes: 95M and 330M.
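"MLM prediction with in-batch noise mixture" means that clips are mixed with other audio drawn from the same training batch before the masked-prediction objective is applied. The sketch below only illustrates the general idea; the mixing probability, target SNR, and the `mix_in_batch_noise` name are illustrative assumptions, not the released training code.

```python
import torch

def mix_in_batch_noise(batch: torch.Tensor, mix_prob: float = 0.5, snr_db: float = 5.0) -> torch.Tensor:
    """Illustrative in-batch noise mixture (hypothetical hyperparameters).

    Each clip may be mixed with another randomly chosen clip from the same batch
    at a fixed signal-to-noise ratio. batch: (batch_size, num_samples) float tensor.
    """
    batch_size = batch.shape[0]
    if batch_size < 2:
        return batch

    # Pair every clip with a different clip by rolling the batch by a random offset.
    shift = int(torch.randint(1, batch_size, (1,)))
    noise = torch.roll(batch, shifts=shift, dims=0)

    # Scale the "noise" clips so the mixture hits the target SNR.
    sig_pow = batch.pow(2).mean(dim=-1, keepdim=True)
    noise_pow = noise.pow(2).mean(dim=-1, keepdim=True).clamp_min(1e-8)
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    mixed = batch + scale * noise

    # Only mix a random subset of the batch; leave the other clips unchanged.
    mask = (torch.rand(batch_size, 1, device=batch.device) < mix_prob).to(batch.dtype)
    return mask * mixed + (1 - mask) * batch
```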
Documentation
Model Explanation
The m-a-p models share a similar architecture; the main difference lies in the pre-training paradigm. Here are the technical configurations to note:
- Model Size: the number of parameters loaded into memory. Choose an appropriate size based on your hardware.
- Transformer Layer-Dimension: the number of transformer layers and the corresponding feature dimension output by the model. Different layers may perform differently depending on the task.
- Feature Rate: the number of feature frames the model outputs per second of audio (see the example after this list).
- Sample Rate: the audio sampling rate used for model training; inputs should be resampled to this rate, as in the Quick Start example.
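To make Feature Rate concrete: at 75 Hz (the rate paired with the 24 kHz MERT-v1 models in the comparison table) a clip of N seconds yields roughly 75·N feature frames per layer, and MERT-v1-330M returns 25 hidden states (the convolutional front-end output plus its 24 transformer layers), which is why the Quick Start aggregator uses `in_channels=25`. The helper below is a small illustration under those assumptions, not an official utility.

```python
def expected_frames(duration_seconds: float, feature_rate_hz: int = 75) -> int:
    """Approximate number of feature frames per layer for a clip of the given length.

    75 Hz corresponds to the 24 kHz MERT-v1 models; use 50 for the 16 kHz models
    (figures taken from the comparison table above).
    """
    return int(duration_seconds * feature_rate_hz)

print(expected_frames(5.0))       # ~375 frames per layer for a 5-second clip at 75 Hz
print(expected_frames(5.0, 50))   # ~250 frames at 50 Hz

# Individual layers can also be probed separately, reusing `outputs` from the Quick Start:
# layer_7 = outputs.hidden_states[7].squeeze(0)   # shape [time, 1024]; choose the layer empirically
```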
Usage Examples
Basic Usage
The basic feature-extraction workflow is the one shown in the Quick Start example above: load the MERT-v1-330M checkpoint, resample the input audio to the processor's sampling rate, and extract the hidden states of all 25 layers.
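Because the representations target classification tasks in the music domain, a common downstream pattern is to freeze the MERT backbone and train a small probe on the pooled hidden states. The sketch below is a hypothetical minimal probe, not part of this repository; the class name `MusicTagProbe`, the number of classes, and the label in the usage comment are illustrative assumptions.

```python
import torch
from torch import nn

class MusicTagProbe(nn.Module):
    """Minimal linear probe over frozen MERT features (illustrative only)."""

    def __init__(self, num_layers: int = 25, hidden_dim: int = 1024, num_classes: int = 10):
        super().__init__()
        # Learnable weighted average over the model's layers, as in the Quick Start aggregator.
        self.layer_weights = nn.Conv1d(in_channels=num_layers, out_channels=1, kernel_size=1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, time_reduced_hidden_states: torch.Tensor) -> torch.Tensor:
        # time_reduced_hidden_states: [batch, num_layers, hidden_dim]
        pooled = self.layer_weights(time_reduced_hidden_states).squeeze(1)  # [batch, hidden_dim]
        return self.classifier(pooled)

# Usage with the Quick Start variables (single example, so add a batch dimension):
# probe = MusicTagProbe(num_classes=10)
# logits = probe(time_reduced_hidden_states.unsqueeze(0))
# loss = nn.functional.cross_entropy(logits, torch.tensor([3]))  # hypothetical label
```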
License
This project is licensed under the CC-BY-NC-4.0 license.
Citation
```bibtex
@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```