# Introduction to the Music Audio Pre-training (m-a-p) Model Family
Our m-a-p model family offers a series of advanced music understanding models, trained with different paradigms and datasets to suit a variety of music-related tasks.
## Quick Start
This README provides an overview of the development log, model details, and usage of the m-a-p model family. The sections below let you quickly understand the features of each model and how to use it.
## Features
- Multiple Models: a variety of models are available, including MERT-v0, MERT-v1-95M, and MERT-v1-330M, trained with different pre-training paradigms.
- Rich Technical Details: model size, transformer layer-dimension, feature rate, and sample rate are provided to help users select the appropriate model.
- Music Generation Support: MERT-v1 has the potential to support music generation by using the 8 codebooks from EnCodec.
## Installation
No specific installation steps were provided in the original README. To use a model, follow the code in the "Usage Examples" section to load the model weights and the corresponding preprocessor.
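As a minimal sketch, the usage example below only relies on the following packages (assuming a standard Python environment; this is not an official requirements list):

```bash
# Install the libraries imported by the usage example below
pip install torch torchaudio transformers datasets
```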
## Usage Examples
### Basic Usage
```python
from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# Load the model weights and the corresponding preprocessor
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)

# Load a demo audio dataset and resample it to the model's sample rate if needed
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]))

inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Stack the hidden states of all layers: 13 layers (embedding output + 12 transformer layers)
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)  # [13, time steps, 768]

# Average over time for clip-level representations
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)  # [13, 768]

# Learnable weighted average over the 13 layers
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)  # [768]
```
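According to the comparison table below, MERT-v1-330M has 24 transformer layers with 1024-dimensional features, so with the usual Hugging Face convention of also returning the embedding output, `output_hidden_states=True` should yield 25 hidden states. A hypothetical adaptation of the snippet above (the layer count is an assumption derived from the table, not an official recipe):

```python
# Hypothetical switch to the larger checkpoint: 25 stacked hidden states of size 1024
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
aggregator = nn.Conv1d(in_channels=25, out_channels=1, kernel_size=1)  # 25 layers instead of 13
```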
## Documentation
### Development Log
The development log of our Music Audio Pre-training (m-a-p) model family:
- 02/06/2023: arXiv pre-print and training code released.
- 17/03/2023: we release two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
- 14/03/2023: we retrained the MERT-v0 model with an open-source-only music dataset: [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
- 29/12/2022: a music understanding model [MERT-v0](https://huggingface.co/m-a-p/MERT-v0) trained with the MLM paradigm, which performs better at downstream tasks.
- 29/10/2022: a pre-trained MIR model [music2vec](https://huggingface.co/m-a-p/music2vec-v1) trained with the BYOL paradigm.
Model Comparison Table
Name |
Pre - train Paradigm |
Training Data (hour) |
Pre - train Context (second) |
Model Size |
Transformer Layer - Dimension |
Feature Rate |
Sample Rate |
Release Date |
[MERT - v1 - 330M](https://huggingface.co/m - a - p/MERT - v1 - 330M) |
MLM |
160K |
5 |
330M |
24 - 1024 |
75 Hz |
24K Hz |
17/03/2023 |
[MERT - v1 - 95M](https://huggingface.co/m - a - p/MERT - v1 - 95M) |
MLM |
20K |
5 |
95M |
12 - 768 |
75 Hz |
24K Hz |
17/03/2023 |
[MERT - v0 - public](https://huggingface.co/m - a - p/MERT - v0 - public) |
MLM |
900 |
5 |
95M |
12 - 768 |
50 Hz |
16K Hz |
14/03/2023 |
[MERT - v0](https://huggingface.co/m - a - p/MERT - v0) |
MLM |
1000 |
5 |
95 M |
12 - 768 |
50 Hz |
16K Hz |
29/12/2022 |
[music2vec - v1](https://huggingface.co/m - a - p/music2vec - v1) |
BYOL |
1000 |
30 |
95 M |
12 - 768 |
50 Hz |
16K Hz |
30/10/2022 |
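To double-check the Transformer Layer-Dimension and Sample Rate columns for a given checkpoint, you can read them off the loaded config and preprocessor. This is a minimal sketch; the attribute names follow the usual Hugging Face conventions and are an assumption for MERT's remote code:

```python
from transformers import AutoConfig, Wav2Vec2FeatureExtractor

# Hypothetical check against the table: layer count, hidden size, and expected sample rate
config = AutoConfig.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)

print(config.num_hidden_layers, config.hidden_size)  # expected: 24, 1024
print(processor.sampling_rate)                       # expected: 24000
```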
### Explanation
The m-a-p models share a similar model architecture; the most distinguishing difference is the pre-training paradigm used. Beyond that, there are several nuanced technical configurations to know before use:
- Model Size: the number of parameters loaded into memory. Please select a size that fits your hardware.
- Transformer Layer-Dimension: the number of transformer layers and the corresponding feature dimensions our model can output. This matters because features extracted from different layers can perform differently depending on the task.
- Feature Rate: the number of features the model outputs for one second of audio input (see the short worked example after this list).
- Sample Rate: the audio sampling frequency the model was trained with.
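As an illustration of how Feature Rate and Sample Rate relate to an input clip, here is a back-of-the-envelope calculation for MERT-v1 using the numbers from the table above (the exact frame count may differ slightly because of convolutional padding):

```python
# Rough arithmetic for MERT-v1: 24 kHz input, 75 Hz feature rate, 5-second pre-training context
sample_rate = 24000      # input samples per second
feature_rate = 75        # feature frames produced per second of audio
clip_seconds = 5         # pre-training context length

num_samples = sample_rate * clip_seconds   # 120000 input samples
num_frames = feature_rate * clip_seconds   # about 375 output feature frames
print(num_samples, num_frames)
```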
### Introduction to MERT-v1
Compared to MERT-v0, we introduce multiple new things in the MERT-v1 pre-training:
- Change the pseudo labels to the 8 codebooks from EnCodec, which potentially have higher quality and empower our model to support music generation (see the sketch after this list for how such codes can be extracted).
- MLM prediction with in-batch noise mixture.
- Train with a higher audio frequency (24 kHz).
- Train with more audio data (up to 160 thousand hours).
- More available model sizes: 95M and 330M.
More details will be written in our upcoming paper.
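The original README does not include code for the pseudo-label step. The following is a minimal, hypothetical sketch of extracting discrete codes with the publicly available facebook/encodec_24khz checkpoint through transformers' EncodecModel (at 6 kbps the 24 kHz EnCodec uses 8 codebooks); it is not the exact pipeline used for MERT-v1 pre-training:

```python
import torch
from transformers import AutoProcessor, EncodecModel

# Hypothetical example: discretize 24 kHz audio into EnCodec codes (8 codebooks at 6 kbps)
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
codec = EncodecModel.from_pretrained("facebook/encodec_24khz")

audio = torch.randn(24000 * 5)  # stand-in for a 5-second, 24 kHz waveform
inputs = processor(raw_audio=audio.numpy(), sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    encoded = codec.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=6.0)

# encoded.audio_codes carries the discrete tokens; one axis has size 8 (the codebooks)
print(encoded.audio_codes.shape)
```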
## Technical Details
The m-a-p models differ in technical configuration, such as model size, transformer layer-dimension, feature rate, and sample rate. These configurations affect both performance and the applicable scenarios; for example, features extracted from different transformer layers can perform differently depending on the task, so it is worth probing several layers rather than only the last one.
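As an illustration, here is a minimal sketch that builds on the Basic Usage example above and selects a single layer's features instead of averaging all of them (the layer index 7 is an arbitrary example, not a recommendation):

```python
# Pick the output of one specific transformer layer;
# hidden_states[0] is the embedding output, hidden_states[i] is the i-th transformer layer.
layer_index = 7  # arbitrary example layer; probe several layers for your task
single_layer_features = outputs.hidden_states[layer_index].squeeze()  # [time steps, 768]
single_layer_clip_embedding = single_layer_features.mean(dim=0)       # [768]
print(single_layer_clip_embedding.shape)
```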
## License
The model is licensed under cc-by-nc-4.0.
## Citation

```bibtex
@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```