Introduction to the Music Audio Pre-training (m-a-p) Model Family
This README provides an overview of the Music Audio Pre-training (m-a-p) model series, including their development log, key features, and usage examples. The models are designed for audio classification tasks in the music domain and offer a range of options to suit different needs.
Quick Start
To get started quickly, refer to the following code example:
```python
from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# Load the MERT-v1-330M model and its feature extractor
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)

# Load a demo dataset and check its sampling rate
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

# Resample the audio to the rate expected by the model (24 kHz for MERT-v1)
resample_rate = processor.sampling_rate
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    # cast to float32 so the waveform dtype matches the resampler kernel
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).float())

# Run the model and keep the hidden states of every layer
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# 25 layers of representation: [25 layers, time steps, 1024 dimensions] for MERT-v1-330M
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape)

# For utterance-level tasks, reduce over the time dimension: [25, 1024]
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape)

# Optionally learn a weighted average over the 25 layers: [1024]
aggregator = nn.Conv1d(in_channels=25, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape)
```
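The models were pre-trained on relatively short audio contexts (3 to 30 seconds, per the comparison table below), so long recordings are usually split into fixed-length chunks before feature extraction and the per-chunk representations pooled afterwards. The snippet below is a minimal sketch of that chunking step; the 5-second window and the `chunk_audio` helper are illustrative choices, not part of the released code.

```python
import torch

def chunk_audio(waveform: torch.Tensor, sample_rate: int, chunk_seconds: float = 5.0):
    """Split a 1-D waveform into non-overlapping chunks of `chunk_seconds` each.

    The 5-second default is an illustrative choice; any length within the
    model's pre-training context works, subject to your memory budget.
    """
    chunk_len = int(sample_rate * chunk_seconds)
    n_chunks = waveform.shape[-1] // chunk_len
    # A trailing remainder shorter than one chunk is dropped here; pad it instead
    # if you need full coverage of the recording.
    return [waveform[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# Each chunk can then be passed through `processor` and `model` exactly as above,
# and the resulting per-chunk features averaged (or otherwise pooled).
```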
Features
Development Log
- 02/06/2023: arXiv pre-print and training code released.
- 17/03/2023: Released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
- 14/03/2023: Retrained the MERT-v0 model with an open-source-only music dataset: [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
- 29/12/2022: Released [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), a music understanding model trained with the MLM paradigm that performs better on downstream tasks.
- 29/10/2022: Released [music2vec](https://huggingface.co/m-a-p/music2vec-v1), a pre-trained MIR model trained with the BYOL paradigm.
Model Comparison
| Property | Details |
|----------|---------|
| Model Type | Multiple models, including MERT-v1-330M, MERT-v1-95M, MERT-v0-public, MERT-v0, and music2vec-v1 |
| Pre-train Paradigm | MLM, BYOL |
| Training Data | Ranging from 900 hours to 160K hours |
| Pre-train Context | 3 to 30 seconds |
| Model Size | 95M, 330M |
| Transformer Layer-Dimension | 12-768, 24-1024 |
| Feature Rate | 50 Hz, 75 Hz |
| Sample Rate | 16 kHz, 24 kHz |
| Release Date | 29/10/2022 to 02/06/2023 |
Key Improvements in MERT-v1
- Changed the pseudo labels to 8 codebooks from EnCodec, potentially offering higher quality and enabling music generation.
- MLM prediction with in-batch noise mixture (see the sketch after this list).
- Trained at a higher audio sample rate (24 kHz).
- Trained with more audio data (up to 160K hours).
- More available model sizes: 95M and 330M.
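"MLM prediction with in-batch noise mixture" means that clips are mixed with other audio drawn from the same training batch before the masked-prediction objective is applied. The sketch below only illustrates the general idea; the mixing probability, target SNR, and the `mix_in_batch_noise` name are illustrative assumptions, not the released training code.

```python
import torch

def mix_in_batch_noise(batch: torch.Tensor, mix_prob: float = 0.5, snr_db: float = 5.0) -> torch.Tensor:
    """Illustrative in-batch noise mixture (hypothetical hyperparameters).

    Each clip may be mixed with another randomly chosen clip from the same batch
    at a fixed signal-to-noise ratio. batch: (batch_size, num_samples) float tensor.
    """
    batch_size = batch.shape[0]
    if batch_size < 2:
        return batch

    # Pair every clip with a different clip by rolling the batch by a random offset.
    shift = int(torch.randint(1, batch_size, (1,)))
    noise = torch.roll(batch, shifts=shift, dims=0)

    # Scale the "noise" clips so the mixture hits the target SNR.
    sig_pow = batch.pow(2).mean(dim=-1, keepdim=True)
    noise_pow = noise.pow(2).mean(dim=-1, keepdim=True).clamp_min(1e-8)
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    mixed = batch + scale * noise

    # Only mix a random subset of the batch; leave the other clips unchanged.
    mask = (torch.rand(batch_size, 1, device=batch.device) < mix_prob).to(batch.dtype)
    return mask * mixed + (1 - mask) * batch
```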
Documentation
Model Explanation
The m-a-p models share a similar architecture; the main difference lies in the pre-training paradigm. Here are the technical configurations to note:
- Model Size: the number of parameters loaded into memory. Choose an appropriate size based on your hardware.
- Transformer Layer-Dimension: the number of transformer layers and the corresponding feature dimension output by the model. Different layers may perform differently depending on the task.
- Feature Rate: the number of feature frames the model outputs per second of audio (see the example after this list).
- Sample Rate: the audio sampling rate used for model training; inputs should be resampled to this rate, as in the Quick Start example.
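To make Feature Rate concrete: at 75 Hz (the rate paired with the 24 kHz MERT-v1 models in the comparison table) a clip of N seconds yields roughly 75·N feature frames per layer, and MERT-v1-330M returns 25 hidden states (the convolutional front-end output plus its 24 transformer layers), which is why the Quick Start aggregator uses `in_channels=25`. The helper below is a small illustration under those assumptions, not an official utility.

```python
def expected_frames(duration_seconds: float, feature_rate_hz: int = 75) -> int:
    """Approximate number of feature frames per layer for a clip of the given length.

    75 Hz corresponds to the 24 kHz MERT-v1 models; use 50 for the 16 kHz models
    (figures taken from the comparison table above).
    """
    return int(duration_seconds * feature_rate_hz)

print(expected_frames(5.0))       # ~375 frames per layer for a 5-second clip at 75 Hz
print(expected_frames(5.0, 50))   # ~250 frames at 50 Hz

# Individual layers can also be probed separately, reusing `outputs` from the Quick Start:
# layer_7 = outputs.hidden_states[7].squeeze(0)   # shape [time, 1024]; choose the layer empirically
```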
Usage Examples
Basic Usage
The basic feature-extraction workflow is the one shown in the Quick Start example above: load the MERT-v1-330M checkpoint, resample the input audio to the processor's sampling rate, and extract the hidden states of all 25 layers.
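Because the representations target classification tasks in the music domain, a common downstream pattern is to freeze the MERT backbone and train a small probe on the pooled hidden states. The sketch below is a hypothetical minimal probe, not part of this repository; the class name `MusicTagProbe`, the number of classes, and the label in the usage comment are illustrative assumptions.

```python
import torch
from torch import nn

class MusicTagProbe(nn.Module):
    """Minimal linear probe over frozen MERT features (illustrative only)."""

    def __init__(self, num_layers: int = 25, hidden_dim: int = 1024, num_classes: int = 10):
        super().__init__()
        # Learnable weighted average over the model's layers, as in the Quick Start aggregator.
        self.layer_weights = nn.Conv1d(in_channels=num_layers, out_channels=1, kernel_size=1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, time_reduced_hidden_states: torch.Tensor) -> torch.Tensor:
        # time_reduced_hidden_states: [batch, num_layers, hidden_dim]
        pooled = self.layer_weights(time_reduced_hidden_states).squeeze(1)  # [batch, hidden_dim]
        return self.classifier(pooled)

# Usage with the Quick Start variables (single example, so add a batch dimension):
# probe = MusicTagProbe(num_classes=10)
# logits = probe(time_reduced_hidden_states.unsqueeze(0))
# loss = nn.functional.cross_entropy(logits, torch.tensor([3]))  # hypothetical label
```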
License
This project is licensed under the CC-BY-NC-4.0 license.
Citation
```bibtex
@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```