This is a symbolic audio model based on the Perceiver AR architecture, used primarily for audio generation from a user-defined initial number of latent tokens.
Model Features

- Long-context processing: through a hybrid of self-attention and cross-attention, the model can handle longer contexts (up to 6144 tokens) than pure self-attention decoders.
- Rotary position encoding: rotary embeddings provide relative position encoding, improving the model's handling of positional relationships within a sequence.
- Symbolic audio modeling: designed specifically for modeling and generating symbolic audio data in MIDI format.
Model Capabilities

- Symbolic audio generation
- Music continuation
- MIDI file generation
Use Cases

- Music creation
  - Music continuation: automatically generates subsequent musical content from a user-provided musical prompt, producing stylistically coherent continuations.
  - Music style imitation: learns from MIDI data in a given style and generates music in a similar style, imitating stylistic characteristics of the training data.
- Education & entertainment
  - Music creation assistance: provides inspiration and material for music learners and can generate simple melodies for study and adaptation.
Perceiver AR symbolic audio model
This is a Perceiver AR symbolic audio model with 134M parameters. It is pretrained on the GiantMIDI-Piano dataset and can be used for audio generation.
Quick Start
This Perceiver AR symbolic audio model is pretrained on the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset. To use it, first install the necessary library; you can then generate MIDI or WAV files with PyTorch.
Features
- Extended Context Processing: Perceiver AR can cross-attend to a longer prefix of the input sequence, allowing it to process a much larger context than traditional decoder-only transformers.
- Relative Position Encoding: uses rotary embeddings for relative position encoding.
- Training on MIDI Dataset: pretrained on the GiantMIDI-Piano dataset for symbolic audio generation.
Installation
To use this model, you first need to [install](https://github.com/krasserm/perceiver-io/blob/main/README.md#installation) the perceiver-io library with the audio extension:
pip install perceiver-io[audio]
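To check that the library and its audio extras are available, you can import the main classes used in the examples below (a quick sanity check, assuming the install completed without errors):

# These imports are used in the usage examples that follow
from perceiver.model.audio.symbolic import PerceiverSymbolicAudioModel
from perceiver.data.audio.midi_processor import encode_midi, decode_midi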
Usage Examples
Basic Usage
import torch
from perceiver.model.audio.symbolic import PerceiverSymbolicAudioModel
from perceiver.data.audio.midi_processor import decode_midi, encode_midi
from pretty_midi import PrettyMIDI

repo_id = "krasserm/perceiver-ar-sam-giant-midi"

# Load the pretrained model from the Hugging Face Hub
model = PerceiverSymbolicAudioModel.from_pretrained(repo_id)

# Encode the MIDI prompt into a token sequence with a leading batch dimension
prompt_midi = PrettyMIDI("prompt.mid")
prompt = torch.tensor(encode_midi(prompt_midi)).unsqueeze(0)

# Generate 64 new tokens from a single initial latent token, using nucleus sampling
output = model.generate(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0)

# Decode the generated token sequence back into a PrettyMIDI object
output_midi = decode_midi(output[0].cpu().numpy())
type(output_midi)  # pretty_midi.pretty_midi.PrettyMIDI
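The generated PrettyMIDI object can then be written to disk for playback in a MIDI player or DAW; the filename below is just an example:

output_midi.write("continuation.mid")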
Advanced Usage
Use a symbolic-audio-generation pipeline to generate MIDI output:
from transformers import pipeline
from pretty_midi import PrettyMIDI
from perceiver.model.audio import symbolic  # auto-class registration
repo_id = "krasserm/perceiver-ar-sam-giant-midi"
prompt = PrettyMIDI("prompt.mid")
audio_generator = pipeline("symbolic-audio-generation", model=repo_id)
output = audio_generator(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0)
type(output["generated_audio_midi"])  # pretty_midi.pretty_midi.PrettyMIDI
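As in the basic example, the returned PrettyMIDI object can be saved to a file (the output filename is arbitrary):

output["generated_audio_midi"].write("generated_continuation.mid")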
Generate WAV output by rendering the MIDI symbols using fluidsynth (Note: fluidsynth must be installed in order for the following example to work):
from transformers import pipeline
from pretty_midi import PrettyMIDI
from perceiver.model.audio import symbolic  # auto-class registration
repo_id = "krasserm/perceiver-ar-sam-giant-midi"
prompt = PrettyMIDI("prompt.mid")
audio_generator = pipeline("symbolic-audio-generation", model=repo_id)
output = audio_generator(prompt, max_new_tokens=64, num_latents=1, do_sample=True, top_p=0.95, temperature=1.0, render=True)
with open("generated_audio.wav", "wb") as f:
    f.write(output["generated_audio_wav"])
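For a quick sanity check of the rendered audio, the returned WAV bytes can be inspected with Python's standard wave module (an illustrative sketch; it assumes the render succeeded and produced valid WAV data):

import io
import wave

# Read the WAV bytes returned by the pipeline and print basic properties
with wave.open(io.BytesIO(output["generated_audio_wav"]), "rb") as w:
    duration = w.getnframes() / w.getframerate()
    print(f"{duration:.2f} s, {w.getnchannels()} channel(s), {w.getframerate()} Hz")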
Documentation
Model description
Perceiver AR is a simple extension of a plain decoder-only transformer such as GPT-2. A core building block of both is the decoder layer, consisting of a self-attention layer followed by a position-wise MLP. Self-attention uses a causal attention mask.
Perceiver AR additionally cross-attends to a longer prefix of the input sequence in its first attention layer, which is a hybrid self- and cross-attention layer. Self-attention is over the last n positions of the input sequence with a causal attention mask, while cross-attention is from the last n positions to the first m positions. The length of the input sequence is m + n. This allows Perceiver AR to process a much larger context than decoder-only transformers, which are based on self-attention only.
Fig. 1. Attention in Perceiver AR with m = 8 prefix tokens and n = 3 latent tokens.
The output of the hybrid attention layer is n latent arrays corresponding to the last n tokens of the input sequence. These are further processed by a stack of L - 1 decoder layers, where L is the total number of attention layers. A final layer (not shown in Fig. 1) predicts the target token for each latent position. The weights of the final layer are shared with the input embedding layer. Except for the initial cross-attention to the prefix sequence, a Perceiver AR is architecturally identical to a decoder-only transformer.
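As a rough illustration of this attention pattern (a minimal sketch, not the perceiver-io implementation), the combined prefix cross-attention and causal latent self-attention can be expressed as a boolean mask over all key positions for each of the n latent query positions:

import torch

def perceiver_ar_attention_mask(m: int, n: int) -> torch.Tensor:
    # Rows: the n latent (query) positions, i.e. the last n tokens of the sequence.
    # Columns: all m + n input (key) positions.
    mask = torch.zeros(n, m + n, dtype=torch.bool)
    mask[:, :m] = True  # every latent position attends to the full m-token prefix
    mask[:, m:] = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal over latents
    return mask

# Same configuration as Fig. 1: m = 8 prefix tokens, n = 3 latent tokens
print(perceiver_ar_attention_mask(m=8, n=3).int())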
Model training
The model was [trained](https://github.com/krasserm/perceiver-io/blob/main/docs/training-examples.md#giantmidi-piano) with the task of symbolic audio modeling on the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset for 27 epochs (157M tokens). This dataset consists of MIDI files, tokenized using the approach from the Perceiver AR paper, which is described in detail in Section A.2 of Huang et al. (2019).
All hyperparameters are summarized in the [training script](https://github.com/krasserm/perceiver-io/blob/main/examples/training/sam/giantmidi/train.sh). The context length was set to 6144 tokens with 2048 latent positions, resulting in a maximum prefix length of 4096. The actual prefix length per example was randomly chosen between 0 and 4096. Training was done with PyTorch Lightning, and the resulting checkpoint was converted to this 🤗 model with a library-specific [conversion utility](#checkpoint-conversion).
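The relationship between these numbers can be sketched as follows (an illustrative sketch only; the exact per-example prefix sampling is defined in the training script linked above):

import random

context_len = 6144                        # total input sequence length
num_latents = 2048                        # latent positions (causal self-attention)
max_prefix = context_len - num_latents    # 4096 positions available as cross-attended prefix

# Per training example, the effective prefix length was chosen at random between 0 and 4096
prefix_len = random.randint(0, max_prefix)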
Intended use and limitations
This model can be used for audio generation with a user-defined initial number of latent tokens. It mainly exists for demonstration purposes, showing how to train Perceiver AR models with the [perceiver-io library](https://github.com/krasserm/perceiver-io). To improve the quality of the generated audio samples, a much larger dataset than [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) is required for training.
Checkpoint conversion
The krasserm/perceiver-ar-sam-giant-midi model has been created from a training checkpoint with:
from perceiver.model.audio.symbolic import convert_checkpoint

convert_checkpoint(
    save_dir="krasserm/perceiver-ar-sam-giant-midi",
    ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/sam/version_1/checkpoints/epoch=027-val_loss=1.944.ckpt",
    push_to_hub=True,
)
Technical Details
- Model Architecture: based on Perceiver AR, an extension of decoder-only transformers.
- Position Encoding: rotary embeddings for relative position encoding.
- Training Dataset: the GiantMIDI-Piano dataset, tokenized for symbolic audio modeling.
- Training Epochs: 27 epochs (157M tokens).
- Context and Latent Length: context length of 6144 tokens with 2048 latent positions; maximum prefix length of 4096.
License
This project is licensed under the Apache-2.0 license.
Audio samples
The following (hand-picked) audio samples were generated using various prompts from the validation subset of the [GiantMIDI-Piano](https://github.com/bytedance/GiantMIDI-Piano) dataset. The input prompts are not included in the audio output.
| Audio sample | Top-K | Top-p | Temperature | Prefix length | Latents |
|--------------|-------|-------|-------------|---------------|---------|
| Sample 1     | -     | 0.95  | 0.95        | 4096          | 1       |
| Sample 2     | -     | 0.95  | 1.0         | 4096          | 64      |
| Sample 3     | -     | 0.95  | 1.0         | 1024          | 1       |
| Sample 4     | 15    | -     | 1.0         | 4096          | 16      |
| Sample 5     | -     | 0.95  | 1.0         | 4096          | 1       |
Citation
@inproceedings{hawthorne2022general,
title={General-purpose, long-context autoregressive modeling with perceiver ar},
author={Hawthorne, Curtis and Jaegle, Andrew and Cangea, C{\u{a}}t{\u{a}}lina and Borgeaud, Sebastian and Nash, Charlie and Malinowski, Mateusz and Dieleman, Sander and Vinyals, Oriol and Botvinick, Matthew and Simon, Ian and others},
booktitle={International Conference on Machine Learning},
pages={8535--8558},
year={2022},
organization={PMLR}
}