🎵 Riffusion
Riffusion is an app for real-time music generation with Stable Diffusion. You can learn more about it at https://www.riffusion.com/about and try it at https://www.riffusion.com/.
- Code: https://github.com/riffusion/riffusion
- Web app: https://github.com/hmartiro/riffusion-app
- Model checkpoint: https://huggingface.co/riffusion/riffusion-model-v1
- Discord: https://discord.gg/yu6SRwvX4v
This repository contains the model files, including a diffusers-formatted model, a compiled checkpoint file, a traced UNet for improved inference speed, and a seed image library for use with riffusion-app.
✨ Features
- Real-time music generation using Stable Diffusion.
- Generates spectrogram images from text prompts, which can be converted into audio clips.
📦 Installation
No installation steps are provided in the original README.
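If you only want to run the checkpoint through the 🤗 diffusers library, a reasonable setup (an assumption, not an official procedure) is `pip install diffusers transformers accelerate torch`; see the riffusion GitHub repository linked above for the full audio pipeline's own setup.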
💻 Usage Examples
No code examples are provided in the original README.
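As a minimal sketch (not from the original README), the diffusers-formatted checkpoint can be loaded with the standard StableDiffusionPipeline; the prompt text, output filename, and fp16/CUDA settings below are illustrative assumptions:

```python
# Minimal sketch: generating a spectrogram image from a text prompt with the
# diffusers-formatted Riffusion checkpoint. Assumes `diffusers` and `torch`
# are installed and a CUDA GPU is available.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The prompt describes the music; the output is a spectrogram image, which a
# separate inverse-spectrogram step turns into audio.
image = pipe(prompt="funk bassline with a jazzy saxophone solo").images[0]
image.save("spectrogram.png")
```

Note that the pipeline produces a spectrogram image, not audio; converting it into a clip requires an inverse-spectrogram step like the one sketched in the Technical Details section below.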
📚 Documentation
Riffusion v1 Model
Riffusion is a latent text-to-image diffusion model. It can generate spectrogram images from any text input, and these spectrograms can be converted into audio clips.
The model was created by Seth Forsgren and Hayk Martiros as a hobby project. You can either use the Riffusion model directly or try the Riffusion web app.
The Riffusion model was created by fine-tuning the Stable-Diffusion-v1-5 checkpoint. You can read about Stable Diffusion in 🤗's Stable Diffusion blog.
Model Details
| Property | Details |
|----------|---------|
| Developed by | Seth Forsgren, Hayk Martiros |
| Model Type | Diffusion-based text-to-image generation model |
| Language(s) | English |
| License | [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that BigScience and the RAIL Initiative are jointly carrying in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based. |
| Model Description | This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (CLIP ViT-L/14) as suggested in the Imagen paper. |
Direct Use
The model is for research purposes only. Possible research areas and tasks include:
- Generation of artworks, audio, and use in creative processes.
- Applications in educational or creative tools.
- Research on generative models.
Datasets
The original Stable Diffusion v1.5 was trained on the LAION-5B dataset using the CLIP text encoder. It provides a great starting point with an in-depth understanding of language, including musical concepts. The team at LAION also compiled a great audio dataset from many general, speech, and music sources, which is recommended at [LAION-AI/audio-dataset](https://github.com/LAION-AI/audio-dataset/blob/main/data_collection/README.md).
Fine Tuning
Check out the diffusers training examples from Hugging Face. Fine-tuning requires a dataset of spectrogram images of short audio clips, with associated text describing them. Note that the CLIP encoder can understand and connect many words even if they never appear in the dataset. It is also possible to use a DreamBooth method to get custom styles. A dataset-preparation sketch follows below.
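As a hypothetical illustration of the dataset layout such a fine-tune might consume, the sketch below writes caption metadata in the Hugging Face `imagefolder` format; the directory name and captions are invented for the example:

```python
# Hypothetical sketch: pairing spectrogram PNGs with text captions in the
# Hugging Face "imagefolder" layout (images plus a metadata.jsonl file).
import json
from pathlib import Path

data_dir = Path("spectrogram_dataset/train")  # illustrative path
data_dir.mkdir(parents=True, exist_ok=True)

# Captions describing each spectrogram's audio clip (placeholders).
captions = {
    "clip_000.png": "upbeat acoustic folk with hand claps",
    "clip_001.png": "slow ambient synth pad, reverb-heavy",
}

with open(data_dir / "metadata.jsonl", "w") as f:
    for file_name, text in captions.items():
        f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")

# This layout can then be loaded for the diffusers text-to-image training
# examples, e.g. datasets.load_dataset("imagefolder", data_dir=...).
```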
License Information
⚠️ Important Note
This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage. The CreativeML OpenRAIL License specifies:
- You can't use the model to deliberately produce or share illegal or harmful outputs or content.
- Riffusion claims no rights on the outputs you generate; you are free to use them, and you are accountable for their use, which must not go against the provisions set in the license.
- You may redistribute the weights and use the model commercially and/or as a service. If you do, please be aware you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M with all your users (please read the license entirely and carefully).
Please read the full license carefully here: https://huggingface.co/spaces/CompVis/stable-diffusion-license
🔧 Technical Details
The model is a latent text-to-image diffusion model. It is based on fine-tuning the Stable-Diffusion-v1-5 checkpoint. It uses a fixed, pretrained text encoder (CLIP ViT-L/14) as suggested in the Imagen paper. The original Stable Diffusion v1.5 was trained on the LAION-5B dataset with the CLIP text encoder. Fine-tuning requires a dataset of spectrogram images of short audio clips and associated text descriptions.
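The conversion from spectrogram image back to audio is handled by Riffusion's own audio utilities. Purely as an illustrative sketch under assumed parameters (the STFT size, mel bin count, pixel-to-power mapping, and file names below are all placeholders, not Riffusion's actual settings), phase can be reconstructed with Griffin-Lim via torchaudio:

```python
# Illustrative sketch: inverting a mel spectrogram image into audio with
# Griffin-Lim phase reconstruction. All parameter values are assumptions;
# the real mapping lives in the riffusion repository's audio utilities.
import numpy as np
import torch
import torchaudio
from PIL import Image

n_fft, n_mels, sample_rate = 2048, 512, 44100  # placeholder settings

# Load the generated image; assume image height equals n_mels and pixel
# intensity is proportional to mel power (a simplification).
img = np.asarray(Image.open("spectrogram.png").convert("L"), dtype=np.float32)
mel_power = torch.from_numpy(img / 255.0).flip(0)  # low frequencies at bottom

# Map mel bins back to linear-frequency bins, then estimate phase.
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, n_iter=32)

waveform = griffin_lim(inverse_mel(mel_power))
torchaudio.save("clip.wav", waveform.unsqueeze(0), sample_rate)
```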
📄 License
This model is released under the [CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license).
📚 Citation
If you build on this work, please cite it as follows:
```bibtex
@article{Forsgren_Martiros_2022,
  author = {Forsgren, Seth* and Martiros, Hayk*},
  title  = {{Riffusion - Stable diffusion for real-time music generation}},
  url    = {https://riffusion.com/about},
  year   = {2022}
}
```