🚀 InspireMusic
InspireMusic is a unified framework dedicated to music, song, and audio generation. It empowers users to create high-quality music through AI-based generative models, offering both training and inference capabilities and supporting various music-related tasks.
Please support our community by starring the repository.
Highlights | Introduction | Installation | Quick Start | Tutorial | Models | Contact
✨ Features
InspireMusic focuses on music generation, song generation, and audio generation.
- A unified toolkit designed for music, song, and audio generation.
- Music generation tasks with high audio quality.
- Long-form music generation.
📚 Documentation
⚠️ Important Note
This repo contains the algorithm infrastructure and some simple examples. Currently, only English text prompts are supported.
💡 Usage Tip
To preview the performance, please refer to the InspireMusic Demo Page.
InspireMusic is a unified music, song, and audio generation framework built on audio tokenization combined with an autoregressive transformer and a flow-matching based model. The original motivation for this toolkit is to empower ordinary users and researchers to craft music, songs, and audio and to explore new soundscapes. The toolkit provides both training and inference code for AI-based generative models that create high-quality music. Featuring a unified framework, InspireMusic incorporates audio tokenizers with autoregressive transformer and super-resolution flow-matching modeling, allowing controllable generation of music, song, and audio from both text and audio prompts. The toolkit currently supports music generation; song generation and audio generation will be supported in the future.
Figure 1: An overview of the InspireMusic framework. We introduce InspireMusic, a unified framework for music, song, and audio generation capable of producing high-quality, long-form audio. InspireMusic consists of three key components. Audio tokenizers convert the raw audio waveform into discrete audio tokens that can be efficiently processed and trained by the autoregressive transformer model; audio waveforms at a lower sampling rate are converted into discrete tokens via a high-bitrate compression audio tokenizer[1]. The autoregressive transformer model, built on Qwen2.5[2] as the backbone, is trained with a next-token prediction approach on both text and audio tokens, enabling it to generate coherent and contextually relevant token sequences. The super-resolution flow-matching model, based on flow modeling, maps the generated tokens to latent features with high-resolution, fine-grained acoustic details[3] obtained from audio at a higher sampling rate, so that acoustic information flows through the models with high fidelity. A vocoder then generates the final audio waveform from these enhanced latent features. InspireMusic supports a range of tasks including text-to-music, music continuation, music reconstruction, and music super-resolution.
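To make the data flow in Figure 1 concrete, below is a minimal, illustrative pseudocode sketch of the generation pipeline. The component names (audio_tokenizer, ar_transformer, sr_flow_matching, vocoder) are hypothetical placeholders for the stages described above, not the actual InspireMusic API.
# Illustrative pseudocode only; component names are hypothetical placeholders.
def generate_music(text_prompt, audio_prompt=None):
    # 1) Encode the optional audio prompt into discrete tokens (lower sampling rate).
    audio_tokens = audio_tokenizer.encode(audio_prompt) if audio_prompt is not None else []
    # 2) The autoregressive transformer (Qwen2.5 backbone) generates a token sequence
    #    via next-token prediction, conditioned on the text and audio prompt tokens.
    generated_tokens = ar_transformer.generate(text_prompt, audio_tokens)
    # 3) The super-resolution flow-matching model maps tokens to latent features
    #    carrying fine-grained acoustic detail from higher-sampling-rate audio.
    latent_features = sr_flow_matching(generated_tokens)
    # 4) A vocoder renders the final waveform from the enhanced latent features.
    return vocoder(latent_features)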
📦 Installation
Clone
- Clone the repo
git clone --recursive https://github.com/FunAudioLLM/InspireMusic.git
# If you failed to clone the submodule due to network failures, please run the following command until it succeeds
cd InspireMusic
git submodule update --init --recursive
# or you can download the third_party repo Matcha-TTS manually
cd third_party && git clone https://github.com/shivammehta25/Matcha-TTS.git
Install from Source
InspireMusic requires Python >= 3.8, PyTorch >= 2.0.1, flash-attn 2.6.2/2.6.3, and CUDA >= 11.2. You can install the dependencies with the following commands:
- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- Create Conda env:
conda create -n inspiremusic python=3.8
conda activate inspiremusic
cd InspireMusic
# pynini is required by WeTextProcessing, use conda to install it as it can be executed on all platforms.
conda install -y -c conda-forge pynini==2.1.5
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# install flash-attn to speed up training
pip install flash-attn --no-build-isolation
- Install the package from within the repo directory:
cd InspireMusic
# Run the following command to install the package
python setup.py install
pip install flash-attn --no-build-isolation
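After installation, you can quickly confirm that the environment satisfies the requirements above with a short Python check (a minimal sketch; the expected versions simply mirror the requirements listed in this section):
import importlib.util
import torch

# Check the PyTorch and CUDA versions against the stated requirements.
print("PyTorch:", torch.__version__)            # expected >= 2.0.1
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)      # expected >= 11.2
# flash-attn is optional but recommended to speed up training.
print("flash-attn installed:", importlib.util.find_spec("flash_attn") is not None)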
We also recommend having sox or ffmpeg installed, either through your system or Anaconda:
# Install sox
# ubuntu
sudo apt-get install sox libsox-dev
# centos
sudo yum install sox sox-devel
# Install ffmpeg
# ubuntu
sudo apt-get install ffmpeg
# centos
sudo yum install ffmpeg
Use Docker
Run the following command to build a Docker image from the provided Dockerfile.
docker build -t inspiremusic .
Run the following command to start the docker container in interactive mode.
docker run -ti --gpus all -v .:/workspace/InspireMusic inspiremusic
Use Docker Compose
Run the following command to build a docker compose environment and docker image from the docker-compose.yml file.
docker compose up -d --build
Run the following command to attach to the docker container in interactive mode.
docker exec -ti inspire-music bash
🚀 Quick Start
Quick Example Inference Script for Music Generation
cd InspireMusic
mkdir -p pretrained_models
# Download models
# ModelScope
git clone https://www.modelscope.cn/iic/InspireMusic-1.5B-Long.git pretrained_models/InspireMusic-1.5B-Long
# HuggingFace
git clone https://huggingface.co/FunAudioLLM/InspireMusic-1.5B-Long.git pretrained_models/InspireMusic-1.5B-Long
cd examples/music_generation
# run a quick inference example
sh infer_1.5b_long.sh
Quick Start Running Script for Music Generation Task
cd InspireMusic/examples/music_generation/
sh run.sh
One-line Inference
Text-to-music Task
One-line shell script for the text-to-music task.
cd examples/music_generation
# with flow matching, use a one-line command for a quick try
python -m inspiremusic.cli.inference
# customize the configuration with a one-line command like the following
python -m inspiremusic.cli.inference --task text-to-music -m "InspireMusic-1.5B-Long" -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." -c intro -s 0.0 -e 30.0 -r "exp/inspiremusic" -o output -f wav
# without flow matching, use a one-line command for a quick try
python -m inspiremusic.cli.inference --task text-to-music -g 0 -t "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance." --fast True
Alternatively, you can run the inference with just a few lines of Python code.
from inspiremusic.cli.inference import InspireMusic
from inspiremusic.cli.inference import env_variables
if __name__ == "__main__":
env_variables()
model = InspireMusic(model_name = "InspireMusic-Base")
model.inference("text-to-music", "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance.")
Music Continuation Task
One-line shell script for the music continuation task.
cd examples/music_generation
# with flow matching
python -m inspiremusic.cli.inference --task continuation -g 0 -a audio_prompt.wav
# without flow matching
python -m inspiremusic.cli.inference --task continuation -g 0 -a audio_prompt.wav --fast True
Alternatively, you can run the inference with just a few lines of Python code.
from inspiremusic.cli.inference import InspireMusic
from inspiremusic.cli.inference import env_variables
if __name__ == "__main__":
env_variables()
model = InspireMusic(model_name = "InspireMusic-Base")
# just use audio prompt
model.inference("continuation", None, "audio_prompt.wav")
# use both text prompt and audio prompt
model.inference("continuation", "Continue to generate jazz music.", "audio_prompt.wav")
💻 Usage Examples
Basic Usage
from inspiremusic.cli.inference import InspireMusic
from inspiremusic.cli.inference import env_variables
if __name__ == "__main__":
env_variables()
model = InspireMusic(model_name = "InspireMusic-Base")
model.inference("text-to-music", "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance.")
Advanced Usage
# Advanced scenario: customize more parameters for the text-to-music task
from inspiremusic.cli.inference import InspireMusic
from inspiremusic.cli.inference import env_variables
if __name__ == "__main__":
env_variables()
model = InspireMusic(model_name = "InspireMusic-1.5B-Long")
model.inference("text-to-music", "Experience soothing and sensual instrumental jazz with a touch of Bossa Nova, perfect for a relaxing restaurant or spa ambiance.", gpu=0, chorus="intro", start_time=0.0, end_time=30.0, result_dir="exp/inspiremusic", output_name="output", file_format="wav")
🔧 Technical Details
Training
Train LLM Model
torchrun --nnodes=1 --nproc_per_node=8 \
--rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
inspiremusic/bin/train.py \
--train_engine "torch_ddp" \
--config conf/inspiremusic.yaml \
--train_data data/train.data.list \
--cv_data data/dev.data.list \
--model llm \
--model_dir `pwd`/exp/music_generation/llm/ \
--tensorboard_dir `pwd`/tensorboard/music_generation/llm/ \
--ddp.dist_backend "nccl" \
--num_workers 8 \
--prefetch 100 \
--pin_memory \
--deepspeed_config ./conf/ds_stage2.json \
--deepspeed.save_states model+optimizer \
--fp16
Train Flow Matching Model
torchrun --nnodes=1 --nproc_per_node=8 \
--rdzv_id=1024 --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
inspiremusic/bin/train.py \
--train_engine "torch_ddp" \
--config conf/inspiremusic.yaml \
--train_data data/train.data.list \
--cv_data data/dev.data.list \
--model flow \
--model_dir `pwd`/exp/music_generation/flow/ \
--tensorboard_dir `pwd`/tensorboard/music_generation/flow/ \
--ddp.dist_backend "nccl" \
--num_workers 8 \
--prefetch 100 \
--pin_memory \
--deepspeed_config ./conf/ds_stage2.json \
--deepspeed.save_states model+optimizer
Inference
Quick Inference Script
cd InspireMusic/examples/music_generation/
sh infer.sh
Normal Mode Inference
pretrained_model_dir="pretrained_models/InspireMusic/"
for task in 'text-to-music' 'continuation'; do
python inspiremusic/bin/inference.py --task $task \
--gpu 0 \
--config conf/inspiremusic.yaml \
--prompt_data data/test/parquet/data.list \
--flow_model $pretrained_model_dir/flow.pt \
--llm_model $pretrained_model_dir/llm.pt \
--music_tokenizer $pretrained_model_dir/music_tokenizer \
--wavtokenizer $pretrained_model_dir/wavtokenizer \
--result_dir `pwd`/exp/inspiremusic/${task}_test \
--chorus verse
done
Fast Mode Inference
pretrained_model_dir="pretrained_models/InspireMusic/"
for task in 'text-to-music' 'continuation'; do
python inspiremusic/bin/inference.py --task $task \
--gpu 0 \
--config conf/inspiremusic.yaml \
--prompt_data data/test/parquet/data.list \
--flow_model $pretrained_model_dir/flow.pt \
--llm_model $pretrained_model_dir/llm.pt \
--music_tokenizer $pretrained_model_dir/music_tokenizer \
--wavtokenizer $pretrained_model_dir/wavtokenizer \
--result_dir `pwd`/exp/inspiremusic/${task}_test \
--chorus verse \
--fast
done
Hardware Requirements
In previous tests on an H800 GPU, InspireMusic generated 30 seconds of audio with a real-time factor (RTF) of around 1.6~1.8. For normal mode, we recommend hardware with at least 24GB of GPU memory for a better experience; for fast mode, at least 12GB of GPU memory.
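To check whether your GPU meets these recommendations before running inference, you can use a quick PyTorch sketch (the 24GB and 12GB thresholds come from the recommendations above):
import torch

if torch.cuda.is_available():
    # Total memory of the first GPU, in GiB.
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU memory: {total_gb:.1f} GB")
    print("Meets normal-mode recommendation (>= 24GB):", total_gb >= 24)
    print("Meets fast-mode recommendation (>= 12GB):", total_gb >= 12)
else:
    print("No CUDA device detected.")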
📦 Model Zoo
Download Models
# Use git to download the models; please make sure git-lfs is installed.
mkdir -p pretrained_models
git clone https://www.modelscope.cn/iic/InspireMusic.git pretrained_models/InspireMusic
Available Models
Model | Details
---|---
InspireMusic-Base-24kHz |
InspireMusic-Base |
InspireMusic-1.5B-24kHz |
InspireMusic-1.5B |
InspireMusic-1.5B-Long |
InspireSong-1.5B |
InspireAudio-1.5B |
Wavtokenizer[1] (75Hz) |
Music_tokenizer (75Hz) |
Music_tokenizer (150Hz) |





