TANGO: Text to Audio using iNstruction-Guided diffusiOn
TANGO is a latent diffusion model for text-to-audio generation. It can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. TANGO outperforms current state-of-the-art audio-generation models on both objective and subjective metrics. We are releasing our model, training and inference code, and pre-trained checkpoints for the research community.
We recently released Tango 2. Access it here.
We are releasing Tango-Full, which was pre-trained on TangoPromptBank.
Quick Start
Download the model and generate audio
Download the TANGO model and generate audio from a text prompt:
import IPython
import soundfile as sf
from tango import Tango
tango = Tango("declare-lab/tango-full-ft-audiocaps")
prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
[Audio sample: "An audience cheering and clapping"]
The model will be automatically downloaded and saved in cache. Subsequent runs will load the model directly from cache.
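The model identifier is a Hugging Face Hub repo id, so the download location can typically be redirected with the standard Hub cache variables. The following is a minimal sketch assuming the checkpoint is fetched through the Hugging Face Hub; the cache path is a placeholder for your own directory.

import os

# Optional: point the Hugging Face cache at a larger disk *before* loading TANGO.
# Assumes the checkpoint is downloaded via the Hugging Face Hub.
os.environ["HF_HOME"] = "/path/to/large/disk/hf_cache"  # placeholder path

from tango import Tango

tango = Tango("declare-lab/tango-full-ft-audiocaps")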
Adjust the number of steps
The generate function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps for better-quality audio, at the cost of increased run time.
prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
[Audio sample: "Rolling thunder with lightning strikes"]
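To gauge the speed/quality trade-off on your own hardware, a quick timing loop such as the sketch below works with the same generate call shown above; only the standard-library time module is added.

import time

prompt = "Rolling thunder with lightning strikes"
for steps in (100, 200):
    start = time.time()
    audio = tango.generate(prompt, steps=steps)
    print(f"{steps} steps took {time.time() - start:.1f} s")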
Generate multiple audio samples
Use the generate_for_batch function to generate multiple audio samples for a batch of text prompts:
prompts = [
"A car engine revving",
"A dog barks and rustles with some clicking",
"Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)
This will generate two samples for each of the three text prompts.
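To write the batch output to disk, iterate over the returned samples. This is a minimal sketch assuming generate_for_batch returns one list of waveforms per prompt (two per prompt here, matching samples=2); adjust the indexing if the return structure differs.

import soundfile as sf

for prompt, samples in zip(prompts, audios):
    for i, audio in enumerate(samples):
        # e.g. "A car engine revving_0.wav", "A car engine revving_1.wav", ...
        sf.write(f"{prompt}_{i}.wav", audio, samplerate=16000)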
Installation
Please follow the instructions in the repository for installation, usage, and experiments. Our code is released at https://github.com/declare-lab/tango
License
This project is licensed under the CC BY-NC-SA 4.0 license.
| Property | Details |
| --- | --- |
| Model Type | Latent diffusion model |
| Training Data | declare-lab/TangoPromptBank |