TANGO: Text to Audio using iNstruction-Guided diffusiOn
TANGO is a latent diffusion model designed for text-to-audio generation. It can generate realistic audio, including human voices, animal sounds, natural and artificial noises, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. Our model outperforms current state-of-the-art audio generation models in both objective and subjective metrics. We are releasing our model, training and inference code, and pre-trained checkpoints for the research community.
We are releasing Tango-Full-FT-Audiocaps, which was first pre-trained on TangoPromptBank, a collection of diverse text-audio pairs. Subsequently, we fine-tuned this checkpoint on AudioCaps. This checkpoint achieved state-of-the-art results for text-to-audio generation on AudioCaps.
Quick Start
Download and Generate Audio
Download the TANGO model and generate audio from a text prompt:
```python
import IPython
import soundfile as sf
from tango import Tango

# Downloads the model from the Hugging Face Hub on first use
tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
```
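Note that using the raw prompt as a filename, as above, can fail for prompts containing characters such as `/`. A small helper like the following (our own sketch, not part of the TANGO API) keeps output filenames filesystem-safe:

```python
import re

def prompt_to_filename(prompt: str, ext: str = "wav") -> str:
    """Turn a free-text prompt into a filesystem-safe filename (illustrative helper)."""
    # Keep letters, digits, and spaces; drop everything else,
    # then collapse whitespace runs into single underscores.
    safe = re.sub(r"[^A-Za-z0-9 ]+", "", prompt).strip()
    safe = re.sub(r"\s+", "_", safe)
    return f"{safe or 'audio'}.{ext}"

# Example: sf.write(prompt_to_filename(prompt), audio, samplerate=16000)
print(prompt_to_filename("An audience cheering and clapping"))
```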
The model will be downloaded automatically and cached locally. Subsequent runs will load it directly from the cache.
Adjust Generation Steps
The `generate` function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps for better audio quality, although this increases the run-time.
```python
prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
```
Generate Audio for a Batch of Prompts
Use the `generate_for_batch` function to generate multiple audio samples for a batch of text prompts:
```python
prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling",
]
audios = tango.generate_for_batch(prompts, samples=2)
```
This will generate two samples for each of the three text prompts.
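To save every generated sample under a distinct filename, you can flatten the batch output into (filename, waveform) pairs. The sketch below assumes (without confirmation from the API docs) that `generate_for_batch` returns one list of waveforms per prompt; silent placeholder waveforms stand in for the model outputs so the example is self-contained:

```python
# Stand-ins for the return value of `tango.generate_for_batch(prompts, samples=2)`;
# we assume (unconfirmed) one list of waveforms per prompt.
prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling",
]
audios = [[[0.0] * 16000 for _ in range(2)] for _ in prompts]

# Pair each waveform with a distinct output filename for later saving.
outputs = [
    (f"{prompt} (sample {i}).wav", waveform)
    for prompt, samples in zip(prompts, audios)
    for i, waveform in enumerate(samples)
]
print(len(outputs))  # 3 prompts x 2 samples = 6 files
```

With the real model, replace the placeholder `audios` with the value returned by `generate_for_batch` and write each pair to disk with `sf.write(filename, waveform, samplerate=16000)`.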
Features
- Diverse Audio Generation: TANGO can generate a wide range of realistic audio from text prompts, including human voices, animal sounds, natural and artificial noises, and sound effects.
- State-of-the-Art Performance: It outperforms current state-of-the-art audio generation models on both objective and subjective metrics.
- Model and Code Release: We release the model, the training and inference code, and the pre-trained checkpoints for the research community.
Installation
Please follow the instructions in the repository for installation: https://github.com/declare-lab/tango
Documentation
Code
Our code is released here: https://github.com/declare-lab/tango
We uploaded several TANGO generated samples here: https://tango-web.github.io/
Limitations
TANGO is trained on the small AudioCaps dataset, so it may not generate good audio samples for concepts it has not seen in training (e.g., singing). For the same reason, TANGO is not always able to finely control its generations through the textual prompt. For example, the generations from TANGO for the prompts "Chopping tomatoes on a wooden table" and "Chopping potatoes on a metal table" are very similar. "Chopping vegetables on a table" also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.
We are training another version of TANGO on larger datasets to enhance its generalization, compositional, and controllable generation ability.
License
The model is released under the CC BY-NC-SA 4.0 license.