TANGO: Text to Audio using iNstruction-Guided diffusiOn
TANGO is a latent diffusion model designed for text-to-audio generation. It can generate realistic audio, including human voices, animal sounds, natural and artificial noises, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. Our model outperforms current state-of-the-art audio generation models in both objective and subjective metrics. We are releasing our model, training and inference code, and pre-trained checkpoints for the research community.
We are releasing Tango-Full-FT-Audiocaps, which was first pre-trained on TangoPromptBank, a collection of diverse text-audio pairs. Subsequently, we fine-tuned this checkpoint on AudioCaps. This checkpoint achieved state-of-the-art results for text-to-audio generation on AudioCaps.
Quick Start
Download and Generate Audio
Download the TANGO model and generate audio from a text prompt:
```python
import IPython
import soundfile as sf
from tango import Tango

# Downloads the model from the Hugging Face Hub on first use
tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
```
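Note that using the raw prompt as a filename, as above, can fail for prompts containing characters such as `/`. A small helper like the following (our own sketch, not part of the TANGO API) keeps output filenames filesystem-safe:

```python
import re

def prompt_to_filename(prompt: str, ext: str = "wav") -> str:
    """Turn a free-text prompt into a filesystem-safe filename (illustrative helper)."""
    # Keep letters, digits, and spaces; drop everything else,
    # then collapse whitespace runs into single underscores.
    safe = re.sub(r"[^A-Za-z0-9 ]+", "", prompt).strip()
    safe = re.sub(r"\s+", "_", safe)
    return f"{safe or 'audio'}.{ext}"

# Example: sf.write(prompt_to_filename(prompt), audio, samplerate=16000)
print(prompt_to_filename("An audience cheering and clapping"))
```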
The model will be downloaded automatically and cached locally. Subsequent runs will load it directly from the cache.
Adjust Generation Steps
The `generate` function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps for better audio quality, although this increases the run-time.
```python
prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
```
Generate Audio for a Batch of Prompts
Use the `generate_for_batch` function to generate multiple audio samples for a batch of text prompts:
```python
prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling",
]
audios = tango.generate_for_batch(prompts, samples=2)
```
This will generate two samples for each of the three text prompts.
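To save every generated sample under a distinct filename, you can flatten the batch output into (filename, waveform) pairs. The sketch below assumes (without confirmation from the API docs) that `generate_for_batch` returns one list of waveforms per prompt; silent placeholder waveforms stand in for the model outputs so the example is self-contained:

```python
# Stand-ins for the return value of `tango.generate_for_batch(prompts, samples=2)`;
# we assume (unconfirmed) one list of waveforms per prompt.
prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling",
]
audios = [[[0.0] * 16000 for _ in range(2)] for _ in prompts]

# Pair each waveform with a distinct output filename for later saving.
outputs = [
    (f"{prompt} (sample {i}).wav", waveform)
    for prompt, samples in zip(prompts, audios)
    for i, waveform in enumerate(samples)
]
print(len(outputs))  # 3 prompts x 2 samples = 6 files
```

With the real model, replace the placeholder `audios` with the value returned by `generate_for_batch` and write each pair to disk with `sf.write(filename, waveform, samplerate=16000)`.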
Features
- Diverse Audio Generation: TANGO can generate a wide range of realistic audio from text prompts, including human voices, animal sounds, natural and artificial noises, and sound effects.
- State-of-the-Art Performance: It outperforms current state-of-the-art audio generation models on both objective and subjective metrics.
- Model and Code Release: We release the model, the training and inference code, and the pre-trained checkpoints for the research community.
Installation
Please follow the instructions in the repository for installation: https://github.com/declare-lab/tango
Documentation
Code
Our code is released here: https://github.com/declare-lab/tango
We uploaded several TANGO generated samples here: https://tango-web.github.io/
Limitations
TANGO is trained on the small AudioCaps dataset, so it may not generate good audio samples for concepts it has not seen in training (e.g., singing). For the same reason, TANGO is not always able to finely control its generations through the textual prompt. For example, the generations from TANGO for the prompts "Chopping tomatoes on a wooden table" and "Chopping potatoes on a metal table" are very similar. "Chopping vegetables on a table" also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.
We are training another version of TANGO on larger datasets to enhance its generalization, compositional, and controllable generation ability.
License
The model is released under the CC BY-NC-SA 4.0 license.