TANGO: Text to Audio using iNstruction-Guided diffusiOn
TANGO is a latent diffusion model for text-to-audio generation. It can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. TANGO outperforms current state-of-the-art audio-generation models on both objective and subjective metrics. We are releasing our model, training and inference code, and pre-trained checkpoints for the research community.
We recently released Tango 2. Access it here.
We are releasing Tango-Full, which was pre-trained on TangoPromptBank.
Quick Start
Download the model and generate audio
Download the TANGO model and generate audio from a text prompt:
import IPython
import soundfile as sf
from tango import Tango
tango = Tango("declare-lab/tango-full-ft-audiocaps")
prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
[Audio sample: "An audience cheering and clapping"]
The model will be automatically downloaded and saved in cache. Subsequent runs will load the model directly from cache.
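The model identifier is a Hugging Face Hub repo id, so the download location can typically be redirected with the standard Hub cache variables. The following is a minimal sketch assuming the checkpoint is fetched through the Hugging Face Hub; the cache path is a placeholder for your own directory.

import os

# Optional: point the Hugging Face cache at a larger disk *before* loading TANGO.
# Assumes the checkpoint is downloaded via the Hugging Face Hub.
os.environ["HF_HOME"] = "/path/to/large/disk/hf_cache"  # placeholder path

from tango import Tango

tango = Tango("declare-lab/tango-full-ft-audiocaps")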
Adjust the number of steps
The generate function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps for better-quality audio, at the cost of increased run time.
prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
[Audio sample: "Rolling thunder with lightning strikes"]
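To gauge the speed/quality trade-off on your own hardware, a quick timing loop such as the sketch below works with the same generate call shown above; only the standard-library time module is added.

import time

prompt = "Rolling thunder with lightning strikes"
for steps in (100, 200):
    start = time.time()
    audio = tango.generate(prompt, steps=steps)
    print(f"{steps} steps took {time.time() - start:.1f} s")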
Generate multiple audio samples
Use the generate_for_batch function to generate multiple audio samples for a batch of text prompts:
prompts = [
"A car engine revving",
"A dog barks and rustles with some clicking",
"Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)
This will generate two samples for each of the three text prompts.
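To write the batch output to disk, iterate over the returned samples. This is a minimal sketch assuming generate_for_batch returns one list of waveforms per prompt (two per prompt here, matching samples=2); adjust the indexing if the return structure differs.

import soundfile as sf

for prompt, samples in zip(prompts, audios):
    for i, audio in enumerate(samples):
        # e.g. "A car engine revving_0.wav", "A car engine revving_1.wav", ...
        sf.write(f"{prompt}_{i}.wav", audio, samplerate=16000)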
Installation
Please follow the instructions in the repository for installation, usage, and experiments. Our code is released at https://github.com/declare-lab/tango
License
This project is licensed under the CC BY-NC-SA 4.0 license.
| Property | Details |
| --- | --- |
| Model Type | Latent diffusion model |
| Training Data | declare-lab/TangoPromptBank |