TangoFlux: Open-Source Text-to-Audio Model for Fast, High-Quality Audio Generation

TangoFlux

Developed by declare-lab

TangoFlux is an efficient text-to-audio generation system that combines flow matching with CLAP-Ranked Preference Optimization (CRPO) to produce high-quality audio quickly.

Audio Generation #Ultra-fast audio generation #High-fidelity text-to-audio #Flow matching technology

Downloads 727

Release Date: 12/24/2024

Model Overview

TangoFlux generates 44.1kHz audio up to 30 seconds long using FluxTransformer blocks (Diffusion Transformer and Multimodal Diffusion Transformer), conditioned on text prompts and duration embeddings.
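
As a quick arithmetic check on what these figures imply, the sketch below computes how many samples per channel a clip holds at 44.1kHz. The helper name `num_samples` is hypothetical and is not part of the TangoFlux package; it only illustrates the relationship between duration and sample count.

```python
SAMPLE_RATE_HZ = 44_100  # sampling rate stated in the overview
MAX_DURATION_S = 30      # maximum clip length stated in the overview


def num_samples(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Samples per channel for a clip of the given duration (hypothetical helper)."""
    if not 0 < duration_s <= MAX_DURATION_S:
        raise ValueError("duration must be greater than 0 and at most 30 seconds")
    return round(duration_s * sample_rate_hz)


print(num_samples(10))  # 441000
print(num_samples(30))  # 1323000
```

A 30-second clip therefore holds about 1.3 million samples per channel.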

Model Features

Ultra-fast generation

Samples from the flow model in 25 steps by default; 50 steps are recommended for higher quality, at the cost of longer run time.

High-fidelity audio

Generates audio at a 44.1kHz sampling rate, in clips up to 30 seconds long.

Multimodal support

Conditions generation on two inputs: a text prompt and a duration embedding.

Three-stage training process

Training proceeds through pre-training, fine-tuning, and preference optimization; the final stage uses CRPO (CLAP-Ranked Preference Optimization) to align the model.

Model Capabilities

Text-to-audio generation

High-fidelity audio generation

Multimodal input processing

Use Cases

Creative content generation

Sound effect generation

Generates specific sound effects based on text descriptions, such as 'a hammer slowly hitting a wooden table'.

Produces high-quality audio files that match the description.

Multimedia applications

Background music generation

Generates background music for videos or games.

Produces background music that matches the scene.

🚀 TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

TangoFlux is a text-to-audio generation model that produces high-quality audio quickly while staying faithful to the text prompt.

✨ Features

TangoFlux consists of FluxTransformer blocks (Diffusion Transformer (DiT) and Multimodal Diffusion Transformer (MMDiT)). It generates 44.1kHz audio up to 30 seconds, conditioned on textual prompts and duration embeddings.
It learns a rectified flow trajectory from audio latent representations encoded by a variational autoencoder (VAE).
The training pipeline has three stages: pre-training, fine-tuning, and preference optimization.
It is aligned via CRPO, which iteratively generates new synthetic data and constructs preference pairs for preference optimization.
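
The pair-construction step of CRPO can be sketched as follows. This is a toy illustration, not the project's training code: it assumes each pair is formed from the best- and worst-scoring candidates for a prompt, uses text labels as stand-ins for generated audio clips, and replaces the CLAP similarity score with a hypothetical word-overlap function (`toy_score`). The name `build_preference_pair` is also hypothetical.

```python
from typing import Callable, Sequence, Tuple


def build_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Return (chosen, rejected): the best- and worst-scoring candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
    return ranked[0], ranked[-1]


def toy_score(prompt: str, candidate: str) -> float:
    """Stand-in for a CLAP text-audio similarity: fraction of prompt words matched."""
    prompt_words = set(prompt.lower().split())
    return len(prompt_words & set(candidate.lower().split())) / len(prompt_words)


prompt = "hammer hitting wooden table"
# Text labels stand in for generated audio clips in this toy example.
candidates = ["hammer hitting table", "dog barking", "hammer hitting wooden table"]
chosen, rejected = build_preference_pair(prompt, candidates, toy_score)
print(chosen)    # hammer hitting wooden table
print(rejected)  # dog barking
```

In an iterative setup, pairs like these would be regenerated from the current model at each round and fed to the preference-optimization objective.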

📦 Installation

Get TangoFlux from our GitHub repo https://github.com/declare-lab/TangoFlux with:

pip install git+https://github.com/declare-lab/TangoFlux

The model will be automatically downloaded and saved in a cache. Subsequent runs will load the model directly from the cache.

💻 Usage Examples

Basic Usage

The generate function uses 25 steps by default to sample from the flow model. We recommend 50 steps for better-quality audio, at the cost of increased run time.

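import torchaudio
from tangoflux import TangoFluxInference
from IPython.display import Audio

# Load the pretrained checkpoint (downloaded and cached on first use).
model = TangoFluxInference(name='declare-lab/TangoFlux')

# Generate a 10-second clip using 50 sampling steps.
audio = model.generate('Hammer slowly hitting the wooden table', steps=50, duration=10)

# Save the clip to disk at the model's 44.1kHz sampling rate.
torchaudio.save('output.wav', audio, 44100)

# Play the clip inline when running in a notebook.
Audio(data=audio, rate=44100)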
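
To see why the step count trades run time for quality, the sketch below integrates an ordinary differential equation with fixed-size Euler steps, which is one common way to sample from a flow model. It is a minimal illustration under stated assumptions: the function `euler_sample` is hypothetical, TangoFlux's actual sampler may differ, and a toy velocity field with a known solution stands in for the learned model.

```python
import math
from typing import Callable


def euler_sample(
    velocity: Callable[[float, float], float],
    x0: float,
    steps: int,
) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x += dt * velocity(x, t)
    return x


def toy_velocity(x: float, t: float) -> float:
    """Toy field with a known solution: dx/dt = x and x(0) = 1 give x(1) = e."""
    return x


exact = math.e
err_25 = abs(euler_sample(toy_velocity, 1.0, steps=25) - exact)
err_50 = abs(euler_sample(toy_velocity, 1.0, steps=50) - exact)
print(f"error with 25 steps: {err_25:.4f}")
print(f"error with 50 steps: {err_50:.4f}")
```

Doubling the step count roughly halves the discretization error in this toy case while doubling the number of velocity evaluations, which mirrors the quality versus run-time trade-off described above.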

📄 License

The TangoFlux checkpoints are for non-commercial research use only. They are subject to [Stable Audio Open’s license](https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md), [WavCaps’ license](https://github.com/XinhaoMei/WavCaps?tab=readme-ov-file#license), and the original licenses accompanying each training dataset.

📚 Documentation

Datasets

cvssp/WavCaps
declare-lab/CRPO

Pipeline Tag

text-to-audio

Citation

https://arxiv.org/abs/2412.21037

@misc{hung2024tangofluxsuperfastfaithful,
      title={TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization}, 
      author={Chia-Yu Hung and Navonil Majumder and Zhifeng Kong and Ambuj Mehrish and Rafael Valle and Bryan Catanzaro and Soujanya Poria},
      year={2024},
      eprint={2412.21037},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2412.21037}, 
}
