🚀 Model Card for EnCodec
This model card offers comprehensive details about EnCodec 32kHz, a cutting-edge real-time audio codec developed by Meta AI. It was specifically trained as part of the MusicGen project and is designed for use with MusicGen models.

🚀 Quick Start
To get started with the EnCodec model, use the following code with a dummy example from the LibriSpeech dataset (~9MB). First, install the required Python packages:
pip install --upgrade pip
pip install --upgrade transformers datasets[audio]
Then load an audio sample and run a forward pass of the model:
from datasets import load_dataset, Audio
from transformers import EncodecModel, AutoProcessor

# load a dummy audio sample from the LibriSpeech validation split
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# load the 32kHz model and its matching processor
model = EncodecModel.from_pretrained("facebook/encodec_32khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_32khz")

# resample the audio to the sampling rate the model expects
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))
audio_sample = librispeech_dummy[0]["audio"]["array"]

# pre-process the input, then encode and decode
inputs = processor(raw_audio=audio_sample, sampling_rate=processor.sampling_rate, return_tensors="pt")
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])
audio_values = model.decode(encoder_outputs.audio_codes, encoder_outputs.audio_scales, inputs["padding_mask"])[0]
# or, equivalently, with a single forward pass
audio_values = model(inputs["input_values"], inputs["padding_mask"]).audio_values
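To listen to the reconstruction, the decoded tensor can be written to disk. A minimal sketch, assuming mono output and that the soundfile package is installed:
import soundfile as sf
# detach from the autograd graph and drop singleton dims before writing
sf.write("reconstruction.wav", audio_values[0].detach().numpy().squeeze(), processor.sampling_rate)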
✨ Features
Model Details
Model Description
EnCodec is a high-fidelity audio codec leveraging neural networks. It introduces a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
The model simplifies and speeds up training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples.
It also includes a novel loss balancer mechanism that stabilizes training by decoupling the choice of hyperparameters from the typical scale of the loss.
Additionally, lightweight Transformer models are used to further compress the obtained representation while maintaining real-time performance. This variant of EnCodec was trained on 20K hours of music data, consisting of an internal dataset of 10K high-quality music tracks and the ShutterStock and Pond5 music datasets.
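To make the quantized latent space concrete: EnCodec discretizes its latent frames with residual vector quantization, where each codebook stage quantizes the residual left over by the previous stage. A toy, hedged sketch (hypothetical helper, not the released implementation):
import torch

def rvq_encode(latents, codebooks):
    # Toy residual vector quantization: each stage picks the nearest
    # codebook entry for the current residual, then subtracts it.
    residual = latents                      # (frames, dim)
    codes = []
    for codebook in codebooks:              # each codebook: (size, dim)
        distances = torch.cdist(residual, codebook)
        indices = distances.argmin(dim=-1)  # nearest entry per frame
        codes.append(indices)
        residual = residual - codebook[indices]
    return codes                            # one index tensor per stage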
| Property | Details |
|----------|---------|
| Developed by | Meta AI |
| Model Type | Audio Codec |
Model Sources
Uses
Direct Use
EnCodec can be used directly as an audio codec for real-time compression and decompression of audio signals.
It provides high-quality audio compression and efficient decoding. The model was trained at various bandwidths, which can be specified when encoding (compressing) and decoding (decompressing); a sketch follows the list below.
Two different setups exist for EnCodec:
- Non-streamable: the input audio is split into chunks of 1 second, with an overlap of 10 ms, which are then encoded.
- Streamable: weight normalization is used on the convolution layers, and the input is not split into chunks but rather padded on the left.
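As a hedged illustration of selecting a bandwidth at encode time (this sketch assumes the facebook/encodec_24khz checkpoint, which exposes several target bandwidths; valid values are listed in model.config.target_bandwidths):
from transformers import EncodecModel, AutoProcessor
import numpy as np

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# one second of silence as a stand-in for real audio
raw_audio = np.zeros(processor.sampling_rate, dtype=np.float32)
inputs = processor(raw_audio=raw_audio, sampling_rate=processor.sampling_rate, return_tensors="pt")

# request 6 kbps; the value must be one of the checkpoint's target bandwidths
encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=6.0)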
Downstream Use
This variant of EnCodec is designed to be used in conjunction with the official MusicGen checkpoints.
However, it can also be used standalone to encode audio files.
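For instance, the MusicGen models in Transformers use this codec to turn generated tokens back into waveforms; a minimal, hedged generation sketch (the facebook/musicgen-small checkpoint and max_new_tokens value are illustrative):
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(text=["lo-fi beat with mellow piano"], padding=True, return_tensors="pt")
# the generated token sequence is decoded to audio by the bundled EnCodec 32kHz codec
audio_values = model.generate(**inputs, max_new_tokens=256)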
🔧 Technical Details
EnCodec is a state - of - the - art real - time neural audio compression model that excels in producing high - fidelity audio samples at various sample rates and bandwidths.
The model's performance was evaluated across different settings, ranging from 24kHz monophonic at 1.5 kbps to 48kHz stereophonic, showcasing both subjective and
objective results. Notably, EnCodec incorporates a novel spectrogram - only adversarial loss, effectively reducing artifacts and enhancing sample quality.
Training stability and interpretability were further enhanced through the introduction of a gradient balancer for the loss weights.
Additionally, the study demonstrated that a compact Transformer model can be employed to achieve an additional bandwidth reduction of up to 40% without compromising
quality, particularly in applications where low latency is not critical (e.g., music streaming).
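To make the balancer idea concrete, here is a toy, hedged sketch (hypothetical helper, not the released implementation): each loss's gradient with respect to the model output is renormalized so that its contribution is set by its weight rather than by the raw scale of the loss.
import torch

def balanced_backward(losses, weights, model_output, ref_norm=1.0):
    # Toy loss balancer: rescale each loss's gradient w.r.t. the model
    # output so its norm is fixed by its weight, not by the loss scale.
    total_grad = torch.zeros_like(model_output)
    for loss, weight in zip(losses, weights):
        grad, = torch.autograd.grad(loss, model_output, retain_graph=True)
        total_grad += weight * ref_norm * grad / (grad.norm() + 1e-8)
    model_output.backward(total_grad)  # propagate the balanced gradient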
📚 Documentation
For evaluation results, refer to the MusicGen evaluation scores.
📄 License
Citation
BibTeX:
@misc{copet2023simple,
      title={Simple and Controllable Music Generation},
      author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
      year={2023},
      eprint={2306.05284},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}