Bark Open-Source Text-to-Audio Model - Free Generation of Highly Realistic Multilingual Voices and Sound Effects


Developed by walterheart

Bark is a Transformer-based text-to-audio model created by Suno, capable of generating highly realistic multilingual speech, music, background noise, and sound effects.

Speech Synthesis

PyTorch

Supports Multiple Languages

Open Source License: MIT

#Multilingual Speech Synthesis #Emotional Sound Effects Generation #High-Fidelity Audio

Downloads 20

Release Time : 4/30/2025

Model Overview

Bark is a text-to-audio model that generates multilingual speech, music, background noise, and simple sound effects, as well as nonverbal sounds such as laughter, sighs, and crying.

Model Features

Multilingual Support

Supports speech generation in 13 languages, including major European and Asian languages

Versatile Audio Generation

Capable of generating not only speech but also music, background noise, and simple sound effects

Nonverbal Communication

Can generate nonverbal communication sounds such as laughter, sighs, and crying

High-Quality Output

Generates audio at a 24 kHz sampling rate
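
As a rough illustration of what a 24 kHz sampling rate means in practice, the sketch below converts a sample count into clip length and raw 16-bit PCM size. The helper names are hypothetical and not part of Bark, and mono audio is assumed.

```python
SAMPLE_RATE_HZ = 24_000  # sampling rate stated in the feature list


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Length of a mono clip in seconds (hypothetical helper)."""
    return num_samples / sample_rate


def pcm16_size_bytes(num_samples: int) -> int:
    """Raw size of a mono 16-bit PCM clip: 2 bytes per sample, no file header."""
    return num_samples * 2


print(duration_seconds(240_000))  # 10.0
print(pcm16_size_bytes(240_000))  # 480000
```

Ten seconds of mono output therefore comes to 240,000 samples, or roughly 480 KB as uncompressed 16-bit PCM.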

Model Capabilities

Text-to-Speech

Multilingual Speech Synthesis

Background Music Generation

Sound Effects Generation

Nonverbal Sound Generation

Use Cases

Assistive Tools

Voice Assistive Applications

Provides voice output for visually impaired individuals or those with reading difficulties

Highly realistic voice output

Content Creation

Podcasts and Audiobooks

Automatically generates multilingual audio content and narration

Natural and fluent voice output

Game Sound Effects

Generates background music and sound effects for games

Diverse audio effects

🚀 Bark

Bark is a transformer-based text-to-audio model developed by Suno. It can generate highly realistic, multilingual speech, as well as other audio including music, background noise, and simple sound effects. Additionally, the model can produce nonverbal communications such as laughing, sighing, and crying. To support the research community, we offer access to pretrained model checkpoints ready for inference.

The original GitHub repository and model card can be found here.

This model is intended for research purposes only. The model output is not censored, and the authors do not endorse the opinions in the generated content. Use at your own risk.

Two checkpoints are released:

🚀 Quick Start

You can try out Bark in any of the following ways:

Bark Colab:
Hugging Face Colab:
Hugging Face Demo:

✨ Features

Generate highly realistic, multilingual speech.
Produce other audio including music, background noise, and simple sound effects.
Create nonverbal communications like laughing, sighing, and crying.

📦 Installation

Transformers Installation

You can run Bark locally with the 🤗 Transformers library from version 4.31.0 onwards.

First install the 🤗 Transformers library and scipy:

pip install --upgrade pip
pip install --upgrade transformers scipy

Suno Installation

You can also run Bark locally through the original Bark library:

First install the bark library

💻 Usage Examples

Transformers Usage

Basic Usage

Run inference via the Text-to-Speech (TTS) pipeline.

from transformers import pipeline
import scipy.io.wavfile

synthesiser = pipeline("text-to-speech", "suno/bark")

speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})

scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])

Advanced Usage

Run inference via the Transformers modelling code for more fine-grained control.

from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = AutoModel.from_pretrained("suno/bark")

inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)

Listening or Saving Speech Samples

from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)

Or save them as a .wav file using a third-party library, e.g. scipy:

import scipy.io.wavfile

sampling_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cpu().numpy().squeeze())
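
If a player has trouble with a WAV file written directly from floating-point samples, one possible workaround is to convert to 16-bit PCM first. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the 24,000 Hz default is an assumption based on the sampling rate stated earlier, and a short stand-in list replaces real model output.

```python
import io
import struct
import wave


def float_to_pcm16_bytes(samples):
    """Clip floats to [-1.0, 1.0] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


def write_wav_pcm16(target, samples, sample_rate=24_000):
    """Write mono 16-bit PCM WAV data to a file path or a binary file-like object."""
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(float_to_pcm16_bytes(samples))


# Stand-in signal; with Bark you might instead pass
# speech_values.cpu().numpy().squeeze().tolist()
buffer = io.BytesIO()
write_wav_pcm16(buffer, [0.0, 0.5, -0.5, 2.0])
```

Passing a file name such as "bark_out.wav" instead of the in-memory buffer writes the result to disk.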

Suno Usage

from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
     Hello, my name is Suno. And, uh — and I like pizza. [laughs] 
     But I also have other interests such as playing tic tac toe.
"""
speech_array = generate_audio(text_prompt)

# play audio in notebook
Audio(speech_array, rate=SAMPLE_RATE)

To save speech_array as a WAV file:

from scipy.io.wavfile import write as write_wav

write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)

📚 Documentation

For more details on running Bark inference with the 🤗 Transformers library, refer to the Bark docs.

🔧 Technical Details

Model Structure

Bark is a series of three transformer models that turn text into audio.

Text to semantic tokens:
- Input: text, tokenized with the BERT tokenizer from Hugging Face
- Output: semantic tokens that encode the audio to be generated
Semantic to coarse tokens:
- Input: semantic tokens
- Output: tokens from the first two codebooks of the EnCodec codec from Facebook
Coarse to fine tokens:
- Input: the first two codebooks from EnCodec
- Output: 8 codebooks from EnCodec
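
The data flow through the three stages can be sketched with stand-in functions. This is a toy illustration only: each stage emits random token IDs instead of running a transformer, the function names are hypothetical, and the one-frame-per-word and one-frame-per-semantic-token mappings are simplifying assumptions, not Bark's actual token rates. The vocabulary and codebook sizes come from the architecture table.

```python
import random

SEMANTIC_VOCAB = 10_000  # output vocab size of stage 1
CODEBOOK_SIZE = 1_024    # entries per EnCodec codebook
N_COARSE, N_FINE = 2, 8  # codebooks available after stage 2 and stage 3


def text_to_semantic(text, rng):
    # Stand-in: one random semantic token per whitespace-separated word.
    return [rng.randrange(SEMANTIC_VOCAB) for _ in text.split()]


def semantic_to_coarse(semantic, rng):
    # Stand-in: N_COARSE codebook rows, one frame per semantic token.
    return [[rng.randrange(CODEBOOK_SIZE) for _ in semantic] for _ in range(N_COARSE)]


def coarse_to_fine(coarse, rng):
    # Keep the two coarse rows and fill in the six remaining codebooks.
    n_frames = len(coarse[0])
    extra = [
        [rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
        for _ in range(N_FINE - N_COARSE)
    ]
    return coarse + extra


rng = random.Random(0)
semantic = text_to_semantic("Hello, my name is Suno.", rng)
coarse = semantic_to_coarse(semantic, rng)
fine = coarse_to_fine(coarse, rng)
print(len(semantic), len(coarse), len(fine))  # 5 2 8
```

In the real model, the eight codebook rows would then be decoded to a waveform by the EnCodec decoder, a step this sketch leaves out.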

Architecture

Model	Parameters	Attention	Output Vocab size
Text to semantic tokens	80/300 M	Causal	10,000
Semantic to coarse tokens	80/300 M	Causal	2x 1,024
Coarse to fine tokens	80/300 M	Non-causal	6x 1,024
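
The figures in the table can be tallied with a few lines of arithmetic. The sketch below assumes that "80/300 M" gives the per-stage parameter counts of the smaller and larger checkpoint respectively, and that "2x 1,024" and "6x 1,024" mean two and six output groups of 1,024 entries each; neither reading is stated explicitly above.

```python
# Rows of the architecture table:
# (stage, small params in M, large params in M, output groups, entries per group)
STAGES = [
    ("text_to_semantic", 80, 300, 1, 10_000),
    ("semantic_to_coarse", 80, 300, 2, 1_024),
    ("coarse_to_fine", 80, 300, 6, 1_024),
]

# Nominal totals across the three stages, in millions of parameters.
small_total = sum(row[1] for row in STAGES)
large_total = sum(row[2] for row in STAGES)

# Total number of output entries per stage (groups x entries per group).
output_sizes = {row[0]: row[3] * row[4] for row in STAGES}

print(small_total, large_total)  # 240 900
print(output_sizes)
```

Under these assumptions the three stages together come to a nominal 240 M or 900 M parameters, not counting the EnCodec codec.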

Release date

April 2023

📄 License

This project is licensed under the MIT license.

⚠️ Important Note

This model is meant for research purposes only. The model output is not censored and the authors do not endorse the opinions in the generated content. Use at your own risk.

💡 Usage Tip

We anticipate that this model's text to audio capabilities can be used to improve accessibility tools in a variety of languages. While we hope that this release will enable users to express their creativity and build applications that are a force for good, we acknowledge that any text to audio model has the potential for dual use. To further reduce the chances of unintended use of Bark, we also release a simple classifier to detect Bark-generated audio with high accuracy (see notebooks section of the main repository).
