🚀 Bark
Bark is a transformer-based text-to-audio model developed by Suno. It can generate highly realistic, multilingual speech, as well as other audio types including music, background noise, and simple sound effects. Additionally, the model can produce nonverbal communications like laughing, sighing, and crying. To support the research community, we are providing access to pretrained model checkpoints ready for inference.
The original GitHub repository and model card can be found here.
This model is intended for research purposes only. The model output is not censored, and the authors do not endorse the opinions in the generated content. Use at your own risk.
Two checkpoints are released:
- `suno/bark-small`: a smaller, faster version.
- `suno/bark`: the full-sized model.
🚀 Quick Start
You can try out Bark through the following methods:
- Bark Colab:
- Hugging Face Colab:
- Hugging Face Demo:
✨ Features
- Generate highly realistic, multilingual speech.
- Produce various types of audio, such as music, background noise, and simple sound effects.
- Create nonverbal communications like laughing, sighing, and crying.
📦 Installation
🤗 Transformers Installation
You can run Bark locally with the 🤗 Transformers library from version 4.31.0 onwards.
- First, install the 🤗 Transformers library from the main branch:
pip install git+https://github.com/huggingface/transformers.git
Suno Library Installation
You can also run Bark locally through the original Bark library.
- Install the `bark` library.
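A typical install pulls the library straight from the official GitHub repository (assuming `pip` is available):

```shell
pip install git+https://github.com/suno-ai/bark.git
```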
💻 Usage Examples
🤗 Transformers Usage
Run the following Python code to generate speech samples:
from transformers import AutoProcessor, AutoModel
processor = AutoProcessor.from_pretrained("suno/bark-small")
model = AutoModel.from_pretrained("suno/bark-small")
inputs = processor(
text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
return_tensors="pt",
)
speech_values = model.generate(**inputs, do_sample=True)
Listen to the speech samples either directly in a Jupyter notebook:
from IPython.display import Audio
sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
Or save them as a `.wav` file using a third-party library such as `scipy`:
import scipy
sampling_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cpu().numpy().squeeze())
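`scipy.io.wavfile.write` accepts float arrays, but some audio players only handle 16-bit PCM reliably. A minimal conversion helper (the `to_int16_pcm` name and the dummy waveform are illustrative, standing in for the generated `speech_values`):

```python
import numpy as np

def to_int16_pcm(audio: np.ndarray) -> np.ndarray:
    """Convert float audio in [-1, 1] to 16-bit PCM for wider player compatibility."""
    audio = np.clip(audio, -1.0, 1.0)
    return (audio * 32767).astype(np.int16)

# Dummy one-second sine wave standing in for model output:
dummy = np.sin(np.linspace(0, 2 * np.pi, 24_000, dtype=np.float32))
pcm = to_int16_pcm(dummy)
```

The resulting `pcm` array can then be passed to `scipy.io.wavfile.write` in place of the raw float output.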
Suno Usage
Run the following Python code:
from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio
preload_models()
text_prompt = """
Hello, my name is Suno. And, uh — and I like pizza. [laughs]
But I also have other interests such as playing tic tac toe.
"""
speech_array = generate_audio(text_prompt)
Audio(speech_array, rate=SAMPLE_RATE)
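Prompts can include the nonverbal cues mentioned above. The cue tokens below are the ones the Bark README lists as generally recognized; actual rendering varies from generation to generation:

```python
# Cues listed in the Bark README (results vary):
#   [laughter], [laughs], [sighs], [music], [gasps], [clears throat]
#   "—" or "..." for hesitations, "♪" around song lyrics,
#   CAPITALIZATION for emphasis of a word.
text_prompt = "Well... [sighs] I suppose we COULD try that. ♪ La la la ♪"
```

This string can be passed to `generate_audio` exactly like the plain prompt in the example above.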
To save `speech_array` as a WAV file:
from scipy.io.wavfile import write as write_wav
write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)
📚 Documentation
For more details on using the Bark model for inference with the 🤗 Transformers library, refer to the Bark docs.
🔧 Technical Details
Model Structure
Bark consists of three transformer models that convert text into audio.
Text to semantic tokens
- Input: Text.
- Output: Semantic tokens that encode the audio to be generated.
Semantic to coarse tokens
- Input: Semantic tokens.
- Output: Tokens from the first two codebooks of Facebook's EnCodec codec.
Coarse to fine tokens
- Input: The first two codebooks from EnCodec.
- Output: 8 codebooks from EnCodec.
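The three-stage token flow can be sketched with stub functions. The stage names and dimensions follow the description above; the stub bodies are hypothetical placeholders (random tokens), not the real models:

```python
import numpy as np

SEMANTIC_VOCAB = 10_000   # output vocab of the text-to-semantic stage
CODEBOOK_SIZE = 1_024     # size of each EnCodec codebook
N_COARSE, N_FINE = 2, 8   # codebooks produced by the coarse and fine stages

def text_to_semantic(text: str, n_tokens: int = 256) -> np.ndarray:
    # Stub: the real model autoregressively predicts semantic tokens from text.
    rng = np.random.default_rng(0)
    return rng.integers(0, SEMANTIC_VOCAB, size=n_tokens)

def semantic_to_coarse(semantic: np.ndarray) -> np.ndarray:
    # Stub: the real model causally predicts the first two EnCodec codebooks.
    rng = np.random.default_rng(1)
    return rng.integers(0, CODEBOOK_SIZE, size=(N_COARSE, semantic.size))

def coarse_to_fine(coarse: np.ndarray) -> np.ndarray:
    # Stub: the real model non-causally fills in all eight codebooks.
    rng = np.random.default_rng(2)
    fine = rng.integers(0, CODEBOOK_SIZE, size=(N_FINE, coarse.shape[1]))
    fine[:N_COARSE] = coarse  # the coarse codebooks are carried over
    return fine

semantic = text_to_semantic("Hello")
coarse = semantic_to_coarse(semantic)
fine = coarse_to_fine(coarse)  # 8 codebooks, decodable to audio by EnCodec
```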
Architecture
| Model | Parameters | Attention | Output vocab size |
|---|---|---|---|
| Text to semantic tokens | 80/300 M | Causal | 10,000 |
| Semantic to coarse tokens | 80/300 M | Causal | 2x 1,024 |
| Coarse to fine tokens | 80/300 M | Non-causal | 6x 1,024 |
Release Date
April 2023
📄 License
This model is released under the "cc-by-nc-4.0" license.
💡 Usage Tip
To further reduce the chances of unintended use, we also release a simple classifier that detects Bark-generated audio with high accuracy (see the notebooks section of the main repository).