MetaVoice-1B-v0.1 Open-Source TTS Model - Generate Emotional English Voices, Support Voice Cloning and Long Text Synthesis

MetaVoice-1B-v0.1

Developed by metavoiceio

MetaVoice-1B is a 1.2 billion parameter text-to-speech (TTS) foundation model trained on 100,000 hours of speech data, specializing in generating emotional English speech with support for voice cloning and long-form synthesis.

Speech Synthesis | English | Open Source License: Apache-2.0 | #Zero-shot Voice Cloning #Emotional TTS #Few-shot Fine-tuning

Downloads 571

Release Time : 2/6/2024

Model Overview

MetaVoice-1B is a foundation model designed for text-to-speech tasks, capable of generating English speech with emotional rhythm and intonation, supporting voice cloning and long-form synthesis.

Model Features

Emotional Speech Generation

Generates English speech with emotional rhythm and intonation, without hallucinated content.

Voice Cloning

Supports voice cloning through fine-tuning, with successful results reported from as little as 1 minute of training data for Indian speakers. Zero-shot cloning of American and British voices needs only 30 seconds of reference audio and no fine-tuning.

Long-form Synthesis

Supports long-form synthesis, with arbitrary-length TTS functionality coming soon.

Efficient Inference

Supports KV caching via Flash Decoding, as well as batching, including batches of texts with different lengths.

Model Capabilities

Text-to-Speech

Voice Cloning

Long-form Synthesis

Use Cases

Speech Synthesis

Personalized Voice Assistants

Generate personalized voices for voice assistants to enhance user experience.

Produces natural, emotional speech.

Audiobooks

Convert text content into speech for audiobook production.

Supports long-form synthesis, generating high-quality speech.

Voice Cloning

Voice Cloning Services

Clone a specific speaker's voice with minimal samples.

Fine-tuning has worked with as little as 1 minute of training data for Indian speakers; zero-shot cloning of American and British voices needs only 30 seconds of reference audio.

🚀 MetaVoice-1B

MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It offers emotional speech rhythm and tone in English, voice cloning support, zero-shot cloning for American and British voices, and long-form synthesis capabilities.

✨ Features

Emotional Speech: Delivers emotional speech rhythm and tone in English without hallucinations.
Voice Cloning: Supports voice cloning with fine-tuning. Cloning has succeeded with as little as 1 minute of training data for Indian speakers.
Zero-shot Cloning: Enables zero-shot cloning for American and British voices using 30s reference audio.
Long-form Synthesis: Provides support for long-form synthesis.

💻 Usage Examples

Basic Usage

See GitHub for the latest usage instructions.

📚 Documentation

Finetuning

See GitHub for the latest fine-tuning instructions.

Upcoming Features

Long-form / arbitrary-length TTS
Streaming

🔧 Technical Details

Architecture

We predict EnCodec tokens from text and speaker information. These tokens are then diffused up to the waveform level, with post-processing applied to clean up the audio.

GPT Prediction: We use a causal GPT to predict the first two hierarchies of EnCodec tokens. Text and audio are part of the LLM context. Speaker information is passed via conditioning at the token embedding layer. This speaker conditioning is obtained from a separately trained speaker verification network.
- The two hierarchies are predicted in a "flattened interleaved" manner: we predict the first token of the first hierarchy, then the first token of the second hierarchy, then the second token of the first hierarchy, and so on.
- We use condition-free sampling to boost the cloning capability of the model.
- The text is tokenised using a custom trained BPE tokeniser with 512 tokens.
- Note that we've skipped predicting semantic tokens as done in other works, as we found that this isn't strictly necessary.
Transformer Prediction: We use a non-causal (encoder-style) transformer to predict the remaining 6 hierarchies from the first two. This is a very small model (~10M parameters) and generalises zero-shot to most speakers we've tried. Since it's non-causal, we're also able to predict all the timesteps in parallel.
Waveform Generation: We use multi-band diffusion to generate waveforms from the EnCodec tokens. We noticed that the resulting speech is clearer than with the original RVQ decoder or Vocos. However, diffusion at the waveform level leaves some background artifacts that are quite unpleasant to the ear. We clean these up in the next step.
Artifact Cleaning: We use DeepFilterNet to clean up the artifacts introduced by the multi-band diffusion.
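
The "flattened interleaved" ordering used in the GPT stage can be illustrated with a minimal sketch. The function names `interleave` and `deinterleave` are hypothetical and are not taken from the MetaVoice codebase, and the token values are made up. The real model may also offset token IDs per hierarchy, which is not shown here.

```python
from typing import List, Sequence, Tuple


def interleave(h1: Sequence[int], h2: Sequence[int]) -> List[int]:
    """Flatten two equal-length token hierarchies into one sequence.

    Output order: h1[0], h2[0], h1[1], h2[1], ...
    (Hypothetical helper, for illustration only.)
    """
    if len(h1) != len(h2):
        raise ValueError("both hierarchies must cover the same timesteps")
    flat: List[int] = []
    for first, second in zip(h1, h2):
        flat.extend((first, second))
    return flat


def deinterleave(flat: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Recover the two hierarchies from a flattened interleaved sequence."""
    if len(flat) % 2 != 0:
        raise ValueError("flattened sequence must have an even length")
    # Even positions hold hierarchy 1, odd positions hold hierarchy 2.
    return list(flat[0::2]), list(flat[1::2])


if __name__ == "__main__":
    hierarchy_1 = [11, 12, 13]  # made-up token IDs
    hierarchy_2 = [21, 22, 23]
    flat = interleave(hierarchy_1, hierarchy_2)
    print(flat)
    print(deinterleave(flat))
```

One consequence of this layout is that the causal model emits two tokens per audio timestep, so the flattened sequence is twice as long as either hierarchy on its own.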

Optimizations

The model supports:

KV-caching via Flash Decoding
Batching (including texts of different lengths)
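
The source does not say how batching over texts of different lengths is implemented. One common approach, shown here purely as a sketch under that assumption, is to right-pad every tokenised text to the longest length in the batch and keep a boolean mask so that padded positions can be ignored. The names `pad_batch` and `PAD_ID` are hypothetical, not part of MetaVoice.

```python
from typing import List, Sequence, Tuple

PAD_ID = 0  # hypothetical padding token ID; the real value is not given in the source


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_id: int = PAD_ID
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Right-pad token sequences to a common length.

    Returns the padded token rows and a mask that is True at real
    tokens and False at padding.
    """
    if not sequences:
        return [], []
    max_len = max(len(seq) for seq in sequences)
    tokens: List[List[int]] = []
    mask: List[List[bool]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        tokens.append(list(seq) + [pad_id] * n_pad)
        mask.append([True] * len(seq) + [False] * n_pad)
    return tokens, mask


if __name__ == "__main__":
    batch = [[5, 6, 7], [8], [9, 10]]  # made-up token IDs
    padded, attention_mask = pad_batch(batch)
    print(padded)
    print(attention_mask)
```

The mask is what lets one forward pass serve the whole batch: attention and loss computations can skip the positions where it is False.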

📄 License

We’re releasing MetaVoice-1B under the Apache 2.0 license. It can be used without restrictions.
