emova_speech_tokenizer_hf: Open-source Speech Tokenizer for Chinese and English with Flexible Speech Style Control

Developed by Emova-ollm

EMOVA Speech Tokenizer is a discrete speech tokenizer supporting both English and Chinese, featuring semantic-acoustic decoupling design and flexible speech style control.

Text-to-Audio

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Bilingual Speech Tokenization #Semantic-Acoustic Decoupling #Speech Style Control

Downloads 895

Release Time : 12/23/2024

Model Overview

This model is a discrete speech tokenizer comprising a Speech-to-Unit (S2U) tokenizer and a Unit-to-Speech (U2S) decoder. It enables seamless full-modal alignment across visual, linguistic, and speech modalities while supporting flexible speech style control including speaker, emotion, and pitch.

Model Features

Semantic-Acoustic Decoupling Design

Decouples the semantic content of input speech from its acoustic style and uses only the semantic content to generate speech tokens, so that the tokens align with the highly semantic embedding space of LLMs.

Bilingual Tokenization Support

Supports tokenizing both Chinese and English speech using the same speech codebook

Flexible Speech Style Control

Supports 24 speech style controls (2 speakers × 3 pitch levels × 4 emotions)
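
The count of 24 follows from the Cartesian product of the three style axes. The sketch below enumerates them, assuming the value names used in the usage example on this card (`female`/`male`, `normal`/`high`/`low`, and the four emotions). The helper name `enumerate_styles` is hypothetical and not part of the EMOVA code. Note that the usage example also passes a `speed` field, which is not part of the stated 24 combinations.

```python
from itertools import product

# Value names are taken from the usage example on this card.
SPEAKERS = ["female", "male"]
PITCHES = ["normal", "high", "low"]
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def enumerate_styles():
    """Return every (speaker, pitch, emotion) combination as a dict.

    Hypothetical helper for illustration only.
    """
    return [
        {"gender": gender, "pitch": pitch, "emotion": emotion}
        for gender, pitch, emotion in product(SPEAKERS, PITCHES, EMOTIONS)
    ]


styles = enumerate_styles()
print(len(styles))  # 2 * 3 * 4 combinations
```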

Discrete Speech Tokenization

Discretizes speech into speech units via Finite Scalar Quantization (FSQ) for streamlined downstream processing
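
To make the idea of finite scalar quantization concrete, here is a minimal sketch of the general technique: each latent dimension is bounded with `tanh`, rounded to one of a small number of integer levels, and the per-dimension codes are combined into a single unit index. This is not EMOVA's implementation. The function name `fsq_quantize` and the level counts `[5, 5, 5]` are assumptions chosen for illustration, and the sketch handles only odd level counts to keep the rounding symmetric.

```python
import math


def fsq_quantize(z, levels):
    """Quantize one latent vector with finite scalar quantization.

    z      : list of floats, one per latent dimension
    levels : list of odd ints, number of quantization levels per dimension
             (hypothetical values; the real configuration is not given here)

    Returns (codes, index): the per-dimension integer codes and a single
    mixed-radix index in [0, prod(levels) - 1].
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")

    codes = []
    index = 0
    basis = 1
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (n_levels - 1) // 2
        # Bound to (-half, half), then snap to the nearest integer level.
        code = round(math.tanh(value) * half)
        codes.append(code)
        # Shift the code to [0, n_levels - 1] and accumulate as a digit.
        index += (code + half) * basis
        basis *= n_levels
    return codes, index


codes, index = fsq_quantize([0.0, 3.0, -3.0], [5, 5, 5])
print(codes, index)
```

With three dimensions of five levels each, the implicit codebook has 5 × 5 × 5 = 125 entries, and no codebook needs to be learned or stored.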

Model Capabilities

Speech-to-Unit (S2U)

Unit-to-Speech (U2S)

Speech Style Control

English-Chinese Speech Processing

Use Cases

Speech Synthesis

Emotional Speech Synthesis

Generates speech with specified emotions based on input text and emotional parameters

Capable of producing angry, happy, neutral, and sad emotional speech

Multi-Style Speech Synthesis

Controls stylistic aspects like speaker, pitch, and speech rate in generated speech

Supports 24 different style combinations in speech output

Speech Processing

Speech Feature Extraction

Converts speech signals into discrete speech unit representations

Extracted phoneme and tone information can be used for subsequent speech processing tasks

🚀 EMOVA Speech Tokenizer HF

This repository offers an official speech tokenizer for training the EMOVA series of models. It supports seamless alignment across multiple modalities and flexible speech style control.

🚀 Quick Start

The EMOVA speech tokenizer can be deployed with the 🤗 HuggingFace transformers API. Complete the installation steps first.

✨ Features

Discrete speech tokenizer: it contains a SPIRAL-based speech-to-unit (S2U) tokenizer that captures both phonetic and tonal information of the input speech, which is then discretized by a finite scalar quantizer (FSQ) into discrete speech units, and a VITS-based unit-to-speech (U2S) de-tokenizer that reconstructs speech signals from speech units.
Semantic-acoustic disentanglement: to seamlessly align speech units with the highly semantic embedding space of LLMs, we decouple the semantic content and the acoustic style of the input speech, and only the semantic content is used to generate the speech tokens.
Bilingual tokenization: the EMOVA speech tokenizer supports both Chinese and English speech tokenization with the same speech codebook.
Flexible speech style control: thanks to the semantic-acoustic disentanglement, the EMOVA speech tokenizer supports 24 speech style controls (i.e., 2 speakers, 3 pitches, and 4 emotions).

🤗 HuggingFace | 📄 Paper | 🌐 Project-Page | 💻 Github | 💻 EMOVA-Github

📦 Installation

Clone this repo and create the EMOVA virtual environment with conda. Our code has been validated on NVIDIA A800/H20 GPU and Ascend 910B3 NPU servers. Other devices may work as well.

Initialize the conda environment:

git clone https://github.com/emova-ollm/EMOVA_speech_tokenizer.git
conda create -n emova python=3.10 -y
conda activate emova

Install the required packages (note that the instructions differ between GPUs and NPUs):

# upgrade pip and setuptools if necessary
pip install -U pip setuptools

cd EMOVA_speech_tokenizer
pip install -e . # for NVIDIA GPUs (e.g., A800 and H20)
pip install -e .[npu] # OR for Ascend NPUs (e.g., 910B3)

💻 Usage Examples

Basic Usage

import random
from transformers import AutoModel
import torch

### Uncomment if you want to use Ascend NPUs
# import torch_npu
# from torch_npu.contrib import transfer_to_npu

# load pretrained model
model = AutoModel.from_pretrained("Emova-ollm/emova_speech_tokenizer_hf", torch_dtype=torch.float32, trust_remote_code=True).eval().cuda()

# S2U
wav_file = "./examples/s2u/example.wav"
speech_unit = model.encode(wav_file)
print(speech_unit)

# U2S
emotion = random.choice(['angry', 'happy', 'neutral', 'sad'])
speed = random.choice(['normal', 'fast', 'slow'])
pitch = random.choice(['normal', 'high', 'low'])
gender = random.choice(['female', 'male'])
condition = f'gender-{gender}_emotion-{emotion}_speed-{speed}_pitch-{pitch}'

output_wav_file = f'./examples/u2s/{condition}_output.wav'
model.decode(speech_unit, condition=condition, output_wav_file=output_wav_file)
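
The condition string in the example above follows the pattern `gender-..._emotion-..._speed-..._pitch-...`. As a small convenience, the sketch below builds that string and rejects values that do not appear in the example. The helper `build_condition` and the `ALLOWED` table are hypothetical and not part of the EMOVA API; the allowed values are simply those listed in the example, and the model may or may not accept others.

```python
# Allowed values are copied from the usage example; treat them as an assumption.
ALLOWED = {
    "gender": ("female", "male"),
    "emotion": ("angry", "happy", "neutral", "sad"),
    "speed": ("normal", "fast", "slow"),
    "pitch": ("normal", "high", "low"),
}


def build_condition(gender="female", emotion="neutral", speed="normal", pitch="normal"):
    """Build a style condition string in the format used by the usage example.

    Hypothetical helper: raises ValueError for any value not listed in ALLOWED.
    """
    chosen = {"gender": gender, "emotion": emotion, "speed": speed, "pitch": pitch}
    for field, value in chosen.items():
        if value not in ALLOWED[field]:
            raise ValueError(
                f"unsupported {field} {value!r}; expected one of {ALLOWED[field]}"
            )
    # Dicts preserve insertion order, so the fields appear as gender, emotion, speed, pitch.
    return "_".join(f"{field}-{value}" for field, value in chosen.items())


condition = build_condition(gender="male", emotion="happy", speed="fast", pitch="low")
print(condition)
```

The resulting string can be passed as the `condition` argument of `model.decode` in place of the randomly chosen one.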

📚 Documentation

If you find our model, code, or paper helpful, please consider citing our paper and starring the repository!

@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Library Name	transformers
Tags	speech, tokenization
License	apache-2.0
Language	en, zh
Base Model	Emova-ollm/emova_speech_tokenizer
