Emova Speech Tokenizer Hf

Developed by Emova-ollm
EMOVA Speech Tokenizer is a discrete speech tokenizer supporting both English and Chinese, featuring a semantic-acoustic decoupling design and flexible speech style control.
Downloads: 895
Release Time: 12/23/2024

Model Overview

This model is a discrete speech tokenizer comprising a Speech-to-Unit (S2U) tokenizer and a Unit-to-Speech (U2S) decoder. It enables seamless alignment across visual, linguistic, and speech modalities while supporting flexible control over speech style, including speaker, emotion, and pitch.
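The data flow described above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in (the `ToyUtterance` type, the `VOCAB` codebook, and the `toy_s2u` / `toy_u2s` functions are invented for illustration and are not the EMOVA API); the point is only that the S2U step keeps content and drops style, while the U2S step takes the style as a separate input.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Style:
    # Style axes named in the text; the string values are illustrative only.
    speaker: str
    emotion: str
    pitch: str


@dataclass(frozen=True)
class ToyUtterance:
    # Toy stand-in for a waveform: content and style are stored separately.
    words: Tuple[str, ...]
    style: Style


# Hypothetical unit codebook mapping content to discrete unit ids.
VOCAB: Dict[str, int] = {"hello": 0, "world": 1}
INVERSE_VOCAB: Dict[int, str] = {i: w for w, i in VOCAB.items()}


def toy_s2u(utterance: ToyUtterance) -> List[int]:
    """Speech-to-Unit: emit units from content only, ignoring the style."""
    return [VOCAB[w] for w in utterance.words]


def toy_u2s(units: List[int], style: Style) -> ToyUtterance:
    """Unit-to-Speech: rebuild an utterance from units plus a chosen style."""
    return ToyUtterance(tuple(INVERSE_VOCAB[u] for u in units), style)


sad_input = ToyUtterance(("hello", "world"), Style("female", "sad", "low"))
units = toy_s2u(sad_input)
happy_output = toy_u2s(units, Style("male", "happy", "high"))
```

Because the units carry no style, the same unit sequence can be decoded with any style the decoder supports.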

Model Features

Semantic-Acoustic Decoupling Design
Decouples the semantic content of input speech from its acoustic style and uses only the semantic content to generate speech tokens, so that the tokens align with the LLM's semantically rich embedding space
Bilingual Tokenization Support
Supports tokenizing both Chinese and English speech using the same speech codebook
Flexible Speech Style Control
Supports 24 speech style combinations (2 speakers × 3 pitch levels × 4 emotions)
Discrete Speech Tokenization
Discretizes speech into speech units via Finite Scalar Quantization (FSQ) for streamlined downstream processing
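As a rough illustration of the Finite Scalar Quantization step mentioned above, the sketch below bounds each latent dimension with `tanh`, rounds it to one of a fixed number of levels, and combines the per-dimension digits into a single unit index. The function name `fsq_quantize` and the level configuration `[5, 5, 5]` are assumptions chosen for the example, not EMOVA's actual settings, and only odd level counts are handled to keep the rounding symmetric.

```python
import math
from typing import List, Sequence, Tuple


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> Tuple[List[int], int]:
    """Quantize one latent vector; return per-dimension digits and a unit index."""
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")

    digits: List[int] = []
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only supports odd level counts")
        half = (n_levels - 1) // 2
        bounded = math.tanh(value) * half          # lies in (-half, half)
        digits.append(int(round(bounded)) + half)  # shifted to 0 .. n_levels - 1

    # Combine the digits as a mixed-radix number to get one codebook index.
    index, base = 0, 1
    for digit, n_levels in zip(digits, levels):
        index += digit * base
        base *= n_levels
    return digits, index


print(fsq_quantize([0.0, 0.0, 0.0], [5, 5, 5]))    # ([2, 2, 2], 62)
print(fsq_quantize([10.0, -10.0, 0.0], [5, 5, 5]))  # ([4, 0, 2], 54)
```

With levels `[5, 5, 5]` the implicit codebook has 5 × 5 × 5 = 125 entries, and no learned codebook lookup is needed to assign an index.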

Model Capabilities

Speech-to-Unit (S2U)
Unit-to-Speech (U2S)
Speech Style Control
English-Chinese Speech Processing

Use Cases

Speech Synthesis
Emotional Speech Synthesis
Generates speech with specified emotions based on input text and emotional parameters
Capable of producing angry, happy, neutral, and sad emotional speech
Multi-Style Speech Synthesis
Controls stylistic aspects like speaker, pitch, and speech rate in generated speech
Supports 24 different style combinations in speech output
Speech Processing
Speech Feature Extraction
Converts speech signals into discrete speech unit representations
Extracted phoneme and tone information can be used for subsequent speech processing tasks
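The figure of 24 style combinations quoted above follows from 2 speakers × 3 pitch levels × 4 emotions, which can be enumerated as a short sketch. The emotion names come from the text; the speaker and pitch labels are assumptions made for illustration, and the dictionary layout is not the model's actual conditioning format.

```python
from itertools import product
from typing import Dict, List

SPEAKERS = ("female", "male")                    # assumed labels for the 2 speakers
PITCHES = ("low", "normal", "high")              # assumed labels for the 3 pitch levels
EMOTIONS = ("angry", "happy", "neutral", "sad")  # the 4 emotions listed in the text


def style_combinations() -> List[Dict[str, str]]:
    """Enumerate every speaker/pitch/emotion combination."""
    return [
        {"speaker": speaker, "pitch": pitch, "emotion": emotion}
        for speaker, pitch, emotion in product(SPEAKERS, PITCHES, EMOTIONS)
    ]


combos = style_combinations()
print(len(combos))  # 24
```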