XCodec2 Open-source Speech Tokenizer - Supports Multilingual Speech Understanding and High-quality Reconstruction

Xcodec2

Developed by HKUSTAudio

XCodec2 is a speech tokenizer that supports multilingual speech semantic understanding and high-quality speech reconstruction.

Speech Synthesis

Safetensors

#Voice Tokenizer #High-Quality Voice Reconstruction #Multilingual Voice Understanding

Downloads 32.36k

Release Time: 1/7/2025

Model Overview

XCodec2 is a speech tokenizer associated with the LLaSA work on scaling train-time and inference-time compute for LLaMA-based speech synthesis. It uses single vector quantization at 50 tokens per second and supports multilingual speech semantic understanding and high-quality speech reconstruction.

Model Features

Single Vector Quantization

Supports efficient voice encoding and decoding

Efficient Token Generation

Encodes speech at a rate of 50 tokens per second of audio

Multilingual Support

Supports multilingual voice semantic understanding and reconstruction

High-Quality Reconstruction

Achieves high-quality voice reconstruction

Model Capabilities

Voice Encoding

Voice Decoding

Voice Semantic Understanding

Voice Reconstruction

Use Cases

Voice Processing

Voice Compression and Reconstruction

Compresses voice signals into tokens and reconstructs them into high-quality voice

High-quality voice reconstruction

Multilingual Voice Processing

Supports semantic understanding and processing of multilingual voice

Cross-language voice applications
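To give the compression use case a rough sense of scale, the sketch below compares the bitrate of a 50-token-per-second stream with uncompressed 16-bit PCM at 16 kHz (the input rate noted in the Quick Start). The codebook size is not stated in this document, so `ASSUMED_CODEBOOK_SIZE` is a placeholder chosen purely for illustration; substitute the real value before drawing conclusions.

```python
import math

TOKENS_PER_SECOND = 50  # token rate stated in the model features


def token_bitrate_bps(codebook_size: int,
                      tokens_per_second: int = TOKENS_PER_SECOND) -> float:
    """Bits per second if each token is stored as a codebook index."""
    return tokens_per_second * math.log2(codebook_size)


def pcm_bitrate_bps(sample_rate_hz: int = 16_000,
                    bits_per_sample: int = 16,
                    channels: int = 1) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels


# ASSUMPTION: placeholder codebook size, not taken from this document.
ASSUMED_CODEBOOK_SIZE = 65_536

ratio = pcm_bitrate_bps() / token_bitrate_bps(ASSUMED_CODEBOOK_SIZE)
print(f"token stream: {token_bitrate_bps(ASSUMED_CODEBOOK_SIZE):.0f} bit/s")
print(f"16-bit PCM:   {pcm_bitrate_bps()} bit/s")
print(f"ratio:        {ratio:.0f}x")
```

The ratio depends only logarithmically on the codebook size, so even a rough guess gives the right order of magnitude.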

🚀 XCodec2: A Speech Tokenizer

XCodec2 is a speech tokenizer that provides single vector quantization, a token rate of 50 tokens per second, and support for multilingual speech semantics. It enables high-quality speech reconstruction, addressing key needs in speech processing.

🚀 Quick Start

Installation

Install xcodec2 before using it, for example in a fresh conda environment:

conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install xcodec2

Note: Use `xcodec2==0.1.5` for codec inference and Llasa fine-tuning. I’ve removed unnecessary dependencies, and it works fine in my testing. However, I’m not sure whether other problems may arise. If you prefer more stability, I recommend `xcodec2==0.1.3`, which exactly matches the version used during my codec training.

Usage Examples

Basic Usage

import torch
import soundfile as sf

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"

# Load the pretrained codec, switch to inference mode, and move it to the GPU.
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# The codec only supports 16 kHz speech; a mono file is assumed here.
wav, sr = sf.read("test.wav")
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

with torch.no_grad():
    # Only a single input is supported. For batch inference, see the
    # repository linked under Other Resources.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
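Because the example above expects 16 kHz speech with shape (1, T), it can help to validate the audio before encoding. The following is a minimal sketch of such a check; `prepare_waveform` is a hypothetical helper and not part of the xcodec2 package. It relies on soundfile returning multi-channel audio as an array of shape (frames, channels), and it only rejects other sample rates rather than resampling them.

```python
import numpy as np

EXPECTED_SR = 16_000  # the Quick Start comments state that only 16 kHz speech is supported


def prepare_waveform(wav: np.ndarray, sr: int) -> np.ndarray:
    """Return a float32 array of shape (1, T), ready for torch.from_numpy.

    Hypothetical helper for illustration; not part of xcodec2.
    """
    if sr != EXPECTED_SR:
        raise ValueError(
            f"expected {EXPECTED_SR} Hz audio, got {sr} Hz; resample first"
        )
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim == 2:
        # soundfile yields (frames, channels); average channels to get mono
        wav = wav.mean(axis=1)
    elif wav.ndim != 1:
        raise ValueError(f"unexpected waveform shape {wav.shape}")
    return wav[np.newaxis, :]
```

In the Quick Start code, this would stand in for the manual reshaping, for instance `wav_tensor = torch.from_numpy(prepare_waveform(wav, sr))`.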

✨ Features

Single Vector Quantization
50 Tokens per Second
Multilingual Speech Semantic Support and High-Quality Speech Reconstruction
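The 50 tokens per second figure, combined with the 16 kHz input rate mentioned in the Quick Start, implies that each token covers about 320 samples (20 ms) of audio. The sketch below turns that arithmetic into a rough token-count estimate. The function names are illustrative, and rounding up for a partial final frame is an assumption: the actual codec may pad or trim differently, so treat the result as approximate.

```python
SAMPLE_RATE_HZ = 16_000   # input rate stated in the Quick Start comments
TOKENS_PER_SECOND = 50    # token rate stated in the feature list


def samples_per_token() -> int:
    """Number of input samples covered by one token."""
    return SAMPLE_RATE_HZ // TOKENS_PER_SECOND


def approx_token_count(num_samples: int) -> int:
    """Estimate the token count for a clip of num_samples samples.

    ASSUMPTION: a partial final frame is rounded up to one token.
    """
    hop = samples_per_token()
    return -(-num_samples // hop)  # ceiling division


ten_seconds = 10 * SAMPLE_RATE_HZ
print(samples_per_token())              # 320
print(approx_token_count(ten_seconds))  # 500
```

An estimate like this is useful for budgeting sequence length when the tokens are fed to a language model.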

📚 Documentation

Paper

LLaSA: Scaling Train Time and Inference Time Compute for LLaMA based Speech Synthesis
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model (AAAI 2025, xcodec 1.0)

Updates

Update (2025-02-13): Added Llasa fine-tuning instructions.
Update (2025-02-07): Our paper has been released!

Other Resources

If you want to train your own xcodec2, run batch inference, or extract codes at large scale, the code is released [here](https://github.com/zhenye234/X-Codec-2.0).

📄 License

This project is licensed under the cc-by-nc-4.0 license.
