SpeechT5 (TTS task) for Japanese
A fine-tuned SpeechT5 model for Japanese speech synthesis (text-to-speech) that produces high-quality, speaker-independent voice output.
Quick Start
To use this model, install the required packages and download the modified tokenizer code.
Install Requirements
```bash
pip install transformers sentencepiece pyopenjtalk
```
Download Modified Code
```bash
curl -O https://huggingface.co/esnya/japanese_speecht5_tts/resolve/main/speecht5_openjtalk_tokenizer.py
```
Features
- Fine-Tuned for Japanese: The model is fine-tuned on the JVS dataset for Japanese speech synthesis.
- Unique Speaker Embeddings: Uses a 16-dimensional speaker embedding vector crafted from the JVS dataset, aiming for speaker-independent voice quality (see the sketch below).
- Modified Tokenizer: Powered by OpenJTalk, the modified tokenizer extracts and retains non-phonation characters separately for more accurate text-to-speech conversion.
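Because the 16-dimensional embedding is not tied to any enrolled speaker, sampling a random vector yields an arbitrary voice. A minimal sketch, mirroring the uniform [-1, 1] sampling used in the Basic Usage example below (the fixed-seed variant is only an illustration of how to reproduce a voice):

```python
import numpy as np
import torch

# A random 16-dimensional speaker embedding; each draw selects a
# different, unnamed voice. Shape (1, 16) = (batch, embedding_dim).
speaker_embeddings = torch.FloatTensor(np.random.uniform(-1, 1, (1, 16)))

# Fixing the seed makes the same voice reproducible across runs.
rng = np.random.default_rng(42)
fixed_voice = torch.FloatTensor(rng.uniform(-1, 1, (1, 16)))
```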
Installation
Installation consists of two steps: installing the required Python packages and downloading the modified tokenizer code.
```bash
pip install transformers sentencepiece pyopenjtalk
curl -O https://huggingface.co/esnya/japanese_speecht5_tts/resolve/main/speecht5_openjtalk_tokenizer.py
```
Usage Examples
Basic Usage
```python
import numpy as np
import soundfile
import torch
from transformers import (
    SpeechT5FeatureExtractor,
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    SpeechT5Processor,
)

from speecht5_openjtalk_tokenizer import SpeechT5OpenjtalkTokenizer

model_name = "esnya/japanese_speecht5_tts"

with torch.no_grad():
    # Load the fine-tuned model and the modified OpenJTalk-based tokenizer.
    model = SpeechT5ForTextToSpeech.from_pretrained(
        model_name, device_map="cuda", torch_dtype=torch.bfloat16
    )

    tokenizer = SpeechT5OpenjtalkTokenizer.from_pretrained(model_name)
    feature_extractor = SpeechT5FeatureExtractor.from_pretrained(model_name)
    processor = SpeechT5Processor(feature_extractor, tokenizer)

    # The HiFi-GAN vocoder converts the generated spectrogram into a waveform.
    vocoder = SpeechT5HifiGan.from_pretrained(
        "microsoft/speecht5_hifigan", device_map="cuda", torch_dtype=torch.bfloat16
    )

    input = "吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"
    input_ids = processor(text=input, return_tensors="pt").input_ids.to(model.device)

    # A random 16-dimensional vector selects an arbitrary, unnamed voice.
    speaker_embeddings = np.random.uniform(-1, 1, (1, 16))
    speaker_embeddings = torch.FloatTensor(speaker_embeddings).to(
        device=model.device, dtype=model.dtype
    )

    waveform = model.generate_speech(
        input_ids,
        speaker_embeddings,
        vocoder=vocoder,
    )

    # Normalize to [-1, 1] and write a mono WAV at the vocoder's sampling rate.
    waveform = waveform / waveform.abs().max()
    waveform = waveform.reshape(-1).cpu().float().numpy()

    soundfile.write(
        "output.wav",
        waveform,
        vocoder.config.sampling_rate,
    )
```
Documentation
Model Description
See the original model card. The modified code is licensed under the MIT License.
Background
The development of this model was motivated by the lack of a Japanese model for SpeechT5 TTS. The grapheme-to-phoneme (g2p) functionality of OpenJTalk (via pyopenjtalk) made it possible to build a vocabulary similar to that of the English models. The modifications were applied mainly to the tokenizer to achieve more accurate text-to-speech conversion.
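For illustration, pyopenjtalk exposes a g2p function that converts raw Japanese text into a space-separated phoneme sequence of the kind this vocabulary is built on (the exact output may vary slightly between pyopenjtalk versions):

```python
import pyopenjtalk

# Grapheme-to-phoneme conversion of Japanese text.
print(pyopenjtalk.g2p("こんにちは"))  # e.g. "k o N n i ch i w a"
```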
Limitations
When multiple sentences are passed as a single input, the later parts may contain extended silences. As a temporary workaround, split the input and generate each sentence individually, for example:
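A minimal sketch of that workaround, assuming model, processor, speaker_embeddings, and vocoder are already set up as in the Basic Usage example (splitting on the Japanese full stop 。 is a deliberate simplification):

```python
import torch

text = "吾輩は猫である。名前はまだ無い。"

# Naive sentence split on the Japanese full stop.
sentences = [s + "。" for s in text.split("。") if s]

waveforms = []
with torch.no_grad():
    for sentence in sentences:
        input_ids = processor(text=sentence, return_tensors="pt").input_ids.to(model.device)
        waveforms.append(
            model.generate_speech(input_ids, speaker_embeddings, vocoder=vocoder)
        )

# Join the per-sentence clips into a single waveform.
waveform = torch.cat(waveforms).reshape(-1)
```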
License
The model inherits the license of JVS Corpus.
See also
- Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, and Hiroshi Saruwatari, "JVS corpus: free Japanese multi-speaker voice corpus," arXiv preprint arXiv:1908.06248, Aug. 2019.