SpeechT5 (TTS task) for Japanese
A fine-tuned SpeechT5 model for Japanese speech synthesis (text-to-speech) that produces high-quality, speaker-independent voice output.
Quick Start
To use this model, install the required packages and download the modified tokenizer code.
Install Requirements
```bash
pip install transformers sentencepiece pyopenjtalk
```
Download Modified Code
```bash
curl -O https://huggingface.co/esnya/japanese_speecht5_tts/resolve/main/speecht5_openjtalk_tokenizer.py
```
Features
- Fine-Tuned for Japanese: The model is fine-tuned on the JVS dataset for Japanese speech synthesis.
- Unique Speaker Embeddings: Uses a 16-dimensional speaker embedding vector crafted from the JVS dataset, aiming for speaker-independent voice quality (see the sketch below).
- Modified Tokenizer: Powered by OpenJTalk, the modified tokenizer extracts and retains non-phonation characters separately for more accurate text-to-speech conversion.
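Because the 16-dimensional embedding is not tied to any enrolled speaker, sampling a random vector yields an arbitrary voice. A minimal sketch, mirroring the uniform [-1, 1] sampling used in the Basic Usage example below (the fixed-seed variant is only an illustration of how to reproduce a voice):

```python
import numpy as np
import torch

# A random 16-dimensional speaker embedding; each draw selects a
# different, unnamed voice. Shape (1, 16) = (batch, embedding_dim).
speaker_embeddings = torch.FloatTensor(np.random.uniform(-1, 1, (1, 16)))

# Fixing the seed makes the same voice reproducible across runs.
rng = np.random.default_rng(42)
fixed_voice = torch.FloatTensor(rng.uniform(-1, 1, (1, 16)))
```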
Installation
Installation consists of two steps: installing the required Python packages and downloading the modified tokenizer code.
```bash
pip install transformers sentencepiece pyopenjtalk
curl -O https://huggingface.co/esnya/japanese_speecht5_tts/resolve/main/speecht5_openjtalk_tokenizer.py
```
Usage Examples
Basic Usage
```python
import numpy as np
import soundfile
import torch
from transformers import (
    SpeechT5FeatureExtractor,
    SpeechT5ForTextToSpeech,
    SpeechT5HifiGan,
    SpeechT5Processor,
)

from speecht5_openjtalk_tokenizer import SpeechT5OpenjtalkTokenizer

model_name = "esnya/japanese_speecht5_tts"

with torch.no_grad():
    # Load the fine-tuned model and the modified OpenJTalk-based tokenizer.
    model = SpeechT5ForTextToSpeech.from_pretrained(
        model_name, device_map="cuda", torch_dtype=torch.bfloat16
    )

    tokenizer = SpeechT5OpenjtalkTokenizer.from_pretrained(model_name)
    feature_extractor = SpeechT5FeatureExtractor.from_pretrained(model_name)
    processor = SpeechT5Processor(feature_extractor, tokenizer)

    # The HiFi-GAN vocoder converts the generated spectrogram into a waveform.
    vocoder = SpeechT5HifiGan.from_pretrained(
        "microsoft/speecht5_hifigan", device_map="cuda", torch_dtype=torch.bfloat16
    )

    input = "吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"
    input_ids = processor(text=input, return_tensors="pt").input_ids.to(model.device)

    # A random 16-dimensional vector selects an arbitrary, unnamed voice.
    speaker_embeddings = np.random.uniform(-1, 1, (1, 16))
    speaker_embeddings = torch.FloatTensor(speaker_embeddings).to(
        device=model.device, dtype=model.dtype
    )

    waveform = model.generate_speech(
        input_ids,
        speaker_embeddings,
        vocoder=vocoder,
    )

    # Normalize to [-1, 1] and write a mono WAV at the vocoder's sampling rate.
    waveform = waveform / waveform.abs().max()
    waveform = waveform.reshape(-1).cpu().float().numpy()

    soundfile.write(
        "output.wav",
        waveform,
        vocoder.config.sampling_rate,
    )
```
Documentation
Model Description
See the original model card. The modified code is licensed under the MIT License.
Background
The development of this model was motivated by the lack of a Japanese model for SpeechT5 TTS. The grapheme-to-phoneme (g2p) functionality of OpenJTalk (via pyopenjtalk) made it possible to build a vocabulary similar to that of the English models. The modifications were applied mainly to the tokenizer to achieve more accurate text-to-speech conversion.
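For illustration, pyopenjtalk exposes a g2p function that converts raw Japanese text into a space-separated phoneme sequence of the kind this vocabulary is built on (the exact output may vary slightly between pyopenjtalk versions):

```python
import pyopenjtalk

# Grapheme-to-phoneme conversion of Japanese text.
print(pyopenjtalk.g2p("こんにちは"))  # e.g. "k o N n i ch i w a"
```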
Limitations
When multiple sentences are passed as a single input, the later parts may contain extended silences. As a temporary workaround, split the input and generate each sentence individually, for example:
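A minimal sketch of that workaround, assuming model, processor, speaker_embeddings, and vocoder are already set up as in the Basic Usage example (splitting on the Japanese full stop 。 is a deliberate simplification):

```python
import torch

text = "吾輩は猫である。名前はまだ無い。"

# Naive sentence split on the Japanese full stop.
sentences = [s + "。" for s in text.split("。") if s]

waveforms = []
with torch.no_grad():
    for sentence in sentences:
        input_ids = processor(text=sentence, return_tensors="pt").input_ids.to(model.device)
        waveforms.append(
            model.generate_speech(input_ids, speaker_embeddings, vocoder=vocoder)
        )

# Join the per-sentence clips into a single waveform.
waveform = torch.cat(waveforms).reshape(-1)
```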
License
The model inherits the license of JVS Corpus.
See also
- Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, and Hiroshi Saruwatari, "JVS corpus: free Japanese multi-speaker voice corpus," arXiv preprint arXiv:1908.06248, Aug. 2019.