tts_ru_free_hf_vits_low_multispeaker: an open-source multi-speaker text-to-speech model for Russian

Developed by utrobinmv

A Russian text-to-speech model supporting multiple speakers, capable of directly processing plain text with punctuation without prior conversion to phonemes.

Task: Speech Synthesis
Library: Transformers
License: Apache-2.0
Tags: Russian TTS, multi-speaker support, small parameter model
Downloads: 1,021
Release date: 4/28/2024

Model Overview

This model offers two speaker voices (female and male) and processes plain Russian text directly. Stress annotation of the input is recommended for best results.

Model Features

Multi-speaker support

Provides two speaker voice options: 0 - female voice, 1 - male voice

Direct text processing

Can directly process plain text with punctuation without prior conversion to phonemes

Lightweight model

The model has only 15.1 million parameters, with low resource consumption

Stress annotation support

Accepts stress-annotated text, which improves generation quality; the ruaccent library is recommended for adding the annotation

Model Capabilities

Russian Text-to-Speech

Multi-speaker speech generation

Direct plain text processing

Use Cases

Speech synthesis applications

Audiobook generation

Convert Russian text into natural speech for audiobook production

Can generate speech with distinct speaker characteristics

Voice assistant

Provide speech synthesis capabilities for Russian voice assistants

Supports switching between male and female voices to enhance user experience

Assistive technology

Visual impairment assistance

Convert Russian text into speech to help visually impaired individuals access information

Provides clear and natural speech output

🚀 Text to Speech Russian free multispeaker model

This is a multi-speaker text-to-speech model for the Russian language. It processes plain text with punctuation directly and needs no text-to-phoneme conversion.

🚀 Quick Start

The model takes plain text with punctuation as input; no prior conversion of the text into phonemes is required. It provides two voices, selected by speaker ID: 0 for the woman, 1 for the man.
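The speaker IDs above can be wrapped in a small lookup so that calling code uses voice names rather than bare integers. This is a minimal sketch: `SPEAKER_IDS` and `resolve_speaker` are hypothetical names introduced for illustration and are not part of the model or of any library.

```python
# Speaker IDs as stated in the model description: 0 is the woman, 1 is the man.
# SPEAKER_IDS and resolve_speaker are hypothetical helpers, not a library API.
SPEAKER_IDS = {"woman": 0, "man": 1}


def resolve_speaker(voice: str) -> int:
    """Map a voice name to its integer ID; raise ValueError for unknown names."""
    key = voice.strip().lower()
    if key not in SPEAKER_IDS:
        raise ValueError(
            f"unknown voice {voice!r}; expected one of {sorted(SPEAKER_IDS)}"
        )
    return SPEAKER_IDS[key]


print(resolve_speaker("woman"))  # 0
print(resolve_speaker("Man"))    # 1
```

The returned integer is what the usage examples pass as `speaker_id` (PyTorch) or `sid` (ONNX).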

The size of the model is only 15.1 million parameters.

The model accepts lowercase text.

For better generation quality, we recommend marking stress in the text by placing a "+" immediately before each stressed vowel.

We recommend using the "ruaccent" library to place the stress marks automatically.
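When stress marks are added by hand rather than by ruaccent, a quick check can catch a "+" that does not sit in front of a vowel. The sketch below lowercases the text and reports such positions. The helper names are hypothetical, and the rule "every + must be followed by a Russian vowel" is an assumption drawn from the recommendation above, not a documented requirement of the tokenizer.

```python
# Lowercase Russian vowels; a "+" is assumed to belong directly before one of them.
RUSSIAN_VOWELS = set("аеёиоуыэюя")


def prepare_text(text: str) -> str:
    """Lowercase the input, since the model works on lowercase text."""
    return text.lower()


def misplaced_stress_marks(text: str) -> list[int]:
    """Return the index of every "+" that is not directly followed by a vowel."""
    lowered = text.lower()
    return [
        i
        for i, ch in enumerate(lowered)
        if ch == "+"
        and (i + 1 >= len(lowered) or lowered[i + 1] not in RUSSIAN_VOWELS)
    ]


sample = prepare_text("Вулк+ан Ключевск+ой")
print(sample)                             # вулк+ан ключевск+ой
print(misplaced_stress_marks(sample))     # []
print(misplaced_stress_marks("вулка+н"))  # [5]
```

An empty list means every stress mark precedes a vowel; any reported index points at a "+" worth reviewing before synthesis.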

✨ Features

Multiple Speakers: Offers two voices (woman and man) for text-to-speech conversion.
Plain Text Support: Works directly on plain text with punctuation, eliminating the need for text-to-phoneme conversion.
Small Model Size: Only has 15.1 million parameters.
Accentuation Support: Accepts text with stress marks for better generation quality; the "ruaccent" library is recommended for adding them.

📦 Installation

To install "ruaccent", use:

pip install ruaccent

💻 Usage Examples

Basic Usage

To try the model without a local setup, use the Hugging Face Space:

https://huggingface.co/spaces/utrobinmv/tts_ru_free_hf_vits_low_multispeaker

Advanced Usage

Using PyTorch

from transformers import VitsModel, AutoTokenizer, set_seed
import torch
import scipy.io.wavfile  # import the submodule explicitly for wavfile.write
from ruaccent import RUAccent

device = 'cuda' #  'cpu' or 'cuda'

speaker = 0 # 0-woman, 1-man  

set_seed(555)  # make deterministic

# load model
model_name = "utrobinmv/tts_ru_free_hf_vits_low_multispeaker"

model = VitsModel.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()

# load accentizer
accentizer = RUAccent()
accentizer.load(omograph_model_size='turbo', use_dictionary=True, device=device)

# text
text = """Ночью двадцать третьего июня начал извергаться самый высокий 
действующий вулкан в Евразии - Кл+ючевской. Об этом сообщила руководитель 
Камчатской группы реагирования на вулканические извержения, ведущий 
научный сотрудник Института вулканологии и сейсмологии ДВО РАН Ольга Гирина.
«Зафиксированное ночью не просто свечение, а вершинное эксплозивное 
извержение стромболианского типа. Пока такое извержение никому не опасно: 
ни населению, ни авиации» пояснила ТАСС госпожа Гирина."""

# place stress marks ("+" before each stressed vowel)
text = accentizer.process_all(text)
print(text)
# н+очью дв+адцать тр+етьего и+юня н+ачал изверг+аться с+амый выс+окий 
# д+ействующий вулк+ан в евр+азии - ключевск+ой. об +этом сообщ+ила 
# руковод+итель камч+атской гр+уппы реаг+ирования на вулкан+ические
# изверж+ения, вед+ущий на+учный сотр+удник инстит+ута вулканол+огии
# и сейсмол+огии дво ран +ольга г+ирина. « зафикс+ированное н+очью не
# пр+осто свеч+ение, а верш+инное эксплоз+ивное изверж+ение 
# стромболи+анского т+ипа. пок+а так+ое изверж+ение ником+у не оп+асно:
# ни насел+ению, ни ави+ации » поясн+ила тасс госпож+а г+ирина.

inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs.to(device), speaker_id=speaker).waveform
    output = output.detach().cpu().numpy()
    
scipy.io.wavfile.write("tts_audio.wav", rate=model.config.sampling_rate,
                       data=output[0])

To play the audio in a Jupyter Notebook or Google Colab:

from IPython.display import Audio

Audio(output, rate=model.config.sampling_rate)
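`scipy.io.wavfile.write` stores a float32 array as a 32-bit floating-point WAV, which some players and tools do not accept. If a 16-bit PCM file is needed, the samples can be converted first. The sketch below uses only the Python standard library. The function names are hypothetical, and it assumes a mono waveform whose values lie roughly in [-1.0, 1.0], clipping anything outside that range.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Clip each float sample to [-1.0, 1.0] and scale it to a signed 16-bit integer."""
    return [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]


def write_pcm16_wav(path, samples, sampling_rate):
    """Write mono float samples to `path` as a 16-bit PCM WAV file."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)              # mono
        wav_file.setsampwidth(2)              # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)  # e.g. model.config.sampling_rate
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


print(float_to_pcm16([0.0, 0.5, -1.0, 2.0]))  # [0, 16383, -32767, 32767]
```

With the variables from the PyTorch example, a call would look like `write_pcm16_wav("tts_audio_pcm16.wav", output[0].tolist(), model.config.sampling_rate)`.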

Using ONNX

First copy the model.onnx file to the folder "tts_ru_free_hf_vits_low_multispeaker".
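Because the ONNX example depends on that file being in place, it can help to fail early with a clear message when the copy step was skipped. A minimal sketch follows; `require_onnx_model` is a hypothetical helper introduced here, not part of onnxruntime.

```python
from pathlib import Path


def require_onnx_model(folder: str, filename: str = "model.onnx") -> Path:
    """Return the path to the ONNX file, or raise FileNotFoundError if it is absent."""
    model_path = Path(folder) / filename
    if not model_path.is_file():
        raise FileNotFoundError(
            f"{model_path} not found; copy {filename} into {folder!r} first"
        )
    return model_path
```

For example, `model_path = str(require_onnx_model("tts_ru_free_hf_vits_low_multispeaker"))` could stand in for the hard-coded path before the inference session is created.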

import numpy as np
import scipy.io.wavfile  # import the submodule explicitly for wavfile.write
import onnxruntime
from ruaccent import RUAccent
from transformers import AutoTokenizer

speaker = 0 # 0-woman, 1-man

# load model
model_path = "tts_ru_free_hf_vits_low_multispeaker/model.onnx"

sess_options = onnxruntime.SessionOptions()
model = onnxruntime.InferenceSession(model_path, sess_options=sess_options)
tokenizer = AutoTokenizer.from_pretrained("utrobinmv/tts_ru_free_hf_vits_low_multispeaker")

# text
text = """Ночью двадцать третьего июня начал извергаться самый высокий 
действующий вулкан в Евразии - Кл+ючевской. Об этом сообщила руководитель 
Камчатской группы реагирования на вулканические извержения, ведущий 
научный сотрудник Института вулканологии и сейсмологии ДВО РАН Ольга Гирина.
«Зафиксированное ночью не просто свечение, а вершинное эксплозивное 
извержение стромболианского типа. Пока такое извержение никому не опасно: 
ни населению, ни авиации» пояснила ТАСС госпожа Гирина."""

# load accentizer
accentizer = RUAccent()
accentizer.load(omograph_model_size='turbo', use_dictionary=True)

# place stress marks ("+" before each stressed vowel)
text = accentizer.process_all(text)

# inference
inputs = tokenizer(text, return_tensors="np")
sid = np.array([speaker], dtype=np.int64)  # explicit int64; NumPy's default integer is 32-bit on some platforms
sampling_rate = 16000

output = model.run(
            None,
            {
                "input_ids": inputs['input_ids'],
                "attention_mask": inputs['attention_mask'],
                "sid": sid,
            },
        )[0]
        
scipy.io.wavfile.write("tts_audio.wav", rate=sampling_rate,
                       data=output[0])

To play the audio in a Jupyter Notebook or Google Colab:

from IPython.display import Audio

Audio(output, rate=sampling_rate)

📚 Documentation

Languages covered

Russian (ru_RU)

📄 License

This project is licensed under the Apache-2.0 license.
