VoxPolska-V1-Merged-16bit Open-source Model - Free and Natural-sounding Polish Text-to-Speech

Developed by salihfurkaan

VoxPolska is an advanced model focused on Polish text-to-speech conversion, capable of generating natural, fluent, and expressive Polish speech.

Speech Synthesis

Transformers

License: Apache-2.0 (open source)

Tags: Polish speech synthesis, High-fidelity audio, Context-aware

Downloads 116

Release Time: 5/6/2025

Model Overview

VoxPolska is a Polish text-to-speech conversion model based on the Orpheus TTS architecture, optimized through LoRA fine-tuning and 16-bit quantization to transform Polish written text into high-quality speech output.
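
To make the "LoRA fine-tuning" and "merged" wording concrete, here is a minimal sketch of the usual LoRA formulation, in which a low-rank update is folded into a base weight matrix as W + (alpha / r) * B A. The shapes, values, and function names are illustrative assumptions; this is not taken from the VoxPolska training code.

```python
# Minimal sketch of merging a LoRA update into a base weight matrix.
# All shapes, values, and names here are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) without modifying the inputs."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# Toy example: a 2x3 base weight with a rank-1 adapter.
base = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # r x in  = 1 x 3
lora_b = [[0.5], [-0.5]]     # out x r = 2 x 1
merged = merge_lora(base, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

After merging, inference needs only the single combined matrix, which is why a merged checkpoint can be loaded like an ordinary model without separate adapter files.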

Model Features

Context-aware speech

Capable of capturing subtle nuances and intonations in Polish, generating natural and fluent speech.

High-fidelity audio quality

24 kHz audio output for high-quality speech synthesis.

Efficient training

Optimized model performance using LoRA fine-tuning and 16-bit quantization techniques.

Large-scale training data

Trained on 24,000+ Polish text-audio pairs.

Model Capabilities

Polish text-to-speech conversion

High-quality speech synthesis

Context-aware speech generation

Use Cases

Speech synthesis applications

Voice assistants

Providing natural and fluent speech output for Polish voice assistants.

Generates expressive Polish speech.

Audiobooks

Converting Polish text into audiobooks.

High-quality speech that preserves text emotions and intonations.

Voice navigation systems

Delivering clear voice guidance for Polish navigation systems.

Natural speech that accurately conveys navigation information.

🚀 VoxPolska: Next-Gen Polish Voice Generation

VoxPolska is a cutting-edge Polish voice generation model that uses deep learning to convert written Polish text into natural, fluent, and expressive speech.

🚀 Quick Start

You can quickly start using VoxPolska by following the example usage below.

✨ Features

Context-Aware Voice: Generates speech that captures the nuances and tone of the Polish language.
Advanced Proficiency: Showcases advanced proficiency in speech synthesis and Polish language processing.
Natural Speech Conversion: Converts written Polish text into natural, fluent, and expressive speech.
Advanced Deep Learning: Built using cutting-edge deep learning techniques for optimal performance.
State-of-the-Art Technology: Utilizes state-of-the-art technology in speech synthesis and Polish language processing.

🔧 Technical Details

Base Model: Orpheus TTS
Fine-tuning: LoRA (Low-Rank Adaptation) fine-tuning applied to optimize model performance.
Sample Rate: 24 kHz audio output for high-fidelity sound.
Training Data: Trained on 24,000+ Polish transcript and audio pairs.
Quantization: Merged 16-bit quantization
Audio Decoding: Customized layer-wise processing for audio generation
Repetition Penalty: 1.1 to avoid repetitive phrases
Gradient Checkpointing: Enabled for efficient memory usage
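
As a rough illustration of what a repetition penalty of 1.1 does during sampling, the sketch below applies the common formulation: the score of every token that has already been generated is divided by the penalty when positive and multiplied by it when negative. The function name and the toy logits are hypothetical, and the exact behavior inside the generation library is assumed rather than confirmed by this model card.

```python
# Minimal sketch of a repetition penalty applied to next-token scores.
# The function name and the toy values are illustrative assumptions.

def apply_repetition_penalty(logits, previous_token_ids, penalty=1.1):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    penalized = list(logits)
    for token_id in set(previous_token_ids):
        score = penalized[token_id]
        # Positive scores shrink; negative scores become more negative.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized

logits = [2.2, -1.0, 0.5, 3.3]
adjusted = apply_repetition_penalty(logits, previous_token_ids=[0, 1, 0], penalty=1.1)
print(adjusted)
```

A penalty of 1.0 leaves the scores unchanged; values slightly above 1.0, such as the 1.1 listed here, discourage loops while still letting a token recur when the model strongly prefers it.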

💻 Usage Examples

Basic Usage

The following example runs the model in a notebook:

!pip install snac torch transformers

import os

import torch
from IPython.display import Audio, display
from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

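# Set the Hugging Face token before any download, in case a repository requires authentication
os.environ["HF_TOKEN"] = "your huggingface token here"

tokenizer = AutoTokenizer.from_pretrained("salihfurkaan/VoxPolska-V1-Merged-16bit")
model = AutoModelForCausalLM.from_pretrained("salihfurkaan/VoxPolska-V1-Merged-16bit").to(device)

# SNAC codec that decodes the generated audio codes into a 24 kHz waveform
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device)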

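# Example prompt (roughly: "Hi, I am a large artificial intelligence language model")
prompts = [
    "Cześć, jestem dużym modelem języka sztucznej inteligencji"
]
# Optional voice name prepended to each prompt; None leaves the prompt unchanged
chosen_voice = None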

prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
all_input_ids = []
for prompt in prompts_:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    all_input_ids.append(input_ids)

start_token = torch.tensor([[128259]], dtype=torch.int64)  # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([x.shape[1] for x in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
    padding = max_length - modified_input_ids.shape[1]
    padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
    attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
    all_padded_tensors.append(padded_tensor)
    all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0).to(device)
all_attention_masks = torch.cat(all_attention_masks, dim=0).to(device)

generated_ids = model.generate(
    input_ids=all_padded_tensors,
    attention_mask=all_attention_masks,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache=True
)

token_to_find = 128257
token_to_remove = 128258
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
    cropped_tensor = generated_ids

processed_rows = []
for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

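code_lists = []
for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7  # keep whole 7-token frames only
    trimmed_row = row[:new_length]
    # Shift token IDs into the audio code range and convert to plain Python ints
    trimmed_row = (trimmed_row - 128266).tolist()
    code_lists.append(trimmed_row)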

def redistribute_codes(code_list):
    # Each 7-token frame holds 1 code for layer 1, 2 for layer 2, and 4 for layer 3
    layer_1 = []
    layer_2 = []
    layer_3 = []
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))

    codes = [
        torch.tensor(layer_1).unsqueeze(0).to(device),
        torch.tensor(layer_2).unsqueeze(0).to(device),
        torch.tensor(layer_3).unsqueeze(0).to(device)
    ]
    audio_hat = snac_model.decode(codes)
    return audio_hat

my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)
    my_samples.append(samples)

if len(prompts) != len(my_samples):
    raise Exception("Number of prompts and samples do not match")
else:
    for i in range(len(my_samples)):
        print(prompts[i])
        samples = my_samples[i]
        display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))

del my_samples, samples
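
Outside a notebook, the waveform can be written to disk instead of played inline. The helper below is a minimal sketch using only the Python standard library; `write_wav` and the file name `voxpolska_demo.wav` are hypothetical names chosen for illustration, and a synthetic tone stands in for real model output so the snippet runs on its own.

```python
import math
import struct
import wave

def write_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against out-of-range values
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)  # 24 kHz, matching the model output
        wav_file.writeframes(bytes(frames))

# Stand-in for model output: one second of a 440 Hz tone at 24 kHz.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
write_wav("voxpolska_demo.wav", tone)
```

With the notebook code above, the same helper could be called before the final `del` statement, for example as `write_wav("voxpolska_demo.wav", samples.detach().squeeze().to("cpu").tolist())`, assuming `samples` holds a single mono waveform.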

You can create a Hugging Face access token in your Hugging Face account settings.

📚 Documentation

Contact and Support

For questions, suggestions, and feedback, please open an issue on Hugging Face. You can also reach the author via LinkedIn.

Model Misuse

Do not use this model for impersonation without consent, misinformation or deception (including fake news or fraudulent calls), or any illegal or harmful activity. By using this model, you agree to follow all applicable laws and ethical guidelines.

Citation

@misc{voxpolska_v1_merged_16bit,
  title={salihfurkaan/VoxPolska-V1-Merged-16bit},
  author={Salih Furkan Erik},
  year={2025},
  url={https://huggingface.co/salihfurkaan/VoxPolska-V1-Merged-16bit/}
}

📄 License

This project is licensed under the Apache-2.0 license.
