🚀 VoxPolska: Next-Gen Polish Voice Generation
VoxPolska is a cutting-edge Polish voice generation model that leverages advanced deep-learning techniques to convert written Polish text into natural, fluent, and expressive speech.
🚀 Quick Start
You can quickly start using VoxPolska by following the example usage below.
✨ Features
- Context-Aware Voice: Generates speech that captures the nuances and tone of the Polish language.
- Advanced Proficiency: Showcases advanced proficiency in speech synthesis and Polish language processing.
- Natural Speech Conversion: Converts written Polish text into natural, fluent, and expressive speech.
- Advanced Deep Learning: Built using cutting-edge deep learning techniques for optimal performance.
- State-of-the-Art Technology: Utilizes state-of-the-art technology in speech synthesis and Polish language processing.
🔧 Technical Details
- Base Model: Orpheus TTS
- Fine-tuning: LoRA (Low-Rank Adaptation) fine-tuning applied to optimize model performance.
- Sample Rate: 24 kHz audio output for high-fidelity sound.
- Training Data: Trained on 24,000+ Polish transcript and audio pairs.
- Quantization: Merged 16-bit quantization
- Audio Decoding: Customized layer-wise processing for audio generation
- Repetition Penalty: 1.1 to avoid repetitive phrases
- Gradient Checkpointing: Enabled for efficient memory usage
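The layer-wise audio decoding listed above can be pictured with a small, dependency-free sketch. It assumes, as the usage code in this README does, that the generated audio codes arrive in frames of 7, that the code at position p in a frame carries an extra offset of p × 4096, and that each frame feeds three codec layers with 1, 2, and 4 codes. The names `split_frames` and `estimated_seconds` are hypothetical helpers, and the figure of 2048 audio samples per frame is an assumption based on the coarsest codebook rate of the SNAC 24 kHz codec (roughly 12 Hz); none of this is part of the model's API.

```python
CODEBOOK_SIZE = 4096      # offset step between frame positions (from the usage code)
CODES_PER_FRAME = 7       # 1 coarse + 2 medium + 4 fine codes per frame
SAMPLES_PER_FRAME = 2048  # assumption: hop size of the coarsest SNAC 24 kHz layer
SAMPLE_RATE = 24_000      # output sample rate stated in this README


def split_frames(codes):
    """Split a flat list of audio codes into three per-layer lists."""
    if len(codes) % CODES_PER_FRAME != 0:
        raise ValueError("expected a whole number of 7-code frames")
    layer_1, layer_2, layer_3 = [], [], []
    for start in range(0, len(codes), CODES_PER_FRAME):
        frame = codes[start:start + CODES_PER_FRAME]
        # Remove the position-dependent offset so every code is in [0, 4096).
        local = [code - pos * CODEBOOK_SIZE for pos, code in enumerate(frame)]
        layer_1.append(local[0])
        layer_2.extend([local[1], local[4]])
        layer_3.extend([local[2], local[3], local[5], local[6]])
    return layer_1, layer_2, layer_3


def estimated_seconds(num_codes):
    """Rough audio duration for a given number of generated codes."""
    frames = num_codes // CODES_PER_FRAME
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE
```

Under these assumptions, a budget of 1200 new tokens covers 171 complete frames, or about 14.6 seconds of audio, which gives a rough upper bound on the length of a single generated clip.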
💻 Usage Examples
Basic Usage
Here is example code for running the model in a notebook:
!pip install snac torch transformers

import os

import torch
from IPython.display import display, Audio
from snac import SNAC
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set the token before anything is downloaded from the Hub.
os.environ["HF_TOKEN"] = "your huggingface token here"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the text-to-audio-token model and the SNAC codec that decodes audio tokens.
tokenizer = AutoTokenizer.from_pretrained("salihfurkaan/VoxPolska-V1-Merged-16bit")
model = AutoModelForCausalLM.from_pretrained("salihfurkaan/VoxPolska-V1-Merged-16bit").to(device)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device)
# Text to synthesize; add more strings to generate a batch.
prompts = [
    "Cześć, jestem dużym modelem języka sztucznej inteligencji"
]
chosen_voice = None  # optional voice name, prepended to each prompt as "<voice>: "
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []
for prompt in prompts_:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    all_input_ids.append(input_ids)

# Wrap every prompt in the special start and end tokens.
start_token = torch.tensor([[128259]], dtype=torch.int64)
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)

all_modified_input_ids = []
for input_ids in all_input_ids:
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    all_modified_input_ids.append(modified_input_ids)

# Left-pad all prompts to the same length and build matching attention masks.
all_padded_tensors = []
all_attention_masks = []
max_length = max([x.shape[1] for x in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
    padding = max_length - modified_input_ids.shape[1]
    padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
    attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
    all_padded_tensors.append(padded_tensor)
    all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0).to(device)
all_attention_masks = torch.cat(all_attention_masks, dim=0).to(device)
# Sample audio tokens; generation stops at the EOS token (128258).
generated_ids = model.generate(
    input_ids=all_padded_tensors,
    attention_mask=all_attention_masks,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache=True
)
# Keep only the tokens after the last occurrence of token 128257, then drop EOS tokens.
token_to_find = 128257
token_to_remove = 128258
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx + 1:]
else:
    cropped_tensor = generated_ids

processed_rows = []
for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

# Trim each row to a whole number of 7-token frames and subtract the audio-token offset.
code_lists = []
for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = (trimmed_row - 128266).tolist()  # plain Python ints
    code_lists.append(trimmed_row)
# Split each 7-code frame into the three SNAC layers (1, 2, and 4 codes) and decode to audio.
def redistribute_codes(code_list):
    layer_1 = []
    layer_2 = []
    layer_3 = []
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))
    codes = [
        torch.tensor(layer_1).unsqueeze(0).to(device),
        torch.tensor(layer_2).unsqueeze(0).to(device),
        torch.tensor(layer_3).unsqueeze(0).to(device)
    ]
    audio_hat = snac_model.decode(codes)
    return audio_hat
# Decode every code list and play the result in the notebook.
my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)
    my_samples.append(samples)

if len(prompts) != len(my_samples):
    raise Exception("Number of prompts and samples do not match")
else:
    for i in range(len(my_samples)):
        print(prompts[i])
        samples = my_samples[i]
        display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))

# Free memory once playback widgets have been created.
del my_samples, samples
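The example above plays audio through the notebook widget. When running as a plain script, one option is to write the waveform to a WAV file instead. The following is a minimal sketch using only the Python standard library; `save_wav` is a hypothetical helper, and it assumes the samples are mono floats roughly in the range [-1.0, 1.0] at the 24 kHz rate used above (values outside that range are clipped).

```python
import struct
import wave


def save_wav(path, samples, sample_rate=24_000):
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

With a decoded tensor from the model, a call could look like `save_wav("voxpolska.wav", samples.detach().squeeze().cpu().tolist())`. Converting to a Python list keeps the helper free of third-party dependencies, at the cost of speed for long clips.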
You can get your Hugging Face token from the Access Tokens page in your Hugging Face account settings.
📚 Documentation
Contact and Support
For questions, suggestions, and feedback, please open an issue on HuggingFace. You can also reach the author via LinkedIn.
Model Misuse
Do not use this model for impersonation without consent, misinformation or deception (including fake news or fraudulent calls), or any illegal or harmful activity. By using this model, you agree to follow all applicable laws and ethical guidelines.
Citation
@misc{erik2025voxpolska,
  title={salihfurkaan/VoxPolska-V1-Merged-16bit},
  author={Salih Furkan Erik},
  year={2025},
  url={https://huggingface.co/salihfurkaan/VoxPolska-V1-Merged-16bit/}
}
📄 License
This project is licensed under the Apache-2.0 license.