
🚀 Kartoffel-3B (Based on Orpheus-3B) - Synthetic
This is a German text-to-speech (TTS) model family based on Orpheus-3B. It offers high-quality speech synthesis with support for multiple speakers and various emotional expressions.

🚀 Quick Start
The code example under Usage Examples shows how to use the Kartoffel-3B synthetic model for text-to-speech synthesis.
✨ Features
- Multiple Speakers: The model can generate speech using various speaker identities from predefined speakers.
- Varied Expressions: The model can generate speech with different emotional tones and expressions based on the input text.
📦 Installation
The model card lists no installation steps. The usage example below relies on the `torch`, `transformers`, `snac`, and `soundfile` Python packages.
💻 Usage Examples
Basic Usage
import soundfile as sf
import torch
from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SebastianBodza/Kartoffel_Orpheus-3B_german_synthetic-v0.1"

# Load the language model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Load the SNAC codec that turns generated audio codes into a waveform.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")

chosen_voice = "Martin"
prompts = [
    'Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand.',
]


def process_single_prompt(prompt, chosen_voice):
    # Prepend the speaker name unless the prompt already contains it.
    if chosen_voice == "in_prompt" or chosen_voice == "":
        full_prompt = prompt
    else:
        full_prompt = f"{chosen_voice}: {prompt}"

    # Wrap the tokenized text in the special start and end tokens.
    start_token = torch.tensor([[128259]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)
    input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    input_ids = modified_input_ids.to("cuda")
    attention_mask = torch.ones_like(input_ids)

    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=4000,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
        num_return_sequences=1,
        eos_token_id=128258,
        use_cache=True,
    )

    # Keep only the tokens after the last occurrence of token 128257.
    token_to_find = 128257
    token_to_remove = 128258
    token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
    if len(token_indices[1]) > 0:
        last_occurrence_idx = token_indices[1][-1].item()
        cropped_tensor = generated_ids[:, last_occurrence_idx + 1 :]
    else:
        cropped_tensor = generated_ids

    # Drop token 128258 (the EOS id passed to generate).
    masked_row = cropped_tensor[0][cropped_tensor[0] != token_to_remove]

    # Trim to a whole number of 7-token frames.
    row_length = masked_row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = masked_row[:new_length]

    # Shift token IDs by 128266 to get audio codes as plain Python ints.
    code_list = (trimmed_row - 128266).tolist()
    return code_list


def redistribute_codes(code_list):
    # Split each 7-code frame into the three SNAC layers (1, 2, and 4 codes).
    layer_1 = []
    layer_2 = []
    layer_3 = []
    # Only complete frames are used, so an untrimmed list cannot overrun.
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))
    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    codes = [c.to("cuda") for c in codes]
    audio_hat = snac_model.decode(codes)
    return audio_hat


for i, prompt in enumerate(prompts):
    print(f"Processing prompt {i + 1}/{len(prompts)}")
    with torch.no_grad():
        code_list = process_single_prompt(prompt, chosen_voice)
        samples = redistribute_codes(code_list)
    # Write the decoded waveform as a 24 kHz WAV file.
    audio_numpy = samples.detach().squeeze().to("cpu").numpy()
    sf.write(f"output_{i}.wav", audio_numpy, 24000)
    print(f"Saved output_{i}.wav")
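The de-interleaving done in `redistribute_codes` may be easier to follow on a toy input. The sketch below repeats only that arithmetic in plain Python, without torch or SNAC. `split_frame_codes` is a hypothetical helper name, and the offset step of 4096 and the 1/2/4 split per frame are read off the example code above rather than taken from separate documentation.

```python
CODEBOOK_SIZE = 4096  # offset step used in the example code above
FRAME_SIZE = 7        # codes per frame in the example code above


def split_frame_codes(code_list):
    """Split a flat list of codes into three layers, frame by frame.

    Incomplete trailing frames are ignored.
    """
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list) // FRAME_SIZE):
        frame = code_list[FRAME_SIZE * i : FRAME_SIZE * (i + 1)]
        # Position k in a frame carries an offset of k * CODEBOOK_SIZE.
        layer_1.append(frame[0])
        layer_2.append(frame[1] - 1 * CODEBOOK_SIZE)
        layer_3.append(frame[2] - 2 * CODEBOOK_SIZE)
        layer_3.append(frame[3] - 3 * CODEBOOK_SIZE)
        layer_2.append(frame[4] - 4 * CODEBOOK_SIZE)
        layer_3.append(frame[5] - 5 * CODEBOOK_SIZE)
        layer_3.append(frame[6] - 6 * CODEBOOK_SIZE)
    return layer_1, layer_2, layer_3


# One toy frame: every position holds the code 10 plus its positional offset.
toy_frame = [10 + k * CODEBOOK_SIZE for k in range(FRAME_SIZE)]
print(split_frame_codes(toy_frame))
```

Each frame contributes one code to the first layer, two to the second, and four to the third, which is why the example trims the generated sequence to a multiple of seven before decoding.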
Advanced Usage
The basic usage code already covers most of the key steps. For more advanced usage, you can adjust the parameters of the `model.generate` call, such as `max_new_tokens`, `temperature`, and `top_p`, to get different speech generation results.
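One way to experiment with these settings is to keep the defaults from the basic example in a dictionary and override individual values. This is a minimal sketch: `DEFAULT_GENERATION_KWARGS` and `build_generation_kwargs` are hypothetical names, the default values are copied from the basic usage code, and the override values are arbitrary illustrations rather than recommended settings.

```python
# Defaults copied from the model.generate call in the basic usage example.
DEFAULT_GENERATION_KWARGS = {
    "max_new_tokens": 4000,
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
    "num_return_sequences": 1,
    "eos_token_id": 128258,
    "use_cache": True,
}


def build_generation_kwargs(**overrides):
    """Return the default settings with the given overrides applied."""
    unknown = set(overrides) - set(DEFAULT_GENERATION_KWARGS)
    if unknown:
        # Reject typos such as "temprature" instead of silently ignoring them.
        raise ValueError(f"Unknown generation parameters: {sorted(unknown)}")
    return {**DEFAULT_GENERATION_KWARGS, **overrides}


# Illustrative values only: a higher temperature and a shorter token budget.
generation_kwargs = build_generation_kwargs(temperature=0.8, max_new_tokens=2000)
print(generation_kwargs)
```

The result would then be passed along as `model.generate(input_ids=input_ids, attention_mask=attention_mask, **generation_kwargs)`. In general, a lower `temperature` makes sampling more deterministic, and `max_new_tokens` caps how many tokens are generated.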
📚 Documentation
Model Overview
This is a German text-to-speech (TTS) model family based on Orpheus-3B.
Two main versions are available:
- Kartoffel-3B-Natural: Fine-tuned primarily on natural human speech recordings, aiming for realistic voices. The dataset is based on high-quality German audio, including permissively licensed podcasts, lectures, and other open educational resources (OER) that were processed with an Emilia-style pipeline.
- Kartoffel-3B-Synthetic: Fine-tuned on synthetic speech data with emotions and different outbursts. The dataset covers a diverse set of emotions with four different speakers.
This is the synthetic version: its speakers sound synthetic, but it adds support for emotions and outbursts.
Available Speakers & Expressions for the synthetic Version
Speakers
- Martin
- Luca
- Anne
- Emma
Emotions
The following emotions are supported:
- Neutral
- Happy
- Sad
- Excited
- Surprised
- Humorous
- Angry
- Calm
- Disgust
- Fear
- Proud
- Romantic
To use an emotion, add it after the speaker name using the template `[Speaker_name] - [Emotion]: [German text]`.
For example, for the speaker Martin and the emotion Sad, the correct prompt would be:
`Martin - Sad: Oh ich bin sooo traurig.`
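The template can be assembled programmatically. The sketch below is one possible helper, not part of the model's tooling: `build_prompt`, `SPEAKERS`, and `EMOTIONS` are hypothetical names, and the allowed values are copied from the speaker and emotion lists above.

```python
SPEAKERS = ("Martin", "Luca", "Anne", "Emma")
EMOTIONS = (
    "Neutral", "Happy", "Sad", "Excited", "Surprised", "Humorous",
    "Angry", "Calm", "Disgust", "Fear", "Proud", "Romantic",
)


def build_prompt(speaker, text, emotion=None):
    """Format a prompt as 'Speaker - Emotion: text' or 'Speaker: text'."""
    if speaker not in SPEAKERS:
        raise ValueError(f"Unknown speaker: {speaker!r}")
    if emotion is None:
        return f"{speaker}: {text}"
    if emotion not in EMOTIONS:
        raise ValueError(f"Unknown emotion: {emotion!r}")
    return f"{speaker} - {emotion}: {text}"


print(build_prompt("Martin", "Oh ich bin sooo traurig.", emotion="Sad"))
```

Because `process_single_prompt` in the basic usage example prepends the voice itself, a prompt built this way would be passed with `chosen_voice="in_prompt"` so that the speaker name is not added twice.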
Outbursts
The following outbursts are supported:
- haha
- ughh
- wow
- wuhuuu
- ohhh
You can either use them directly in the text or place them in tags. Make sure to use the exact spelling of the variations listed above.
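Since the spelling has to match exactly, it can help to check a prompt before synthesis. The following sketch finds the listed outbursts when they appear as whole words. `find_outbursts` is a hypothetical helper, and treating the match as case-sensitive is an assumption based on the "exact text" requirement.

```python
import re

OUTBURSTS = ("haha", "ughh", "wow", "wuhuuu", "ohhh")

# Match any listed outburst as a whole word (case-sensitive by assumption).
_OUTBURST_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(o) for o in OUTBURSTS) + r")\b"
)


def find_outbursts(text):
    """Return the supported outbursts found in text, in order of appearance."""
    return _OUTBURST_PATTERN.findall(text)


print(find_outbursts("wow, das ist ja toll! haha"))
print(find_outbursts("hahaha, das ist lustig"))
```

The second call returns an empty list because `hahaha` is not one of the listed spellings, which makes such near misses easy to spot.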
📄 License
The license of this model is llama3.2.
| Property | Details |
|---|---|
| Library Name | transformers |
| Tags | unsloth, text-to-speech, tts, german, orpheus |
| Language | de |
| Base Model | amuvarma/3b-de-pretrain, canopylabs/orpheus-3b-0.1-ft |
| License | llama3.2 |


