Kartoffel_Orpheus-3B_german_synthetic-v0.1 Open-source German Text-to-Speech Model - Support for Multiple Speakers and Emotional Expressions

Kartoffel Orpheus 3B German Synthetic V0.1

Developed by SebastianBodza

A German text-to-speech (TTS) model based on Orpheus-3B, supporting multiple speakers and emotional expression.

Speech Synthesis

Transformers

Tags: #German TTS #Multi-emotion speech synthesis #Multi-speaker support

Downloads 147

Release Time: 4/5/2025

Model Overview

This German text-to-speech model generates speech with synthetic speaker voices and supports emotional expression and vocal outbursts.

Model Features

Multi-speaker support

The model can generate speech from predefined speakers with different identities.

Emotional expression

Supports various emotional expressions such as happiness, sadness, excitement, etc.

Tonal expression

Supports various interjections such as haha, ughh, wow, etc.

Model Capabilities

German text-to-speech

Multi-speaker speech generation

Emotional speech synthesis

Interjection recognition and generation

Use Cases

Speech synthesis

Audiobooks

Generate emotionally rich audiobook content.

Natural and emotionally expressive voice output.

Virtual assistants

Provide multi-speaker and emotionally responsive voice interactions for virtual assistants.

Enhance user experience with more natural interactions.

🚀 Kartoffel-3B (Based on Orpheus-3B) - Synthetic

This is a German text-to-speech (TTS) model family based on Orpheus-3B. It offers high-quality speech synthesis with support for multiple speakers and various emotional expressions.

🚀 Quick Start

The code example under Usage Examples shows how to use the Kartoffel-3B synthetic model for text-to-speech synthesis.

✨ Features

Multiple Speakers: The model can generate speech in the voice of any of its predefined speakers.
Varied Expressions: Capable of generating speech with different emotional tones and expressions based on the input text.

📦 Installation

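The original model card does not list installation steps. The usage example below requires at least the Python packages torch, transformers, snac, and soundfile, and it assumes a CUDA-capable GPU.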

💻 Usage Examples

Basic Usage

import torch
import soundfile as sf
from snac import SNAC
from transformers import AutoModelForCausalLM, AutoTokenizer


model = AutoModelForCausalLM.from_pretrained(
    "SebastianBodza/Kartoffel_Orpheus-3B_german_synthetic-v0.1",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(
    "SebastianBodza/Kartoffel_Orpheus-3B_german_synthetic-v0.1",
)

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")

chosen_voice = "Martin"

prompts = [
    'Tief im verwunschenen Wald, wo die Bäume uralte Geheimnisse flüsterten, lebte ein kleiner Gnom namens Fips, der die Sprache der Tiere verstand.',
]

def process_single_prompt(prompt, chosen_voice):
    # Prefix the text with the speaker name unless the prompt already carries one.
    if chosen_voice == "in_prompt" or chosen_voice == "":
        full_prompt = prompt
    else:
        full_prompt = f"{chosen_voice}: {prompt}"
    # Special token IDs that frame the text prompt for the model.
    start_token = torch.tensor([[128259]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)

    input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)

    input_ids = modified_input_ids.to("cuda")
    attention_mask = torch.ones_like(input_ids)

    generated_ids = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=4000,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
        num_return_sequences=1,
        eos_token_id=128258,
        use_cache=True,
    )

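    # If the 128257 marker is present, keep only the tokens after its last occurrence.
    token_to_find = 128257
    # 128258 is the end-of-sequence token (see eos_token_id above); drop it.
    token_to_remove = 128258

    token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

    if len(token_indices[1]) > 0:
        last_occurrence_idx = token_indices[1][-1].item()
        cropped_tensor = generated_ids[:, last_occurrence_idx + 1 :]
    else:
        cropped_tensor = generated_ids

    masked_row = cropped_tensor[0][cropped_tensor[0] != token_to_remove]
    # Audio codes come in frames of 7 tokens, so discard any incomplete trailing frame.
    row_length = masked_row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = masked_row[:new_length]
    # Shift token IDs into code space and convert them to plain Python ints.
    code_list = [t - 128266 for t in trimmed_row.tolist()]

    return code_list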


def redistribute_codes(code_list):
    # Each 7-token frame holds 1 code for SNAC layer 1, 2 for layer 2, and 4 for layer 3.
    # The code at frame position p carries an offset of p * 4096, which is subtracted here.
    layer_1 = []
    layer_2 = []
    layer_3 = []
    # Iterate over complete frames only.
    for i in range(len(code_list) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))

    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0),
    ]
    codes = [c.to("cuda") for c in codes]

    # Decode the three code layers into a waveform tensor.
    audio_hat = snac_model.decode(codes)
    return audio_hat


for i, prompt in enumerate(prompts):
    print(f"Processing prompt {i + 1}/{len(prompts)}")
    with torch.no_grad():
        code_list = process_single_prompt(prompt, chosen_voice)
        samples = redistribute_codes(code_list)


    audio_numpy = samples.detach().squeeze().to("cpu").numpy()
    sf.write(f"output_{i}.wav", audio_numpy, 24000)
    print(f"Saved output_{i}.wav")
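
To make the 7-token frame layout handled by redistribute_codes easier to follow, here is a minimal pure-Python sketch of the same de-interleaving on made-up code values. The helper name split_frames and the sample numbers are hypothetical; only the position-to-layer mapping and the 4096 offset per frame position are taken from the example above.

```python
# Hypothetical helper mirroring the index arithmetic in redistribute_codes.
CODEBOOK_SIZE = 4096  # offset step per frame position, as in the example above

# Frame position -> SNAC layer (1, 2 or 3), in the order used above.
POSITION_TO_LAYER = [1, 2, 3, 3, 2, 3, 3]


def split_frames(code_list):
    """De-interleave flat 7-token frames into three per-layer code lists."""
    layers = {1: [], 2: [], 3: []}
    usable = len(code_list) - len(code_list) % 7  # ignore an incomplete last frame
    for frame_start in range(0, usable, 7):
        for pos, layer in enumerate(POSITION_TO_LAYER):
            # The code at position `pos` was stored with an offset of pos * 4096.
            layers[layer].append(code_list[frame_start + pos] - pos * CODEBOOK_SIZE)
    return layers[1], layers[2], layers[3]


# One made-up frame: raw code 5 at every position, plus the positional offset.
frame = [5 + pos * CODEBOOK_SIZE for pos in range(7)]
layer_1, layer_2, layer_3 = split_frames(frame)
print(layer_1, layer_2, layer_3)  # [5] [5, 5] [5, 5, 5, 5]
```

Each frame therefore contributes one code to the first layer, two to the second, and four to the third, which is the list shape passed to snac_model.decode in the example.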

Advanced Usage

The basic usage code already covers most of the key steps. For more advanced usage, you can adjust the parameters in the model.generate function, such as max_new_tokens, temperature, top_p, etc., to get different speech generation results.
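
As a sketch of how such adjustments might be organized, the hypothetical helper below starts from the sampling settings used in the basic example and overrides selected values. The function name make_generation_kwargs and its validation rules are assumptions for illustration; only the default values come from the model.generate call above.

```python
# Defaults copied from the model.generate call in the basic usage example.
DEFAULT_GENERATION_KWARGS = {
    "max_new_tokens": 4000,
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
    "num_return_sequences": 1,
    "eos_token_id": 128258,
    "use_cache": True,
}


def make_generation_kwargs(**overrides):
    """Return a copy of the defaults with overrides applied (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULT_GENERATION_KWARGS)
    if unknown:
        raise ValueError(f"Unknown generation parameter(s): {sorted(unknown)}")
    kwargs = {**DEFAULT_GENERATION_KWARGS, **overrides}
    # Basic sanity checks (assumed rules, not taken from the model card).
    if kwargs["temperature"] <= 0:
        raise ValueError("temperature must be positive when sampling")
    if not 0 < kwargs["top_p"] <= 1:
        raise ValueError("top_p must be in (0, 1]")
    return kwargs


# Lower temperature for less random sampling, and a shorter output cap.
conservative = make_generation_kwargs(temperature=0.4, max_new_tokens=2000)
print(conservative["temperature"], conservative["max_new_tokens"])
```

The resulting dictionary could then be passed along as model.generate(input_ids=input_ids, attention_mask=attention_mask, **conservative).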

📚 Documentation

Model Overview

This is a German text-to-speech (TTS) model family based on Orpheus-3B.

Two main versions are available:

Kartoffel-3B-Natural: Fine-tuned primarily on natural human speech recordings, aiming for realistic voices. The dataset is based on high-quality German audio, including permissively licensed podcasts, lectures, and other OER (open educational resources) data, processed with an Emilia-style pipeline.
Kartoffel-3B-Synthetic: Fine-tuned on synthetic speech data that includes emotions and outbursts. The dataset covers a diverse set of emotions across 4 different speakers.

This model is the synthetic version. Its speakers sound synthetic, but it adds support for emotions and outbursts.

Available Speakers & Expressions for the Synthetic Version

Speakers

Martin
Luca
Anne
Emma

Emotions

The following emotions are supported:

Neutral
Happy
Sad
Excited
Surprised
Humorous
Angry
Calm
Disgust
Fear
Proud
Romantic

To use an emotion, add it after the speaker name in the form [Speaker_name] - [Emotion]: [German text]. For example, for the speaker Martin and the emotion Sad, the correct prompt is:

Martin - Sad: Oh ich bin sooo traurig.
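
The template can be assembled programmatically. The following is a minimal sketch: build_prompt is a hypothetical helper, and the exact-spelling check against the speaker and emotion lists is an assumption about how strictly the names must match.

```python
# Speaker and emotion names as listed in this document.
SPEAKERS = ("Martin", "Luca", "Anne", "Emma")
EMOTIONS = (
    "Neutral", "Happy", "Sad", "Excited", "Surprised", "Humorous",
    "Angry", "Calm", "Disgust", "Fear", "Proud", "Romantic",
)


def build_prompt(speaker, text, emotion=None):
    """Format a prompt as 'Speaker: text' or 'Speaker - Emotion: text' (hypothetical helper)."""
    if speaker not in SPEAKERS:
        raise ValueError(f"Unknown speaker: {speaker!r}")
    if emotion is None:
        return f"{speaker}: {text}"
    if emotion not in EMOTIONS:
        raise ValueError(f"Unknown emotion: {emotion!r}")
    return f"{speaker} - {emotion}: {text}"


print(build_prompt("Martin", "Oh ich bin sooo traurig.", emotion="Sad"))
# Martin - Sad: Oh ich bin sooo traurig.
```

Because the result already contains the speaker prefix, it can be passed to process_single_prompt from the basic example with chosen_voice set to "in_prompt", which leaves the prompt unchanged.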

Outbursts

The following outbursts are supported:

haha
ughh
wow
wuhuuu
ohhh

You can use them directly in the text or place them in tags. Be sure to use the exact spelling of the variations listed above.
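
Since the spelling has to match exactly, it can help to check a text before synthesis. The sketch below uses a hypothetical helper, find_outbursts, that reports which listed outbursts occur as whole words; treating the match as case-sensitive is an assumption, and the tag syntax is not covered because this document does not specify it.

```python
import re

# Outbursts as listed in this document.
OUTBURSTS = ("haha", "ughh", "wow", "wuhuuu", "ohhh")

# Match any listed outburst as a whole word (case-sensitive by assumption).
_OUTBURST_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, OUTBURSTS)) + r")\b")


def find_outbursts(text):
    """Return the listed outbursts found in text, in order of appearance."""
    return _OUTBURST_PATTERN.findall(text)


print(find_outbursts("Martin - Happy: haha, das ist ja wow!"))  # ['haha', 'wow']
print(find_outbursts("hahaha, ohh"))  # [] because neither spelling is on the list
```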

📄 License

The license of this model is llama3.2.

Property	Details
Library Name	transformers
Tags	unsloth, text-to-speech, tts, german, orpheus
Language	de
Base Model	amuvarma/3b-de-pretrain, canopylabs/orpheus-3b-0.1-ft
License	llama3.2
