🚀 Hypa_Orpheus-3b-0.1-ft (merged 16-bit)
This is a merged, 16-bit version of canopylabs/orpheus-3b-0.1-ft after fine-tuning, offering memory-efficient inference. It was optimized with Unsloth and LoRA for expressive, multilingual text-to-speech (TTS), with a particular focus on low-resource African languages. The model supports:
- Text-to-speech generation
- Speech synthesis for under-represented accents
- Voice cloning and emotional synthesis
- Multilingual, low-resource speech AI research
📚 Detailed Documentation
Model Overview
The model was trained on a parallel text-speech dataset containing over 300 hours (75k samples) of Nigerian-accented and low-resource-language audio (Igbo, Yoruba, Hausa). A key portion of the dataset comes from AfroVoices transcriptions of real-world YouTube data (labeled as random speakers, roughly 100+ hours). To preserve and strengthen multilingual ability while avoiding catastrophic forgetting, we also included synthetic speech-text data sampled from the original eight Orpheus voices using the default emotion prompts. The final training set additionally introduces new speakers, such as:
- Eniola (40 hrs) – female, bold, clear
- Moyo (40 hrs) – female, professional, articulate
- Lovelyn (35 hrs) – female, warm, shy
- Precious (30 hrs) – female, friendly, gentle
The model achieves state-of-the-art performance on low-resource, multilingual TTS tasks across African languages (see the training statistics below).
Base Model Details
The default Orpheus-TTS model released by Canopy Labs supports the following voices and emotions:
- Voices: `tara`, `leah`, `jess`, `leo`, `dan`, `mia`, `zac`, and `zoe`.
- Emotions: `<laugh>`, `<chuckle>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, and `<gasp>`.
Through the generation and addition of synthetic data, our fine-tuned model retains these voices and emotions as well. For more information on the voices and emotions, see the default model's card page.
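For reference, prompts follow the same "voice-prefix" pattern used in the usage examples further down this card, with emotion tags embedded directly in the text. A minimal sketch (the sentence itself is only an illustration):

# Prompt format: "<voice>: <text>", with optional inline emotion tags such as <sigh> or <laugh>.
voice = "Eniola"  # any retained or newly added voice
text = "I finally finished the recipe <sigh> after three attempts."  # illustrative sentence only
prompt = f"{voice}: {text}"
print(prompt)  # Eniola: I finally finished the recipe <sigh> after three attempts.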
Sample Generations
🎧 Listen to samples generated by Hypa Orpheus-TTS
| Text Input | Audio Output | Language | Voice |
| --- | --- | --- | --- |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Emmanuel |
| Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya? | | Hausa | Emmanuel |
| Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya? | | Hausa | Eniola |
| Èmi máa se oúnjẹ fún àwọn àlejò l'ọ́la mo sì nílò láti mọ bí wọn ti ńṣe aioli. Ṣe o lè fún mi ni àwọn ìlànà ìdáná ẹlẹ́sẹẹsẹ? | | Yoruba | Eniola |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Eniola |
| M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka? | | Igbo | Eniola |
| M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka? | | Igbo | Lovelyn |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Lovelyn |
🔧 Technical Details
Training Overview
- Base model: canopylabs/orpheus-3b-0.1-ft
- Training engine: Unsloth + LoRA
- LoRA configuration: r = 1024, alpha = 1024, dropout = 0.0, adaptation of all attention + FFN layers (a hedged configuration sketch follows the loss table below)
- Quantization: 4-bit (bnb) during training; the final merged model is memory-efficient
- Total steps: 18,014 (1 epoch)
- Batch size: 1 × 4 (gradient accumulation)
- GPU: A100 40GB (peak VRAM usage ~55%)
| Step | Training Loss | Validation Loss |
| --- | --- | --- |
| 5,000 | 3.9496 | 3.8790 |
| 10,000 | 3.8863 | 3.79497 |
| 15,000 | 3.8544 | 3.75323 |
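For readers who want to reproduce a similar setup, the sketch below shows how the listed hyperparameters might be wired together with Unsloth. It is a hedged sketch, not the official training script: the target-module names are assumed from the underlying Llama architecture, and dataset preparation is omitted.

# Hedged sketch of the LoRA setup described above (not the released training script).
from unsloth import FastLanguageModel

# Load the base model in 4-bit (bnb), as used during training.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="canopylabs/orpheus-3b-0.1-ft",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA over all attention + FFN projections; module names are assumptions based on Llama.
model = FastLanguageModel.get_peft_model(
    model,
    r=1024,
    lora_alpha=1024,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Training then ran for 1 epoch (18,014 steps) with per-device batch size 1 and
# gradient accumulation 4 on a single A100 40GB; trainer wiring and dataset
# preparation are not shown because the exact script has not been released.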
Dataset Overview
- Sources:
  - ✅ Manually aligned YouTube transcriptions (i.e., the random-speaker data)
  - ✅ Synthetic speech generated with Orpheus TTS
  - ✅ Parallel text-audio pairs in African-accented English, Igbo, Yoruba, and Hausa
- Total duration: 300+ hours (multi-accent)
- Key speakers: 45+ unique voices (see the speaker distribution chart below)
We plan to open-source the full dataset soon, as we did with the Hypa_Fleurs project.
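Until then, the exact record schema is not public. Purely as an illustration, a single parallel text-speech entry could be organized along these lines; every field name and value below is an assumption, not the released format.

# Hypothetical layout of one parallel text-speech record (all field names are assumptions).
sample = {
    "speaker": "Eniola",                          # one of the 45+ voices
    "language": "yoruba",                         # english / igbo / yoruba / hausa
    "text": "Èmi máa se oúnjẹ fún àwọn àlejò ...",  # transcription
    "audio_path": "audio/eniola/clip_000123.wav",   # 24 kHz waveform on disk
    "source": "afrovoices_youtube",               # or "orpheus_synthetic"
}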
📄 License
This model is released under an open-source license (Apache-2.0). Please refer to the LICENSE file for full details.
When using this model in your work, please cite both this model and the base model canopylabs/orpheus-3b-0.1-ft, using the following format:
@misc{canopylabsorpheus,
title={Orpheus-3b-0.1-ft: A Multilingual Text-to-Speech Model},
author={Canopy Labs},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/canopylabs/orpheus-3b-0.1-ft}},
note={Fine-tuned version of Orpheus for expressive TTS}
}
@misc{hypaorpheus4bit,
title={Hypa_Orpheus-3b-0.1-ft (LoRA-4bit)},
author={Hypa AI},
year={2025},
note={Fine-tuned Orpheus TTS on African languages},
url={https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit}
}
👏 Acknowledgements
- Canopy Labs team: for creating the base model and open-sourcing it.
- AfroVoices experts: for their translation expertise and high-quality datasets.
- Community support: thanks to all supporters, contributors, and users.
📞 Contact & Contributions
For questions, feedback, or contributions, please open an issue in this repository or reach out to hypa.ai.ng@gmail.com. Contributions are welcome!
💬 Closing Remarks
With the release of Hypa_Orpheus, we hope to advance research and development in multilingual speech technology for African languages.
Hypa AI remains steadfast in its commitment to pioneering intelligent solutions that are not just technologically advanced but also culturally aware, ensuring that the future of AI is as diverse and inclusive as the world it serves.
AfroVoices, a subsidiary of Hypa AI, is dedicated to amplifying African voices, languages, and cultures in the intelligence age. Focused on bridging the digital representation gap, AfroVoices curates datasets and resources for African languages, promoting inclusivity and cultural appreciation in AI technologies. Their mission goes beyond technological innovation, aiming to showcase the richness of African linguistic diversity on the global stage.
💻 Usage Examples
Basic Usage
Unsloth Inference
Install the required packages:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
!pip install unsloth
else:
# Do this only in Colab notebooks! Otherwise use pip install unsloth
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth
!pip install snac
Download the models (the SNAC encoder/decoder and our fine-tuned Hypa_Orpheus):
import torch
from snac import SNAC
from unsloth import FastLanguageModel
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit",
max_seq_length= 2048, # Choose any for long context!
dtype = dtype,
load_in_4bit = load_in_4bit,
#token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")
Create the text prompts, choose a voice, and pass them to the model:
prompts = [
"""Mo nífẹ̀ẹ́sí láti ṣe Ph.D sùgbọ́n mi ò ì tíì pinnu ẹ̀ka tí màá ṣe. Àwọn anfaani tí óń dé oríṣiríṣi àwọn olùgbọ́ káàkiri àgbáyé wo ni mo ní""",
]
chosen_voice = "Eniola" # None for single-speaker
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
snac_model.to("cpu")# Moving snac_model cuda to cpu
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
all_input_ids = []
for prompt in prompts_:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
all_input_ids.append(input_ids)
start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human
all_modified_input_ids = []
for input_ids in all_input_ids:
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
all_modified_input_ids.append(modified_input_ids)
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
padding = max_length - modified_input_ids.shape[1]
padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
all_padded_tensors.append(padded_tensor)
all_attention_masks.append(attention_mask)
all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)
input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
generated_ids = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=1200,
do_sample=True,
temperature=0.6,
top_p=0.95,
repetition_penalty=1.1,
num_return_sequences=1,
eos_token_id=128258,
use_cache = True
)
token_to_find = 128257
token_to_remove = 128258
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
last_occurrence_idx = token_indices[1][-1].item()
cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
cropped_tensor = generated_ids
mask = cropped_tensor != token_to_remove
processed_rows = []
for row in cropped_tensor:
masked_row = row[row != token_to_remove]
processed_rows.append(masked_row)
code_lists = []
for row in processed_rows:
row_length = row.size(0)
new_length = (row_length // 7) * 7
trimmed_row = row[:new_length]
trimmed_row = [t - 128266 for t in trimmed_row]
code_lists.append(trimmed_row)
def redistribute_codes(code_list):
layer_1 = []
layer_2 = []
layer_3 = []
for i in range((len(code_list)+1)//7):
layer_1.append(code_list[7*i])
layer_2.append(code_list[7*i+1]-4096)
layer_3.append(code_list[7*i+2]-(2*4096))
layer_3.append(code_list[7*i+3]-(3*4096))
layer_2.append(code_list[7*i+4]-(4*4096))
layer_3.append(code_list[7*i+5]-(5*4096))
layer_3.append(code_list[7*i+6]-(6*4096))
codes = [torch.tensor(layer_1).unsqueeze(0),
torch.tensor(layer_2).unsqueeze(0),
torch.tensor(layer_3).unsqueeze(0)]
# codes = [c.to("cuda") for c in codes]
audio_hat = snac_model.decode(codes)
return audio_hat
my_samples = []
for code_list in code_lists:
samples = redistribute_codes(code_list)
my_samples.append(samples)
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
raise Exception("Number of prompts and samples do not match")
else:
for i in range(len(my_samples)):
print(prompts[i])
samples = my_samples[i]
display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
# Clean up to save RAM
del my_samples,samples
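If you would rather save the generated audio to disk than only play it inline, the short sketch below shows one way to do so. It assumes the soundfile package is available (pip install soundfile), which is not installed by the commands above.

# Optional: write each generated sample to a 24 kHz WAV file.
# Run this inside the display loop above, before `del my_samples, samples`.
import soundfile as sf  # assumption: soundfile is installed separately

for i, samples in enumerate(my_samples):
    waveform = samples.detach().squeeze().to("cpu").numpy()
    sf.write(f"orpheus_sample_{i}.wav", waveform, samplerate=24000)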
Standard Inference
Install the required packages:
%%capture
!pip install snac ipywebrtc
Download the models (SNAC and Hypa_Orpheus):
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, AutoTokenizer
from snac import SNAC
# Loads the pre-trained SNAC model and moves it to the CPU.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model #.to("cpu")
print("We have loaded the Encoder/Decoder model to the cpu, to use vram - use the gpu for faster inference")
# Loading the Orpheus Model and Tokenizer, moving the model to the GPU for faster inference
model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
Create the prompts and choose a voice and emotions as needed:
# List of supported voices in Orpheus-TTS
voices = [
"Eniola", "tara", # Female, conversational, clear
"Moyo", "leah", # Female, warm, gentle
"Gift", "jess", # Female, energetic, youthful
"Prince", "leo", # Male, authoritative, deep
"Emmanuel", "dan", # Male, friendly, casual
"Cynthia", "mia", # Female, professional, articulate
"Kolade", "zac", # Male, enthusiastic, dynamic
"Lovelyn", "zoe" # Female, calm, soothing
]
# List of supported emotion tags in Orpheus-TTS
emotions = [
"<laugh>", # Laughter
"<chuckle>", # Soft chuckle
"<sigh>", # Sighing
"<cough>", # Coughing
"<sniffle>", # Sniffling
"<groan>", # Groaning
"<yawn>", # Yawning
"<gasp>" # Gasping
]
# Creating Prompts
prompts = [
"Hey there my name is Eniola 9000, and I'm a speech generation model that can sound like a person.",
# "I've also been taught to understand and produce paralinguistic things like sighing, or chuckling, or yawning!",
# "I live in San Francisco, and have, uhm let's see, 3 billion 7 hundred ... well, lets just say a lot of parameters.",
]
chosen_voice = "Eniola" # "tara" # see github for other voices
prompts = [f"{chosen_voice}: " + p for p in prompts] # Creating the prompts (as a batch)
print(prompts)
Tokenize the prompts into input IDs, pad them, and create the attention masks:
# Tokenizing each prompt into input IDs.
all_input_ids = []
for prompt in prompts:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
all_input_ids.append(input_ids)
# Adds special tokens to mark the beginning and end of each prompt
start_token = torch.tensor([[128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human
all_modified_input_ids = []
for input_ids in all_input_ids:
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
all_modified_input_ids.append(modified_input_ids)
# Padding All sequences to same length and creating corresponding attention masks
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
padding = max_length - modified_input_ids.shape[1]
# Left Padding
padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
all_padded_tensors.append(padded_tensor)
all_attention_masks.append(attention_mask)
all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)
# Moving all padded sequences to GPU for Faster computation
input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
Generate output tokens from the model and parse the output into speech:
print("*** Model.generate is slow - see vllm implementation on github for realtime streaming and inference")
print("*** Increase/decrease inference params for more expressive less stable generations")
# Generating Output Tokens
with torch.no_grad():
generated_ids = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=1200,
do_sample=True,
temperature=0.6,
top_p=0.95,
repetition_penalty=1.1,
num_return_sequences=1,
eos_token_id=128258,
)
# Processing Generated Tokens (Parse Output as speech)
token_to_find = 128257 # Start of Audio token (relevant output)
token_to_remove = 128258 # End/ Terminal Token (End of Audio/ relevant output)
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
print(token_indices)
# Slices the tensor to exclude unwanted tokens.
if len(token_indices[1]) > 0:
last_occurrence_idx = token_indices[1][-1].item()
cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
cropped_tensor = generated_ids
# mask = cropped_tensor != token_to_remove
# Storing the cleaned-up token sequences#
processed_rows = []
for row in cropped_tensor:
masked_row = row[row != token_to_remove]
processed_rows.append(masked_row)
# Preparing (Audio Codes) the token sequences for audio decoding by trimming and adjusting token values.
code_lists = []
for row in processed_rows:
row_length = row.size(0) # Determines the length of the token sequence.
new_length = (row_length // 7) * 7 # Ensures the sequence length is a multiple of 7, as required by the decoder.
trimmed_row = row[:new_length]
trimmed_row = [t - 128266 for t in trimmed_row] # Adjusts token values to match the expected input range for the decoder.
code_lists.append(trimmed_row)
Decode the output with the SNAC decoder:
# Processes the token sequences into the format expected by the SNAC decoder:
def redistribute_codes(code_list):
""" Reorganizes the flattened token list into three separate layers, adjusting each token's value to align with the decoder's expectations"""
layer_1 = [] # Coarsest layer
layer_2 = [] # Intermediate layer
layer_3 = [] # Finest layer
num_groups = (len(code_list) + 1) // 7 #Calculate the number of complete 7-token groups in the code_list
for i in range(num_groups):
idx = 7 * i # starting index for the current group
# Layer 1 receives the first token of the group
layer_1.append(code_list[idx])
# Layer 2 receives the second token, adjusted by subtracting 4096
layer_2.append(code_list[idx + 1] - 4096)
# Layer 3 receives the third and fourth tokens, adjusted by subtracting 8192 and 12288 respectively
layer_3.append(code_list[idx+2]-(2*4096))
layer_3.append(code_list[idx+3]-(3*4096))
# Layer 2 receives the fifth token, adjusted by subtracting 16384
layer_2.append(code_list[idx+4]-(4*4096))
# Layer 3 receives the sixth and seventh tokens, adjusted by subtracting 20480 and 24576 respectively
layer_3.append(code_list[idx+5]-(5*4096))
layer_3.append(code_list[idx+6]-(6*4096))
codes = [
torch.tensor(layer_1).unsqueeze(0), # Shape: (1, len(layer_1))
torch.tensor(layer_2).unsqueeze(0), # Shape: (1, len(layer_2))
torch.tensor(layer_3).unsqueeze(0) # Shape: (1, len(layer_3))
] # Convert the lists to PyTorch tensors and add a batch dimension
audio_hat = snac_model.decode(codes) # Decode the structured codes into an audio waveform using the SNAC model
return audio_hat
my_samples = []
for code_list in code_lists:
samples = redistribute_codes(code_list) # Generates audio samples from the processed token sequences
my_samples.append(samples)
# Display Audio
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
raise Exception("Number of prompts and samples do not match")
else:
for i in range(len(my_samples)):
print(prompts[i])
samples = my_samples[i]
display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
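As the print statement earlier notes, SNAC decoding runs on the CPU in this walkthrough. If you prefer to decode on the GPU, a minimal variant of the decoding helper might look like the sketch below (same 7-token layer layout as redistribute_codes, with the SNAC model and code tensors moved to CUDA); treat it as a convenience sketch rather than part of the official pipeline.

# Hedged sketch: decode with SNAC on the GPU instead of the CPU.
snac_model = snac_model.to("cuda")

def redistribute_codes_gpu(code_list):
    """Same layer layout as redistribute_codes above, but decoded on the GPU."""
    layer_1, layer_2, layer_3 = [], [], []
    for i in range((len(code_list) + 1) // 7):
        idx = 7 * i
        layer_1.append(code_list[idx])
        layer_2.append(code_list[idx + 1] - 4096)
        layer_3.append(code_list[idx + 2] - 2 * 4096)
        layer_3.append(code_list[idx + 3] - 3 * 4096)
        layer_2.append(code_list[idx + 4] - 4 * 4096)
        layer_3.append(code_list[idx + 5] - 5 * 4096)
        layer_3.append(code_list[idx + 6] - 6 * 4096)
    codes = [torch.tensor(layer).unsqueeze(0).to("cuda")
             for layer in (layer_1, layer_2, layer_3)]
    with torch.no_grad():
        audio_hat = snac_model.decode(codes)
    return audio_hat.cpu()  # move back to CPU for display or saving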
- Repository: [N/A]
- Paper: [N/A]
- Demo: [N/A]
This Llama-based model was trained 2x faster with Unsloth and Hugging Face's TRL library.




