🚀 Hypa_Orpheus-3b-0.1-ft (merged 16-bit)
This is a merged, 16-bit version of canopylabs/orpheus-3b-0.1-ft after fine-tuning, offering memory-efficient inference. It was optimized with Unsloth and LoRA for expressive, multilingual text-to-speech (TTS), with a particular focus on low-resource African languages. The model supports:
- Text-to-speech generation
- Speech synthesis for under-represented accents
- Voice cloning and emotional synthesis
- Multilingual, low-resource speech AI research
📚 Detailed Documentation
Model Overview
The model was trained on a parallel text-speech dataset containing over 300 hours (75k samples) of Nigerian-accented and low-resource-language audio (Igbo, Yoruba, Hausa). A key portion of the dataset comes from AfroVoices transcriptions of real-world YouTube data (labeled as random speakers, roughly 100+ hours). To preserve and strengthen multilingual ability while avoiding catastrophic forgetting, we also included synthetic speech-text data sampled from the original eight Orpheus voices using the default emotion prompts. The final training set additionally introduces new speakers, such as:
- Eniola (40 hrs) – female, bold, clear
- Moyo (40 hrs) – female, professional, articulate
- Lovelyn (35 hrs) – female, warm, shy
- Precious (30 hrs) – female, friendly, gentle
The model achieves state-of-the-art performance on low-resource, multilingual TTS tasks across African languages (see the training statistics below).
Base Model Details
The default Orpheus-TTS model released by Canopy Labs supports the following voices and emotions:
- Voices: `tara`, `leah`, `jess`, `leo`, `dan`, `mia`, `zac`, and `zoe`.
- Emotions: `<laugh>`, `<chuckle>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, and `<gasp>`.
Through the generation and addition of synthetic data, our fine-tuned model retains these voices and emotions as well. For more information on the voices and emotions, see the default model's card page.
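For reference, prompts follow the same "voice-prefix" pattern used in the usage examples further down this card, with emotion tags embedded directly in the text. A minimal sketch (the sentence itself is only an illustration):

# Prompt format: "<voice>: <text>", with optional inline emotion tags such as <sigh> or <laugh>.
voice = "Eniola"  # any retained or newly added voice
text = "I finally finished the recipe <sigh> after three attempts."  # illustrative sentence only
prompt = f"{voice}: {text}"
print(prompt)  # Eniola: I finally finished the recipe <sigh> after three attempts.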
Sample Generations
🎧 Listen to samples generated by Hypa Orpheus-TTS
| Text Input | Audio Output | Language | Voice |
| --- | --- | --- | --- |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Emmanuel |
| Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya? | | Hausa | Emmanuel |
| Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya? | | Hausa | Eniola |
| Èmi máa se oúnjẹ fún àwọn àlejò l'ọ́la mo sì nílò láti mọ bí wọn ti ńṣe aioli. Ṣe o lè fún mi ni àwọn ìlànà ìdáná ẹlẹ́sẹẹsẹ? | | Yoruba | Eniola |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Eniola |
| M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka? | | Igbo | Eniola |
| M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka? | | Igbo | Lovelyn |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Lovelyn |
🔧 Technical Details
Training Overview
- Base model: canopylabs/orpheus-3b-0.1-ft
- Training engine: Unsloth + LoRA
- LoRA configuration: r = 1024, alpha = 1024, dropout = 0.0, adaptation of all attention + FFN layers (a hedged configuration sketch follows the loss table below)
- Quantization: 4-bit (bnb) during training; the final merged model is memory-efficient
- Total steps: 18,014 (1 epoch)
- Batch size: 1 × 4 (gradient accumulation)
- GPU: A100 40GB (peak VRAM usage ~55%)
| Step | Training Loss | Validation Loss |
| --- | --- | --- |
| 5,000 | 3.9496 | 3.8790 |
| 10,000 | 3.8863 | 3.79497 |
| 15,000 | 3.8544 | 3.75323 |
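For readers who want to reproduce a similar setup, the sketch below shows how the listed hyperparameters might be wired together with Unsloth. It is a hedged sketch, not the official training script: the target-module names are assumed from the underlying Llama architecture, and dataset preparation is omitted.

# Hedged sketch of the LoRA setup described above (not the released training script).
from unsloth import FastLanguageModel

# Load the base model in 4-bit (bnb), as used during training.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="canopylabs/orpheus-3b-0.1-ft",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA over all attention + FFN projections; module names are assumptions based on Llama.
model = FastLanguageModel.get_peft_model(
    model,
    r=1024,
    lora_alpha=1024,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Training then ran for 1 epoch (18,014 steps) with per-device batch size 1 and
# gradient accumulation 4 on a single A100 40GB; trainer wiring and dataset
# preparation are not shown because the exact script has not been released.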
Dataset Overview
- Sources:
  - ✅ Manually aligned YouTube transcriptions (i.e., the random-speaker data)
  - ✅ Synthetic speech generated with Orpheus TTS
  - ✅ Parallel text-audio pairs in African-accented English, Igbo, Yoruba, and Hausa
- Total duration: 300+ hours (multi-accent)
- Key speakers: 45+ unique voices (see the speaker distribution chart below)
We plan to open-source the full dataset soon, as we did with the Hypa_Fleurs project.
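Until then, the exact record schema is not public. Purely as an illustration, a single parallel text-speech entry could be organized along these lines; every field name and value below is an assumption, not the released format.

# Hypothetical layout of one parallel text-speech record (all field names are assumptions).
sample = {
    "speaker": "Eniola",                          # one of the 45+ voices
    "language": "yoruba",                         # english / igbo / yoruba / hausa
    "text": "Èmi máa se oúnjẹ fún àwọn àlejò ...",  # transcription
    "audio_path": "audio/eniola/clip_000123.wav",   # 24 kHz waveform on disk
    "source": "afrovoices_youtube",               # or "orpheus_synthetic"
}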
📄 License
This model is released under an open-source license (Apache-2.0). Please refer to the LICENSE file for full details.
When using this model in your work, please cite both this model and the base model canopylabs/orpheus-3b-0.1-ft, using the following format:
@misc{canopylabsorpheus,
title={Orpheus-3b-0.1-ft: A Multilingual Text-to-Speech Model},
author={Canopy Labs},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/canopylabs/orpheus-3b-0.1-ft}},
note={Fine-tuned version of Orpheus for expressive TTS}
}
@misc{hypaorpheus4bit,
title={Hypa_Orpheus-3b-0.1-ft (LoRA-4bit)},
author={Hypa AI},
year={2025},
note={Fine-tuned Orpheus TTS on African languages},
url={https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit}
}
👏 Acknowledgements
- Canopy Labs team: for creating the base model and open-sourcing it.
- AfroVoices experts: for their translation expertise and high-quality datasets.
- Community support: thanks to all supporters, contributors, and users.
📞 Contact & Contributions
For questions, feedback, or contributions, please open an issue in this repository or reach out to hypa.ai.ng@gmail.com. Contributions are welcome!
💬 Closing Remarks
With the release of Hypa_Orpheus, we hope to advance research and development in multilingual speech technology for African languages.
Hypa AI remains steadfast in its commitment to pioneering intelligent solutions that are not just technologically advanced but also culturally aware, ensuring that the future of AI is as diverse and inclusive as the world it serves.
AfroVoices, a subsidiary of Hypa AI, is dedicated to amplifying African voices, languages, and cultures in the intelligence age. Focused on bridging the digital representation gap, AfroVoices curates datasets and resources for African languages, promoting inclusivity and cultural appreciation in AI technologies. Their mission goes beyond technological innovation, aiming to showcase the richness of African linguistic diversity on the global stage.
💻 Usage Examples
Basic Usage
Unsloth Inference
Install the required packages:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
!pip install unsloth
else:
# Do this only in Colab notebooks! Otherwise use pip install unsloth
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth
!pip install snac
Download the models (the SNAC encoder/decoder and our fine-tuned Hypa_Orpheus):
import torch
from snac import SNAC
from unsloth import FastLanguageModel
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit",
max_seq_length= 2048, # Choose any for long context!
dtype = dtype,
load_in_4bit = load_in_4bit,
#token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")
Create the text prompts, choose a voice, and pass them to the model:
prompts = [
"""Mo nífẹ̀ẹ́sí láti ṣe Ph.D sùgbọ́n mi ò ì tíì pinnu ẹ̀ka tí màá ṣe. Àwọn anfaani tí óń dé oríṣiríṣi àwọn olùgbọ́ káàkiri àgbáyé wo ni mo ní""",
]
chosen_voice = "Eniola" # None for single-speaker
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
snac_model.to("cpu")# Moving snac_model cuda to cpu
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
all_input_ids = []
for prompt in prompts_:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
all_input_ids.append(input_ids)
start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human
all_modified_input_ids = []
for input_ids in all_input_ids:
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
all_modified_input_ids.append(modified_input_ids)
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
padding = max_length - modified_input_ids.shape[1]
padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
all_padded_tensors.append(padded_tensor)
all_attention_masks.append(attention_mask)
all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)
input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
generated_ids = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=1200,
do_sample=True,
temperature=0.6,
top_p=0.95,
repetition_penalty=1.1,
num_return_sequences=1,
eos_token_id=128258,
use_cache = True
)
token_to_find = 128257
token_to_remove = 128258
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
last_occurrence_idx = token_indices[1][-1].item()
cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
cropped_tensor = generated_ids
mask = cropped_tensor != token_to_remove
processed_rows = []
for row in cropped_tensor:
masked_row = row[row != token_to_remove]
processed_rows.append(masked_row)
code_lists = []
for row in processed_rows:
row_length = row.size(0)
new_length = (row_length // 7) * 7
trimmed_row = row[:new_length]
trimmed_row = [t - 128266 for t in trimmed_row]
code_lists.append(trimmed_row)
def redistribute_codes(code_list):
layer_1 = []
layer_2 = []
layer_3 = []
for i in range((len(code_list)+1)//7):
layer_1.append(code_list[7*i])
layer_2.append(code_list[7*i+1]-4096)
layer_3.append(code_list[7*i+2]-(2*4096))
layer_3.append(code_list[7*i+3]-(3*4096))
layer_2.append(code_list[7*i+4]-(4*4096))
layer_3.append(code_list[7*i+5]-(5*4096))
layer_3.append(code_list[7*i+6]-(6*4096))
codes = [torch.tensor(layer_1).unsqueeze(0),
torch.tensor(layer_2).unsqueeze(0),
torch.tensor(layer_3).unsqueeze(0)]
# codes = [c.to("cuda") for c in codes]
audio_hat = snac_model.decode(codes)
return audio_hat
my_samples = []
for code_list in code_lists:
samples = redistribute_codes(code_list)
my_samples.append(samples)
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
raise Exception("Number of prompts and samples do not match")
else:
for i in range(len(my_samples)):
print(prompts[i])
samples = my_samples[i]
display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
# Clean up to save RAM
del my_samples,samples
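If you would rather save the generated audio to disk than only play it inline, the short sketch below shows one way to do so. It assumes the soundfile package is available (pip install soundfile), which is not installed by the commands above.

# Optional: write each generated sample to a 24 kHz WAV file.
# Run this inside the display loop above, before `del my_samples, samples`.
import soundfile as sf  # assumption: soundfile is installed separately

for i, samples in enumerate(my_samples):
    waveform = samples.detach().squeeze().to("cpu").numpy()
    sf.write(f"orpheus_sample_{i}.wav", waveform, samplerate=24000)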
Standard Inference
Install the required packages:
%%capture
!pip install snac ipywebrtc
Download the models (SNAC and Hypa_Orpheus):
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, AutoTokenizer
from snac import SNAC
# Loads the pre-trained SNAC model and moves it to the CPU.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model #.to("cpu")
print("We have loaded the Encoder/Decoder model to the cpu, to use vram - use the gpu for faster inference")
# Loading the Orpheus Model and Tokenizer, moving the model to the GPU for faster inference
model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
Create the prompts and choose a voice and emotions as needed:
# List of supported voices in Orpheus-TTS
voices = [
"Eniola", "tara", # Female, conversational, clear
"Moyo", "leah", # Female, warm, gentle
"Gift", "jess", # Female, energetic, youthful
"Prince", "leo", # Male, authoritative, deep
"Emmanuel", "dan", # Male, friendly, casual
"Cynthia", "mia", # Female, professional, articulate
"Kolade", "zac", # Male, enthusiastic, dynamic
"Lovelyn", "zoe" # Female, calm, soothing
]
# List of supported emotion tags in Orpheus-TTS
emotions = [
"<laugh>", # Laughter
"<chuckle>", # Soft chuckle
"<sigh>", # Sighing
"<cough>", # Coughing
"<sniffle>", # Sniffling
"<groan>", # Groaning
"<yawn>", # Yawning
"<gasp>" # Gasping
]
# Creating Prompts
prompts = [
"Hey there my name is Eniola 9000, and I'm a speech generation model that can sound like a person.",
# "I've also been taught to understand and produce paralinguistic things like sighing, or chuckling, or yawning!",
# "I live in San Francisco, and have, uhm let's see, 3 billion 7 hundred ... well, lets just say a lot of parameters.",
]
chosen_voice = "Eniola" # "tara" # see github for other voices
prompts = [f"{chosen_voice}: " + p for p in prompts] # Creating the prompts (as a batch)
print(prompts)
Tokenize the prompts into input IDs, pad them, and create the attention masks:
# Tokenizing each prompt into input IDs.
all_input_ids = []
for prompt in prompts:
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
all_input_ids.append(input_ids)
# Adds special tokens to mark the beginning and end of each prompt
start_token = torch.tensor([[128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human
all_modified_input_ids = []
for input_ids in all_input_ids:
modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
all_modified_input_ids.append(modified_input_ids)
# Padding All sequences to same length and creating corresponding attention masks
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
padding = max_length - modified_input_ids.shape[1]
# Left Padding
padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
all_padded_tensors.append(padded_tensor)
all_attention_masks.append(attention_mask)
all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)
# Moving all padded sequences to GPU for Faster computation
input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
Generate output tokens from the model and parse the output into speech:
print("*** Model.generate is slow - see vllm implementation on github for realtime streaming and inference")
print("*** Increase/decrease inference params for more expressive less stable generations")
# Generating Output Tokens
with torch.no_grad():
generated_ids = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=1200,
do_sample=True,
temperature=0.6,
top_p=0.95,
repetition_penalty=1.1,
num_return_sequences=1,
eos_token_id=128258,
)
# Processing Generated Tokens (Parse Output as speech)
token_to_find = 128257 # Start of Audio token (relevant output)
token_to_remove = 128258 # End/ Terminal Token (End of Audio/ relevant output)
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
print(token_indices)
# Slices the tensor to exclude unwanted tokens.
if len(token_indices[1]) > 0:
last_occurrence_idx = token_indices[1][-1].item()
cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
cropped_tensor = generated_ids
# mask = cropped_tensor != token_to_remove
# Storing the cleaned-up token sequences#
processed_rows = []
for row in cropped_tensor:
masked_row = row[row != token_to_remove]
processed_rows.append(masked_row)
# Preparing (Audio Codes) the token sequences for audio decoding by trimming and adjusting token values.
code_lists = []
for row in processed_rows:
row_length = row.size(0) # Determines the length of the token sequence.
new_length = (row_length // 7) * 7 # Ensures the sequence length is a multiple of 7, as required by the decoder.
trimmed_row = row[:new_length]
trimmed_row = [t - 128266 for t in trimmed_row] # Adjusts token values to match the expected input range for the decoder.
code_lists.append(trimmed_row)
Decode the output with the SNAC decoder:
# Processes the token sequences into the format expected by the SNAC decoder:
def redistribute_codes(code_list):
""" Reorganizes the flattened token list into three separate layers, adjusting each token's value to align with the decoder's expectations"""
layer_1 = [] # Coarsest layer
layer_2 = [] # Intermediate layer
layer_3 = [] # Finest layer
num_groups = (len(code_list) + 1) // 7 #Calculate the number of complete 7-token groups in the code_list
for i in range(num_groups):
idx = 7 * i # starting index for the current group
# Layer 1 receives the first token of the group
layer_1.append(code_list[idx])
# Layer 2 receives the second token, adjusted by subtracting 4096
layer_2.append(code_list[idx + 1] - 4096)
# Layer 3 receives the third and fourth tokens, adjusted by subtracting 8192 and 12288 respectively
layer_3.append(code_list[idx+2]-(2*4096))
layer_3.append(code_list[idx+3]-(3*4096))
# Layer 2 receives the fifth token, adjusted by subtracting 16384
layer_2.append(code_list[idx+4]-(4*4096))
# Layer 3 receives the sixth and seventh tokens, adjusted by subtracting 20480 and 24576 respectively
layer_3.append(code_list[idx+5]-(5*4096))
layer_3.append(code_list[idx+6]-(6*4096))
codes = [
torch.tensor(layer_1).unsqueeze(0), # Shape: (1, len(layer_1))
torch.tensor(layer_2).unsqueeze(0), # Shape: (1, len(layer_2))
torch.tensor(layer_3).unsqueeze(0) # Shape: (1, len(layer_3))
] # Convert the lists to PyTorch tensors and add a batch dimension
audio_hat = snac_model.decode(codes) # Decode the structured codes into an audio waveform using the SNAC model
return audio_hat
my_samples = []
for code_list in code_lists:
samples = redistribute_codes(code_list) # Generates audio samples from the processed token sequences
my_samples.append(samples)
# Display Audio
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
raise Exception("Number of prompts and samples do not match")
else:
for i in range(len(my_samples)):
print(prompts[i])
samples = my_samples[i]
display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
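As the print statement earlier notes, SNAC decoding runs on the CPU in this walkthrough. If you prefer to decode on the GPU, a minimal variant of the decoding helper might look like the sketch below (same 7-token layer layout as redistribute_codes, with the SNAC model and code tensors moved to CUDA); treat it as a convenience sketch rather than part of the official pipeline.

# Hedged sketch: decode with SNAC on the GPU instead of the CPU.
snac_model = snac_model.to("cuda")

def redistribute_codes_gpu(code_list):
    """Same layer layout as redistribute_codes above, but decoded on the GPU."""
    layer_1, layer_2, layer_3 = [], [], []
    for i in range((len(code_list) + 1) // 7):
        idx = 7 * i
        layer_1.append(code_list[idx])
        layer_2.append(code_list[idx + 1] - 4096)
        layer_3.append(code_list[idx + 2] - 2 * 4096)
        layer_3.append(code_list[idx + 3] - 3 * 4096)
        layer_2.append(code_list[idx + 4] - 4 * 4096)
        layer_3.append(code_list[idx + 5] - 5 * 4096)
        layer_3.append(code_list[idx + 6] - 6 * 4096)
    codes = [torch.tensor(layer).unsqueeze(0).to("cuda")
             for layer in (layer_1, layer_2, layer_3)]
    with torch.no_grad():
        audio_hat = snac_model.decode(codes)
    return audio_hat.cpu()  # move back to CPU for display or saving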
- Repository: [N/A]
- Paper: [N/A]
- Demo: [N/A]
This Llama-based model was trained 2x faster with Unsloth and Hugging Face's TRL library.




