🚀 Hypa_Orpheus-3b-0.1-ft (merged 16-bit)
This is a merged, 16-bit fine-tuned version of canopylabs/orpheus-3b-0.1-ft with a memory-efficient footprint. It was optimized with Unsloth and LoRA and is intended for expressive, multilingual text-to-speech (TTS), with a particular strength in low-resource African languages. The model supports:
- Text-to-speech generation
- Speech synthesis for under-represented accents
- Voice cloning and emotional synthesis
- Research on multilingual, low-resource speech AI
📚 Detailed Documentation
Model Overview
This model was trained on a parallel text-speech dataset containing over 300 hours (75k samples) of Nigerian-accented and low-resource-language audio (Igbo, Yoruba, Hausa). A key portion of the dataset comes from AfroVoices transcriptions of real-world YouTube data (labelled as random speakers, roughly 100+ hours).
To preserve and strengthen multilingual capability while avoiding catastrophic forgetting, we included synthetic speech-text data sampled from the original eight Orpheus voices using the default emotion prompts.
The final training set also introduces new speakers, such as:
- Eniola (40 hrs) – female, bold, clear
- Moyo (40 hrs) – female, professional, articulate
- Lovelyn (35 hrs) – female, warm, shy
- Precious (30 hrs) – female, friendly, gentle
The model achieves state-of-the-art performance on low-resource, multilingual TTS across African languages (see the training statistics below).
Base Model Details
The default Orpheus-TTS model released by Canopy Labs supports the following voices and emotions:
- Voices: `tara`, `leah`, `jess`, `leo`, `dan`, `mia`, `zac`, and `zoe`.
- Emotions: `<laugh>`, `<chuckle>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, and `<gasp>`.
Through the generation and inclusion of synthetic data, our fine-tuned model retains these voices and emotions. For more information on the voices and emotions, see the default model's card page.
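As in the base model, a prompt is simply the chosen voice name, a colon, and the text to speak, with emotion tags placed inline where desired. A minimal illustration (the speaker and sentence below are invented for this example; the `"voice: text"` convention matches the inference code later in this card):

```python
# Prompt format: "<voice>: <text>", with optional emotion tags such as <laugh> inline.
voice = "Eniola"                                                    # any supported voice
text = "I did not expect that at all <laugh> but it worked out fine."
prompt = f"{voice}: {text}"
print(prompt)  # -> "Eniola: I did not expect that at all <laugh> but it worked out fine."
```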
Sample Generations
🎧 Listen to samples generated by Hypa Orpheus-TTS
| Text Input | Audio Output | Language | Voice |
|---|---|---|---|
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Emmanuel |
| Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya? | | Hausa | Emmanuel |
| Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya? | | Hausa | Eniola |
| Èmi máa se oúnjẹ fún àwọn àlejò l'ọ́la mo sì nílò láti mọ bí wọn ti ńṣe aioli. Ṣe o lè fún mi ni àwọn ìlànà ìdáná ẹlẹ́sẹẹsẹ? | | Yoruba | Eniola |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Eniola |
| M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka? | | Igbo | Eniola |
| M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka? | | Igbo | Lovelyn |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Lovelyn |
🔧 Technical Details
Training Overview
- Base model: canopylabs/orpheus-3b-0.1-ft
- Training engine: Unsloth + LoRA
- LoRA configuration: r = 1024, alpha = 1024, dropout = 0.0, full attention + FFN adaptation (a configuration sketch follows below)
- Quantization: 4-bit (bnb) during training; the final merged model is memory-efficient
- Total steps: 18,014 (1 epoch)
- Batch size: 1 × 4 (gradient accumulation)
- GPU: A100 40GB (peak VRAM usage ~55%)
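For reference, the configuration above corresponds roughly to an Unsloth LoRA setup like the sketch below. The target module names, learning rate, and output directory are assumptions for illustration, not the authors' exact training script:

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments

# Load the base model in 4-bit (bnb), as stated in the overview
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="canopylabs/orpheus-3b-0.1-ft",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA over all attention + FFN projections (r = alpha = 1024, dropout = 0.0)
model = FastLanguageModel.get_peft_model(
    model,
    r=1024,
    lora_alpha=1024,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed module names for a Llama-style backbone
)

# Hyperparameters matching the overview; learning rate is a placeholder assumption
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,   # assumption
    output_dir="outputs",
)
```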
Training and validation loss at selected steps:

| Step | Training Loss | Validation Loss |
|---|---|---|
| 5,000 | 3.9496 | 3.8790 |
| 10,000 | 3.8863 | 3.79497 |
| 15,000 | 3.8544 | 3.75323 |
Dataset Overview
- Sources:
  - ✅ Manually aligned YouTube transcriptions (i.e., the random-speaker data)
  - ✅ Synthetic speech generated with Orpheus TTS
  - ✅ Parallel text-audio pairs in African-accented English, Igbo, Yoruba, and Hausa
- Total duration: 300+ hours (multi-accent)
- Key speakers: 45+ unique voices (see the speaker distribution chart below)
We plan to open-source the full dataset soon, as we did with the Hypa_Fleurs project.
📄 License
This model is released under an open-source license (Apache-2.0). Please refer to the LICENSE file for full details.
If you use this model in your work, please cite both this model and the base model canopylabs/orpheus-3b-0.1-ft, as follows:
```bibtex
@misc{canopylabsorpheus,
  title={Orpheus-3b-0.1-ft: A Multilingual Text-to-Speech Model},
  author={Canopy Labs},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/canopylabs/orpheus-3b-0.1-ft}},
  note={Fine-tuned version of Orpheus for expressive TTS}
}

@misc{hypaorpheus4bit,
  title={Hypa_Orpheus-3b-0.1-ft (LoRA-4bit)},
  author={Hypa AI},
  year={2025},
  note={Fine-tuned Orpheus TTS on African languages},
  url={https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit}
}
```
👏 Acknowledgements
- The Canopy Labs team, for creating and open-sourcing the base model.
- The AfroVoices experts, for translation expertise and high-quality datasets.
- Community support: thanks to all supporters, contributors, and users.
📞 Contact & Contributions
For questions, feedback, or contributions, please open an issue in this repository or contact hypa.ai.ng@gmail.com. Contributions are welcome!
💬 Closing Remarks
By releasing Hypa_Orpheus, we hope to advance research and development in multilingual speech technology for African languages.
Hypa AI remains steadfast in its mission to pioneer intelligent solutions that are not only technologically advanced but also culturally aware, ensuring that the future of AI is as diverse and inclusive as the world it serves.
AfroVoices, a subsidiary of Hypa AI, is dedicated to amplifying African voices, languages, and cultures in the intelligence age. Focused on bridging the digital representation gap, AfroVoices curates datasets and resources for African languages, promoting inclusivity and cultural appreciation in AI technologies. Their mission goes beyond technological innovation, aiming to showcase the richness of African linguistic diversity on the global stage.
💻 Usage Examples
Basic Usage
Unsloth Inference
Install the required packages:
```python
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth
!pip install snac
```
Download the models (the SNAC encoder/decoder and our fine-tuned Hypa_Orpheus):
```python
import torch
from snac import SNAC
from unsloth import FastLanguageModel

dtype = None        # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit",
    max_seq_length=2048,  # Choose any for long context!
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token = "hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")
```
Create the text prompts, choose a voice, and pass them to the model:
```python
prompts = [
    """Mo nífẹ̀ẹ́sí láti ṣe Ph.D sùgbọ́n mi ò ì tíì pinnu ẹ̀ka tí màá ṣe. Àwọn anfaani tí óń dé oríṣiríṣi àwọn olùgbọ́ káàkiri àgbáyé wo ni mo ní""",
]
chosen_voice = "Eniola"  # None for single-speaker

FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
snac_model.to("cpu")                    # Moving snac_model from cuda to cpu

prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

# Tokenize each prompt into input IDs
all_input_ids = []
for prompt in prompts_:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    all_input_ids.append(input_ids)

start_token = torch.tensor([[128259]], dtype=torch.int64)         # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)  # SOH SOT Text EOT EOH
    all_modified_input_ids.append(modified_input_ids)

# Left-pad all sequences to the same length and build attention masks
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
    padding = max_length - modified_input_ids.shape[1]
    padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
    attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
    all_padded_tensors.append(padded_tensor)
    all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)
input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")

generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache=True,
)

token_to_find = 128257    # Start of Audio token
token_to_remove = 128258  # End of Audio (terminal) token

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx + 1:]
else:
    cropped_tensor = generated_ids
mask = cropped_tensor != token_to_remove

processed_rows = []
for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

code_lists = []
for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7  # SNAC expects groups of 7 tokens
    trimmed_row = row[:new_length]
    trimmed_row = [t - 128266 for t in trimmed_row]
    code_lists.append(trimmed_row)

def redistribute_codes(code_list):
    layer_1 = []
    layer_2 = []
    layer_3 = []
    for i in range((len(code_list) + 1) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))
    codes = [torch.tensor(layer_1).unsqueeze(0),
             torch.tensor(layer_2).unsqueeze(0),
             torch.tensor(layer_3).unsqueeze(0)]
    # codes = [c.to("cuda") for c in codes]
    audio_hat = snac_model.decode(codes)
    return audio_hat

my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)
    my_samples.append(samples)

from IPython.display import display, Audio
if len(prompts) != len(my_samples):
    raise Exception("Number of prompts and samples do not match")
else:
    for i in range(len(my_samples)):
        print(prompts[i])
        samples = my_samples[i]
        display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))

# Clean up to save RAM
del my_samples, samples
```
Standard Inference
Install the required packages:
```python
%%capture
!pip install snac ipywebrtc
```
Download the models (SNAC and Hypa_Orpheus):
```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, AutoTokenizer
from snac import SNAC

# Loads the pre-trained SNAC model and keeps it on the CPU.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model  # .to("cpu")
print("We have loaded the Encoder/Decoder model to the cpu, to use vram - use the gpu for faster inference")

# Loading the Orpheus model and tokenizer, moving the model to the GPU for faster inference
model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
Create the prompts and choose a voice and emotion as needed:
```python
# List of supported voices in Orpheus-TTS
voices = [
    "Eniola", "tara",   # Female, conversational, clear
    "Moyo", "leah",     # Female, warm, gentle
    "Gift", "jess",     # Female, energetic, youthful
    "Prince", "leo",    # Male, authoritative, deep
    "Emmanuel", "dan",  # Male, friendly, casual
    "Cynthia", "mia",   # Female, professional, articulate
    "Kolade", "zac",    # Male, enthusiastic, dynamic
    "Lovelyn", "zoe",   # Female, calm, soothing
]

# List of supported emotion tags in Orpheus-TTS
emotions = [
    "<laugh>",    # Laughter
    "<chuckle>",  # Soft chuckle
    "<sigh>",     # Sighing
    "<cough>",    # Coughing
    "<sniffle>",  # Sniffling
    "<groan>",    # Groaning
    "<yawn>",     # Yawning
    "<gasp>",     # Gasping
]

# Creating prompts
prompts = [
    "Hey there my name is Eniola 9000, and I'm a speech generation model that can sound like a person.",
    # "I've also been taught to understand and produce paralinguistic things like sighing, or chuckling, or yawning!",
    # "I live in San Francisco, and have, uhm let's see, 3 billion 7 hundred ... well, lets just say a lot of parameters.",
]

chosen_voice = "Eniola"  # "tara" # see github for other voices
prompts = [f"{chosen_voice}: " + p for p in prompts]  # Creating the prompts (as a batch)
print(prompts)
```
Tokenize the prompts into input IDs, pad them, and create attention masks:
```python
# Tokenizing each prompt into input IDs.
all_input_ids = []
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    all_input_ids.append(input_ids)

# Adds special tokens to mark the beginning and end of each prompt
start_token = torch.tensor([[128259]], dtype=torch.int64)         # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)  # SOH SOT Text EOT EOH
    all_modified_input_ids.append(modified_input_ids)

# Padding all sequences to the same length and creating corresponding attention masks
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
    padding = max_length - modified_input_ids.shape[1]
    # Left padding
    padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
    attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
    all_padded_tensors.append(padded_tensor)
    all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Moving all padded sequences to GPU for faster computation
input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
```
Generate output tokens from the model and parse them into speech:
print("*** Model.generate is slow - see vllm implementation on github for realtime streaming and inference")
print("*** Increase/decrease inference params for more expressive less stable generations")
# Generating Output Tokens
with torch.no_grad():
generated_ids = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=1200,
do_sample=True,
temperature=0.6,
top_p=0.95,
repetition_penalty=1.1,
num_return_sequences=1,
eos_token_id=128258,
)
# Processing Generated Tokens (Parse Output as speech)
token_to_find = 128257 # Start of Audio token (relevant output)
token_to_remove = 128258 # End/ Terminal Token (End of Audio/ relevant output)
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
print(token_indices)
# Slices the tensor to exclude unwanted tokens.
if len(token_indices[1]) > 0:
last_occurrence_idx = token_indices[1][-1].item()
cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
cropped_tensor = generated_ids
# mask = cropped_tensor != token_to_remove
# Storing the cleaned-up token sequences#
processed_rows = []
for row in cropped_tensor:
masked_row = row[row != token_to_remove]
processed_rows.append(masked_row)
# Preparing (Audio Codes) the token sequences for audio decoding by trimming and adjusting token values.
code_lists = []
for row in processed_rows:
row_length = row.size(0) # Determines the length of the token sequence.
new_length = (row_length // 7) * 7 # Ensures the sequence length is a multiple of 7, as required by the decoder.
trimmed_row = row[:new_length]
trimmed_row = [t - 128266 for t in trimmed_row] # Adjusts token values to match the expected input range for the decoder.
code_lists.append(trimmed_row)
Decode the output with the SNAC decoder:
```python
# Processes the token sequences into the format expected by the SNAC decoder:
def redistribute_codes(code_list):
    """Reorganizes the flattened token list into three separate layers, adjusting each token's value to align with the decoder's expectations."""
    layer_1 = []  # Coarsest layer
    layer_2 = []  # Intermediate layer
    layer_3 = []  # Finest layer
    num_groups = (len(code_list) + 1) // 7  # Calculate the number of complete 7-token groups in the code_list
    for i in range(num_groups):
        idx = 7 * i  # Starting index for the current group
        # Layer 1 receives the first token of the group
        layer_1.append(code_list[idx])
        # Layer 2 receives the second token, adjusted by subtracting 4096
        layer_2.append(code_list[idx + 1] - 4096)
        # Layer 3 receives the third and fourth tokens, adjusted by subtracting 8192 and 12288 respectively
        layer_3.append(code_list[idx + 2] - (2 * 4096))
        layer_3.append(code_list[idx + 3] - (3 * 4096))
        # Layer 2 receives the fifth token, adjusted by subtracting 16384
        layer_2.append(code_list[idx + 4] - (4 * 4096))
        # Layer 3 receives the sixth and seventh tokens, adjusted by subtracting 20480 and 24576 respectively
        layer_3.append(code_list[idx + 5] - (5 * 4096))
        layer_3.append(code_list[idx + 6] - (6 * 4096))
    # Convert the lists to PyTorch tensors and add a batch dimension
    codes = [
        torch.tensor(layer_1).unsqueeze(0),  # Shape: (1, len(layer_1))
        torch.tensor(layer_2).unsqueeze(0),  # Shape: (1, len(layer_2))
        torch.tensor(layer_3).unsqueeze(0),  # Shape: (1, len(layer_3))
    ]
    audio_hat = snac_model.decode(codes)  # Decode the structured codes into an audio waveform using the SNAC model
    return audio_hat

my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)  # Generates audio samples from the processed token sequences
    my_samples.append(samples)

# Display audio
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
    raise Exception("Number of prompts and samples do not match")
else:
    for i in range(len(my_samples)):
        print(prompts[i])
        samples = my_samples[i]
        display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
```
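For completeness, encoding is simply the inverse of `redistribute_codes`: SNAC's three codebook layers are flattened into seven tokens per frame with the same 4096-sized offsets, shifted by 128266, and framed by the start-of-audio (128257) and end-of-audio (128258) tokens. The sketch below illustrates that layout under a few assumptions (torchaudio for loading/resampling, `snac_model.encode` run on CPU, and a hypothetical `encode_example` helper); it is not the authors' training preprocessing script:

```python
import torch
import torchaudio
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

def encode_example(speaker, text, wav_path, tokenizer):
    """Illustrative only: build one training-style token sequence from a (speaker, text, audio) triple."""
    # Text side: "<speaker>: <text>" wrapped in the same special tokens used above
    text_ids = tokenizer(f"{speaker}: {text}", return_tensors="pt").input_ids
    text_ids = torch.cat([
        torch.tensor([[128259]], dtype=torch.int64),          # start of human
        text_ids,
        torch.tensor([[128009, 128260]], dtype=torch.int64),  # end of text, end of human
    ], dim=1)

    # Audio side: 24 kHz mono waveform -> SNAC codes -> flat stream of 7 tokens per frame
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)                                # force mono
    wav = torchaudio.functional.resample(wav, sr, 24000).unsqueeze(0)  # (1, 1, T)
    with torch.no_grad():
        c0, c1, c2 = snac_model.encode(wav)                            # coarse, medium, fine codebooks

    audio_tokens = []
    for i in range(c0.shape[1]):  # exact inverse of redistribute_codes
        audio_tokens += [
            c0[0, i].item(),
            c1[0, 2 * i].item() + 4096,
            c2[0, 4 * i].item() + 2 * 4096,
            c2[0, 4 * i + 1].item() + 3 * 4096,
            c1[0, 2 * i + 1].item() + 4 * 4096,
            c2[0, 4 * i + 2].item() + 5 * 4096,
            c2[0, 4 * i + 3].item() + 6 * 4096,
        ]
    audio_ids = torch.tensor([[t + 128266 for t in audio_tokens]], dtype=torch.int64)

    # Full sequence: text tokens, start-of-audio (128257), audio codes, end-of-audio (128258)
    return torch.cat([
        text_ids,
        torch.tensor([[128257]], dtype=torch.int64),
        audio_ids,
        torch.tensor([[128258]], dtype=torch.int64),
    ], dim=1)
```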
- Repository: [N/A]
- Paper: [N/A]
- Demo: [N/A]
This Llama-based model was trained 2x faster with Unsloth and Hugging Face's TRL library.