🚀 Hypa_Orpheus-3b-0.1-ft (merged 16-bit)
This is a merged, 16-bit fine-tuned version of canopylabs/orpheus-3b-0.1-ft with a memory-efficient footprint. It was optimized with Unsloth and LoRA and is intended for expressive, multilingual text-to-speech (TTS), with a particular strength in low-resource African languages. The model supports:
- Text-to-speech generation
- Speech synthesis for under-represented accents
- Voice cloning and emotional synthesis
- Research on multilingual, low-resource speech AI
📚 Detailed Documentation
Model Overview
This model was trained on a parallel text-speech dataset containing over 300 hours (75k samples) of Nigerian-accented and low-resource-language audio (Igbo, Yoruba, Hausa). A key portion of the dataset comes from AfroVoices transcriptions of real-world YouTube data (labelled as random speakers, roughly 100+ hours).
To preserve and strengthen multilingual capability while avoiding catastrophic forgetting, we included synthetic speech-text data sampled from the original eight Orpheus voices using the default emotion prompts.
The final training set also introduces new speakers, such as:
- Eniola (40 hrs) – female, bold, clear
- Moyo (40 hrs) – female, professional, articulate
- Lovelyn (35 hrs) – female, warm, shy
- Precious (30 hrs) – female, friendly, gentle
The model achieves state-of-the-art performance on low-resource, multilingual TTS across African languages (see the training statistics below).
Base Model Details
The default Orpheus-TTS model released by Canopy Labs supports the following voices and emotions:
- Voices: `tara`, `leah`, `jess`, `leo`, `dan`, `mia`, `zac`, and `zoe`.
- Emotions: `<laugh>`, `<chuckle>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, and `<gasp>`.
Through the generation and inclusion of synthetic data, our fine-tuned model retains these voices and emotions. For more information on the voices and emotions, see the default model's card page.
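As in the base model, a prompt is simply the chosen voice name, a colon, and the text to speak, with emotion tags placed inline where desired. A minimal illustration (the speaker and sentence below are invented for this example; the `"voice: text"` convention matches the inference code later in this card):

```python
# Prompt format: "<voice>: <text>", with optional emotion tags such as <laugh> inline.
voice = "Eniola"                                                    # any supported voice
text = "I did not expect that at all <laugh> but it worked out fine."
prompt = f"{voice}: {text}"
print(prompt)  # -> "Eniola: I did not expect that at all <laugh> but it worked out fine."
```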
Sample Generations
🎧 Listen to samples generated by Hypa Orpheus-TTS
| Text Input | Audio Output | Language | Voice |
|---|---|---|---|
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Emmanuel |
| Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya? | | Hausa | Emmanuel |
| Ina dafa abinci don bakin gobe kuma ina bukatar sanin yadda ake yin ailoli. Za ka iya ba ni girke-gireken matakan daya bayan daya? | | Hausa | Eniola |
| Èmi máa se oúnjẹ fún àwọn àlejò l'ọ́la mo sì nílò láti mọ bí wọn ti ńṣe aioli. Ṣe o lè fún mi ni àwọn ìlànà ìdáná ẹlẹ́sẹẹsẹ? | | Yoruba | Eniola |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Eniola |
| M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka? | | Igbo | Eniola |
| M na-esi nri maka ndị ọbịa echi ma achọ ịmata otú esi esi aioli. Ị nwere ike inye m usoro ntụziaka? | | Igbo | Lovelyn |
| I am cooking for guests tomorrow and need to know how to make aioli. Can you give me a step-by-step recipe. | | English | Lovelyn |
🔧 Technical Details
Training Overview
- Base model: canopylabs/orpheus-3b-0.1-ft
- Training engine: Unsloth + LoRA
- LoRA configuration: r = 1024, alpha = 1024, dropout = 0.0, full attention + FFN adaptation (a configuration sketch follows below)
- Quantization: 4-bit (bnb) during training; the final merged model is memory-efficient
- Total steps: 18,014 (1 epoch)
- Batch size: 1 × 4 (gradient accumulation)
- GPU: A100 40GB (peak VRAM usage ~55%)
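For reference, the configuration above corresponds roughly to an Unsloth LoRA setup like the sketch below. The target module names, learning rate, and output directory are assumptions for illustration, not the authors' exact training script:

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments

# Load the base model in 4-bit (bnb), as stated in the overview
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="canopylabs/orpheus-3b-0.1-ft",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA over all attention + FFN projections (r = alpha = 1024, dropout = 0.0)
model = FastLanguageModel.get_peft_model(
    model,
    r=1024,
    lora_alpha=1024,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed module names for a Llama-style backbone
)

# Hyperparameters matching the overview; learning rate is a placeholder assumption
training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,   # assumption
    output_dir="outputs",
)
```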
Training and validation loss at selected steps:

| Step | Training Loss | Validation Loss |
|---|---|---|
| 5,000 | 3.9496 | 3.8790 |
| 10,000 | 3.8863 | 3.79497 |
| 15,000 | 3.8544 | 3.75323 |
Dataset Overview
- Sources:
  - ✅ Manually aligned YouTube transcriptions (i.e., the random-speaker data)
  - ✅ Synthetic speech generated with Orpheus TTS
  - ✅ Parallel text-audio pairs in African-accented English, Igbo, Yoruba, and Hausa
- Total duration: 300+ hours (multi-accent)
- Key speakers: 45+ unique voices (see the speaker distribution chart below)
We plan to open-source the full dataset soon, as we did with the Hypa_Fleurs project.
📄 License
This model is released under an open-source license (Apache-2.0). Please refer to the LICENSE file for full details.
If you use this model in your work, please cite both this model and the base model canopylabs/orpheus-3b-0.1-ft, as follows:
```bibtex
@misc{canopylabsorpheus,
  title={Orpheus-3b-0.1-ft: A Multilingual Text-to-Speech Model},
  author={Canopy Labs},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/canopylabs/orpheus-3b-0.1-ft}},
  note={Fine-tuned version of Orpheus for expressive TTS}
}

@misc{hypaorpheus4bit,
  title={Hypa_Orpheus-3b-0.1-ft (LoRA-4bit)},
  author={Hypa AI},
  year={2025},
  note={Fine-tuned Orpheus TTS on African languages},
  url={https://huggingface.co/hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-bnb-4bit}
}
```
👏 Acknowledgements
- The Canopy Labs team, for creating and open-sourcing the base model.
- The AfroVoices experts, for translation expertise and high-quality datasets.
- Community support: thanks to all supporters, contributors, and users.
📞 Contact & Contributions
For questions, feedback, or contributions, please open an issue in this repository or contact hypa.ai.ng@gmail.com. Contributions are welcome!
💬 Closing Remarks
By releasing Hypa_Orpheus, we hope to advance research and development in multilingual speech technology for African languages.
Hypa AI remains steadfast in its mission to pioneer intelligent solutions that are not only technologically advanced but also culturally aware, ensuring that the future of AI is as diverse and inclusive as the world it serves.
AfroVoices, a subsidiary of Hypa AI, is dedicated to amplifying African voices, languages, and cultures in the intelligence age. Focused on bridging the digital representation gap, AfroVoices curates datasets and resources for African languages, promoting inclusivity and cultural appreciation in AI technologies. Their mission goes beyond technological innovation, aiming to showcase the richness of African linguistic diversity on the global stage.
💻 Usage Examples
Basic Usage
Unsloth Inference
Install the required packages:
```python
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth
!pip install snac
```
Download the models (the SNAC encoder/decoder and our fine-tuned Hypa_Orpheus):
```python
import torch
from snac import SNAC
from unsloth import FastLanguageModel

dtype = None        # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit",
    max_seq_length=2048,  # Choose any for long context!
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    # token = "hf_...",  # use one if using gated models like meta-llama/Llama-2-7b-hf
)

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")
```
Create the text prompts, choose a voice, and pass them to the model:
```python
prompts = [
    """Mo nífẹ̀ẹ́sí láti ṣe Ph.D sùgbọ́n mi ò ì tíì pinnu ẹ̀ka tí màá ṣe. Àwọn anfaani tí óń dé oríṣiríṣi àwọn olùgbọ́ káàkiri àgbáyé wo ni mo ní""",
]
chosen_voice = "Eniola"  # None for single-speaker

FastLanguageModel.for_inference(model)  # Enable native 2x faster inference
snac_model.to("cpu")                    # Moving snac_model from cuda to cpu

prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

# Tokenize each prompt into input IDs
all_input_ids = []
for prompt in prompts_:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    all_input_ids.append(input_ids)

start_token = torch.tensor([[128259]], dtype=torch.int64)         # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)  # SOH SOT Text EOT EOH
    all_modified_input_ids.append(modified_input_ids)

# Left-pad all sequences to the same length and build attention masks
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
    padding = max_length - modified_input_ids.shape[1]
    padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
    attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
    all_padded_tensors.append(padded_tensor)
    all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)
input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")

generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache=True,
)

token_to_find = 128257    # Start of Audio token
token_to_remove = 128258  # End of Audio (terminal) token

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx + 1:]
else:
    cropped_tensor = generated_ids
mask = cropped_tensor != token_to_remove

processed_rows = []
for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

code_lists = []
for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7  # SNAC expects groups of 7 tokens
    trimmed_row = row[:new_length]
    trimmed_row = [t - 128266 for t in trimmed_row]
    code_lists.append(trimmed_row)

def redistribute_codes(code_list):
    layer_1 = []
    layer_2 = []
    layer_3 = []
    for i in range((len(code_list) + 1) // 7):
        layer_1.append(code_list[7 * i])
        layer_2.append(code_list[7 * i + 1] - 4096)
        layer_3.append(code_list[7 * i + 2] - (2 * 4096))
        layer_3.append(code_list[7 * i + 3] - (3 * 4096))
        layer_2.append(code_list[7 * i + 4] - (4 * 4096))
        layer_3.append(code_list[7 * i + 5] - (5 * 4096))
        layer_3.append(code_list[7 * i + 6] - (6 * 4096))
    codes = [torch.tensor(layer_1).unsqueeze(0),
             torch.tensor(layer_2).unsqueeze(0),
             torch.tensor(layer_3).unsqueeze(0)]
    # codes = [c.to("cuda") for c in codes]
    audio_hat = snac_model.decode(codes)
    return audio_hat

my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)
    my_samples.append(samples)

from IPython.display import display, Audio
if len(prompts) != len(my_samples):
    raise Exception("Number of prompts and samples do not match")
else:
    for i in range(len(my_samples)):
        print(prompts[i])
        samples = my_samples[i]
        display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))

# Clean up to save RAM
del my_samples, samples
```
Standard Inference
Install the required packages:
```python
%%capture
!pip install snac ipywebrtc
```
Download the models (SNAC and Hypa_Orpheus):
```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, AutoTokenizer
from snac import SNAC

# Loads the pre-trained SNAC model and keeps it on the CPU.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model  # .to("cpu")
print("We have loaded the Encoder/Decoder model to the cpu, to use vram - use the gpu for faster inference")

# Loading the Orpheus model and tokenizer, moving the model to the GPU for faster inference
model_name = "hypaai/Hypa_Orpheus-3b-0.1-ft-unsloth-merged_16bit"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
Create the prompts and choose a voice and emotion as needed:
```python
# List of supported voices in Orpheus-TTS
voices = [
    "Eniola", "tara",   # Female, conversational, clear
    "Moyo", "leah",     # Female, warm, gentle
    "Gift", "jess",     # Female, energetic, youthful
    "Prince", "leo",    # Male, authoritative, deep
    "Emmanuel", "dan",  # Male, friendly, casual
    "Cynthia", "mia",   # Female, professional, articulate
    "Kolade", "zac",    # Male, enthusiastic, dynamic
    "Lovelyn", "zoe",   # Female, calm, soothing
]

# List of supported emotion tags in Orpheus-TTS
emotions = [
    "<laugh>",    # Laughter
    "<chuckle>",  # Soft chuckle
    "<sigh>",     # Sighing
    "<cough>",    # Coughing
    "<sniffle>",  # Sniffling
    "<groan>",    # Groaning
    "<yawn>",     # Yawning
    "<gasp>",     # Gasping
]

# Creating prompts
prompts = [
    "Hey there my name is Eniola 9000, and I'm a speech generation model that can sound like a person.",
    # "I've also been taught to understand and produce paralinguistic things like sighing, or chuckling, or yawning!",
    # "I live in San Francisco, and have, uhm let's see, 3 billion 7 hundred ... well, lets just say a lot of parameters.",
]

chosen_voice = "Eniola"  # "tara" # see github for other voices
prompts = [f"{chosen_voice}: " + p for p in prompts]  # Creating the prompts (as a batch)
print(prompts)
```
Tokenize the prompts into input IDs, pad them, and create attention masks:
```python
# Tokenizing each prompt into input IDs.
all_input_ids = []
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    all_input_ids.append(input_ids)

# Adds special tokens to mark the beginning and end of each prompt
start_token = torch.tensor([[128259]], dtype=torch.int64)         # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64)  # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)  # SOH SOT Text EOT EOH
    all_modified_input_ids.append(modified_input_ids)

# Padding all sequences to the same length and creating corresponding attention masks
all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
    padding = max_length - modified_input_ids.shape[1]
    # Left padding
    padded_tensor = torch.cat([torch.full((1, padding), 128263, dtype=torch.int64), modified_input_ids], dim=1)
    attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
    all_padded_tensors.append(padded_tensor)
    all_attention_masks.append(attention_mask)

all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Moving all padded sequences to GPU for faster computation
input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
```
Generate output tokens from the model and parse them into speech:
print("*** Model.generate is slow - see vllm implementation on github for realtime streaming and inference")
print("*** Increase/decrease inference params for more expressive less stable generations")
# Generating Output Tokens
with torch.no_grad():
generated_ids = model.generate(
input_ids=input_ids,
attention_mask=attention_mask,
max_new_tokens=1200,
do_sample=True,
temperature=0.6,
top_p=0.95,
repetition_penalty=1.1,
num_return_sequences=1,
eos_token_id=128258,
)
# Processing Generated Tokens (Parse Output as speech)
token_to_find = 128257 # Start of Audio token (relevant output)
token_to_remove = 128258 # End/ Terminal Token (End of Audio/ relevant output)
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
print(token_indices)
# Slices the tensor to exclude unwanted tokens.
if len(token_indices[1]) > 0:
last_occurrence_idx = token_indices[1][-1].item()
cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
cropped_tensor = generated_ids
# mask = cropped_tensor != token_to_remove
# Storing the cleaned-up token sequences#
processed_rows = []
for row in cropped_tensor:
masked_row = row[row != token_to_remove]
processed_rows.append(masked_row)
# Preparing (Audio Codes) the token sequences for audio decoding by trimming and adjusting token values.
code_lists = []
for row in processed_rows:
row_length = row.size(0) # Determines the length of the token sequence.
new_length = (row_length // 7) * 7 # Ensures the sequence length is a multiple of 7, as required by the decoder.
trimmed_row = row[:new_length]
trimmed_row = [t - 128266 for t in trimmed_row] # Adjusts token values to match the expected input range for the decoder.
code_lists.append(trimmed_row)
Decode the output with the SNAC decoder:
```python
# Processes the token sequences into the format expected by the SNAC decoder:
def redistribute_codes(code_list):
    """Reorganizes the flattened token list into three separate layers, adjusting each token's value to align with the decoder's expectations."""
    layer_1 = []  # Coarsest layer
    layer_2 = []  # Intermediate layer
    layer_3 = []  # Finest layer
    num_groups = (len(code_list) + 1) // 7  # Calculate the number of complete 7-token groups in the code_list
    for i in range(num_groups):
        idx = 7 * i  # Starting index for the current group
        # Layer 1 receives the first token of the group
        layer_1.append(code_list[idx])
        # Layer 2 receives the second token, adjusted by subtracting 4096
        layer_2.append(code_list[idx + 1] - 4096)
        # Layer 3 receives the third and fourth tokens, adjusted by subtracting 8192 and 12288 respectively
        layer_3.append(code_list[idx + 2] - (2 * 4096))
        layer_3.append(code_list[idx + 3] - (3 * 4096))
        # Layer 2 receives the fifth token, adjusted by subtracting 16384
        layer_2.append(code_list[idx + 4] - (4 * 4096))
        # Layer 3 receives the sixth and seventh tokens, adjusted by subtracting 20480 and 24576 respectively
        layer_3.append(code_list[idx + 5] - (5 * 4096))
        layer_3.append(code_list[idx + 6] - (6 * 4096))
    # Convert the lists to PyTorch tensors and add a batch dimension
    codes = [
        torch.tensor(layer_1).unsqueeze(0),  # Shape: (1, len(layer_1))
        torch.tensor(layer_2).unsqueeze(0),  # Shape: (1, len(layer_2))
        torch.tensor(layer_3).unsqueeze(0),  # Shape: (1, len(layer_3))
    ]
    audio_hat = snac_model.decode(codes)  # Decode the structured codes into an audio waveform using the SNAC model
    return audio_hat

my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)  # Generates audio samples from the processed token sequences
    my_samples.append(samples)

# Display audio
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
    raise Exception("Number of prompts and samples do not match")
else:
    for i in range(len(my_samples)):
        print(prompts[i])
        samples = my_samples[i]
        display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
```
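For completeness, encoding is simply the inverse of `redistribute_codes`: SNAC's three codebook layers are flattened into seven tokens per frame with the same 4096-sized offsets, shifted by 128266, and framed by the start-of-audio (128257) and end-of-audio (128258) tokens. The sketch below illustrates that layout under a few assumptions (torchaudio for loading/resampling, `snac_model.encode` run on CPU, and a hypothetical `encode_example` helper); it is not the authors' training preprocessing script:

```python
import torch
import torchaudio
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

def encode_example(speaker, text, wav_path, tokenizer):
    """Illustrative only: build one training-style token sequence from a (speaker, text, audio) triple."""
    # Text side: "<speaker>: <text>" wrapped in the same special tokens used above
    text_ids = tokenizer(f"{speaker}: {text}", return_tensors="pt").input_ids
    text_ids = torch.cat([
        torch.tensor([[128259]], dtype=torch.int64),          # start of human
        text_ids,
        torch.tensor([[128009, 128260]], dtype=torch.int64),  # end of text, end of human
    ], dim=1)

    # Audio side: 24 kHz mono waveform -> SNAC codes -> flat stream of 7 tokens per frame
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)                                # force mono
    wav = torchaudio.functional.resample(wav, sr, 24000).unsqueeze(0)  # (1, 1, T)
    with torch.no_grad():
        c0, c1, c2 = snac_model.encode(wav)                            # coarse, medium, fine codebooks

    audio_tokens = []
    for i in range(c0.shape[1]):  # exact inverse of redistribute_codes
        audio_tokens += [
            c0[0, i].item(),
            c1[0, 2 * i].item() + 4096,
            c2[0, 4 * i].item() + 2 * 4096,
            c2[0, 4 * i + 1].item() + 3 * 4096,
            c1[0, 2 * i + 1].item() + 4 * 4096,
            c2[0, 4 * i + 2].item() + 5 * 4096,
            c2[0, 4 * i + 3].item() + 6 * 4096,
        ]
    audio_ids = torch.tensor([[t + 128266 for t in audio_tokens]], dtype=torch.int64)

    # Full sequence: text tokens, start-of-audio (128257), audio codes, end-of-audio (128258)
    return torch.cat([
        text_ids,
        torch.tensor([[128257]], dtype=torch.int64),
        audio_ids,
        torch.tensor([[128258]], dtype=torch.int64),
    ], dim=1)
```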
- Repository: [N/A]
- Paper: [N/A]
- Demo: [N/A]
This Llama-based model was trained 2x faster with Unsloth and Hugging Face's TRL library.