YarnGPT open-source text-to-speech model: free Nigerian-accented English synthesis for diverse applications

YarnGPT

Developed by saheedniyi

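YarnGPT is a text-to-speech (TTS) model designed to synthesize Nigerian-accented English. It uses pure language modeling to provide high-quality, natural, and culturally relevant speech synthesis for a wide range of applications.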

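Speech synthesis

Transformers

English · License: Apache-2.0 · #Nigerian-accent synthesis · #Pure language-modeling TTS · #Culturally relevant speech generation

Downloads: 124

Released: 12/31/2024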

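Model overview

YarnGPT is a text-to-speech model based on language modeling, designed specifically to generate Nigerian-accented English speech. It needs no external adapters or complex architecture and supports about 11 voices (6 male, 5 female).

Model features

Nigerian accent support

Optimized for generating natural, fluent Nigerian-accented English speech

Multiple voices

Offers 11 voices (6 male, 5 female) to cover different needs

Pure language modeling

Needs no external adapters or complex architecture, which simplifies deployment

Cultural relevance

Produces culturally relevant speech suited to local Nigerian use cases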

Model capabilities

Text-to-speech

Nigerian-accented English synthesis

Multiple voice choices

Long-form text synthesis

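Use cases

Media and entertainment

News broadcasting

Generate news-broadcast speech with a Nigerian accent

Natural, fluent news delivery

Education

Educational content narration

Generate localized speech for educational content

Improve the learning experience of Nigerian students

Customer service

IVR system voice

Generate voice prompts for customer service systems in Nigeria

A friendlier, localized customer experience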

🚀 YarnGPT

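YarnGPT is a text-to-speech (TTS) model that uses pure language modeling, with no external adapters or complex architecture, to synthesize Nigerian-accented English speech. It provides high-quality, natural, and culturally relevant speech synthesis for a variety of applications.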

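🚀 Quick start

Using the model

The model can generate audio on its own, but prompting it with a voice gives better results. About 11 voices are supported by default (6 male, 5 female):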

zainab
jude
tayo
remi
idera (default and best-performing voice)
regina
chinenye
umar
osagie
joke
emma

(These names do not correlate with any tribe or accent.)
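
The voice list above can be captured in a small helper that validates a requested name and falls back to a random voice when none is given. This is a minimal sketch only: `VOICES` and `pick_speaker` are hypothetical names introduced for illustration and are not part of the YarnGPT repository.

```python
import random

# Voice names as listed above; "idera" is the documented default.
VOICES = ("zainab", "jude", "tayo", "remi", "idera", "regina",
          "chinenye", "umar", "osagie", "joke", "emma")


def pick_speaker(name=None, rng=random):
    """Return a valid voice name (hypothetical helper, not a YarnGPT API).

    With no name, pick a voice at random; otherwise normalise and validate it.
    """
    if name is None:
        return rng.choice(VOICES)
    normalised = name.strip().lower()
    if normalised not in VOICES:
        raise ValueError(
            f"unknown voice {name!r}; expected one of {', '.join(VOICES)}"
        )
    return normalised
```

The returned name can be passed as the second argument to `create_prompt` in the examples that follow.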

Prompting YarnGPT

# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git


# install some necessary libraries
!pip install outetts==0.2.3 uroman

#import some important packages 
import os
import re
import json
import torch
import inflect
import random
import uroman as ur
import numpy as np
import torchaudio
import IPython
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizer


# download the wavtokenizer weights and config (to encode and decode the audio)
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

# model path and wavtokenizer weight path (the paths are assumed based on Google colab, a different environment might save the weights to a different location).
hf_path="saheedniyi/YarnGPT"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"

# create the AudioTokenizer object 
audio_tokenizer=AudioTokenizer(
    hf_path,wav_tokenizer_model_path,wav_tokenizer_config_path
)

#load the model weights

model = AutoModelForCausalLM.from_pretrained(hf_path,torch_dtype="auto").to(audio_tokenizer.device)

# your input text
text="Uhm, so, what was the inspiration behind your latest project? Like, was there a specific moment where you were like, 'Yeah, this is it!' Or, you know, did it just kind of, uh, come together naturally over time?"

# create a prompt; the `speaker_name` argument is optional
# available speakers: "idera","emma","jude","osagie","tayo","zainab","joke","regina","remi","umar","chinenye"
# if no speaker is given, one is chosen at random
prompt=audio_tokenizer.create_prompt(text,"idera")

# tokenize the prompt
input_ids=audio_tokenizer.tokenize_prompt(prompt)

# generate output from the model, you can tune the `.generate` parameters as you wish
output  = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4000,
        )

# convert the output to "audio codes"
codes=audio_tokenizer.get_codes(output)

# converts the codes to audio 
audio=audio_tokenizer.get_audio(codes)

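# play the audio (display() renders it even when this is not the last line of a notebook cell)
IPython.display.display(IPython.display.Audio(audio,rate=24000))

# save the audio
torchaudio.save("audio.wav", audio, sample_rate=24000)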
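
The steps above (create a prompt, tokenize it, generate, extract codes, decode to audio) can be wrapped in one function for reuse. The following is a sketch under stated assumptions: `synthesize` and `DEFAULT_GENERATE_KWARGS` are hypothetical names, and the function only chains the `AudioTokenizer` and `model.generate` calls already shown in the example, with the same default parameters.

```python
# Generation defaults copied from the example above.
DEFAULT_GENERATE_KWARGS = {
    "temperature": 0.1,
    "repetition_penalty": 1.1,
    "max_length": 4000,
}


def synthesize(text, audio_tokenizer, model, speaker="idera", **generate_kwargs):
    """Run prompt -> token ids -> generate -> audio codes -> audio.

    Hypothetical convenience wrapper; `audio_tokenizer` and `model` are the
    objects created in the example above. Extra keyword arguments override
    the default generation parameters.
    """
    prompt = audio_tokenizer.create_prompt(text, speaker)
    input_ids = audio_tokenizer.tokenize_prompt(prompt)
    params = {**DEFAULT_GENERATE_KWARGS, **generate_kwargs}
    output = model.generate(input_ids=input_ids, **params)
    codes = audio_tokenizer.get_codes(output)
    return audio_tokenizer.get_audio(codes)
```

With the objects from the example loaded, `audio = synthesize("Good morning, Lagos.", audio_tokenizer, model, speaker="jude")` would return audio that can be saved with `torchaudio.save` at a 24000 Hz sample rate, as shown earlier.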

A simple Nigerian-accented news reader

!git clone https://github.com/saheedniyi02/yarngpt.git

# install some necessary libraries
!pip install outetts uroman trafilatura pydub

import os
import re
import json
import torch
import inflect
import random
import requests
import trafilatura
import uroman as ur
import numpy as np
import torchaudio
import IPython
from pydub import AudioSegment
from pydub.effects import normalize
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer


!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

from yarngpt.audiotokenizer import AudioTokenizer

tokenizer_path="saheedniyi/YarnGPT"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"



audio_tokenizer=AudioTokenizer(
    tokenizer_path,wav_tokenizer_model_path,wav_tokenizer_config_path
)


model = AutoModelForCausalLM.from_pretrained(tokenizer_path,torch_dtype="auto").to(audio_tokenizer.device)


def split_text_into_chunks(text, word_limit=25):
  """
  Split a long web page into chunks of at most `word_limit` words.
  A "." entry is emitted before each sentence so the caller can insert a pause.
  """
  sentences=[sentence.strip() for sentence in text.split('.') if sentence.strip()]
  chunks=[]
  for sentence in sentences:
    # "." marks a sentence boundary; the synthesis loop turns it into silence
    chunks.append(".")
    words=sentence.split(" ")
    num_words=len(words)
    if num_words>word_limit:
      # break long sentences into consecutive pieces of `word_limit` words
      for start_index in range(0,num_words,word_limit):
        chunks.append(" ".join(words[start_index:start_index+word_limit]))
    else:
      chunks.append(sentence)
  return chunks

#Extracting the content of a webpage
page=requests.get("https://punchng.com/expensive-feud-how-burna-boy-cubana-chief-priests-fight-led-to-dollar-rain/")
content=trafilatura.extract(page.text)
chunks=split_text_into_chunks(content)

#Loop over the chunks and build one large `all_codes` list
all_codes=[]
for i,chunk in enumerate(chunks):
  print(i)
  print("\n")
  print(chunk)
  if chunk==".":
    #add silence for 0.25 seconds if we encounter a full stop
    all_codes.extend([453]*20)
  else:
    prompt=audio_tokenizer.create_prompt(chunk,"chinenye")
    input_ids=audio_tokenizer.tokenize_prompt(prompt)
    output  = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4000,
        )
    codes=audio_tokenizer.get_codes(output)
    all_codes.extend(codes)


# Convert the codes to audio, play it, and save it
audio=audio_tokenizer.get_audio(all_codes)
IPython.display.display(IPython.display.Audio(audio,rate=24000))
torchaudio.save("news1.wav", audio, sample_rate=24000)
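
The news reader inserts a pause by repeating code 453 twenty times. The arithmetic behind that pause length can be sketched as follows. The rate of 75 codes per second is an assumption inferred from `frame75` in the WavTokenizer config filename; `silence_codes` and `codes_to_seconds` are hypothetical helpers, not part of the YarnGPT repository.

```python
SILENCE_CODE = 453       # code the example above repeats to produce silence
TOKENS_PER_SECOND = 75   # assumption: inferred from "frame75" in the config name


def silence_codes(seconds, tokens_per_second=TOKENS_PER_SECOND, code=SILENCE_CODE):
    """Return a list of silence codes lasting roughly `seconds` (hypothetical helper)."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return [code] * round(seconds * tokens_per_second)


def codes_to_seconds(num_codes, tokens_per_second=TOKENS_PER_SECOND):
    """Approximate duration, in seconds, of `num_codes` audio codes."""
    return num_codes / tokens_per_second
```

Under this assumption, the 20 codes used in the loop last about 0.27 seconds, close to the 0.25 seconds stated in its comment. A longer pause between paragraphs could be added with `all_codes.extend(silence_codes(0.5))`.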

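✨ Key features

Based on pure language modeling, with no external adapters or complex architecture.
Synthesizes high-quality, natural, and culturally relevant Nigerian-accented English speech.
Supports 11 voices by default.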

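📦 Installation

The code examples include the installation commands they need. The first `pip` command covers the basic prompting example; the second adds `trafilatura` and `pydub` for the news reader: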

!git clone https://github.com/saheedniyi02/yarngpt.git
!pip install outetts==0.2.3 uroman
!pip install outetts uroman trafilatura pydub

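📚 Documentation

Model description

Developed by: Saheedniyi
Model type: Text-to-speech
Language (NLP): English --> Nigerian-accented English
Fine-tuned from: HuggingFaceTB/SmolLM2-360M
Repository: YarnGPT GitHub repository (https://github.com/saheedniyi02/yarngpt)
Paper: in progress.
Demos:
1. Prompt YarnGPT notebook
2. Simple news reader

Intended use

Experimental generation of Nigerian-accented English speech.

Out-of-scope use

The model is not suitable for generating speech in languages other than English or in other accents.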

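🔧 Technical details

Bias, risks, and limitations

The model may not capture the full diversity of Nigerian accents and may exhibit biases inherited from its training dataset. In addition, a large share of the text used to train the model was automatically generated, which may affect its performance.

Recommendations

Users, both direct and downstream, should be made aware of the model's risks, biases, and limitations. Feedback and contributions of diverse training data are encouraged.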

Speech samples

Samples generated by YarnGPT:

Input	Audio	Notes
Uhm, so, what was the inspiration behind your latest project? Like, was there a specific moment where you were like, 'Yeah, this is it!' Or, you know, did it just kind of, uh, come together naturally over time?		(temperature=0.1, repetition_penalty=1.1), voice: idera
Wizkid, Davido, Burna Boy perform at same event in Lagos. This event has sparked many reactions across social media, with fans and critics alike praising the artistes' performances and the rare opportunity to see the three music giants on the same stage.		(temperature=0.1, repetition_penalty=1.1), voice: jude
Since Nigeria became a republic in 1963, 14 individuals have served as head of state of Nigeria under different titles. The incumbent president Bola Tinubu is the nation's 16th head of state.		(temperature=0.1, repetition_penalty=1.1), voice: zainab, the model struggles to pronounce `in 1963`
I visited the President, who has shown great concern for the security of Plateau State, especially considering that just a year ago, our state was in mourning. The President’s commitment to addressing these challenges has been steadfast.		(temperature=0.1, repetition_penalty=1.1), voice: emma
Scientists have discovered a new planet that may be capable of supporting life!		(temperature=0.1, repetition_penalty=1.1)

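Training

Data

Trained on publicly available Nigerian movies and podcasts (using subtitle-audio pairs) and on open-source Nigerian-related audio datasets on Hugging Face.

Preprocessing

Audio files were preprocessed and resampled to 24 kHz, then tokenized with WavTokenizer.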
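
To make the preprocessing step concrete, the sketch below estimates how many samples and audio tokens a clip yields after resampling to 24 kHz. It rests on an assumption that the source does not state directly: a hop of 320 samples per token, inferred from `320_24k` in the checkpoint filename. `tokens_for_clip` is a hypothetical helper, and real resamplers and tokenizers may differ by a sample or a token at the edges.

```python
TARGET_RATE = 24_000   # Hz, the sample rate stated in the preprocessing step
HOP_SAMPLES = 320      # assumption: inferred from "320_24k" in the checkpoint name


def tokens_for_clip(num_samples, source_rate, target_rate=TARGET_RATE, hop=HOP_SAMPLES):
    """Estimate (resampled_samples, audio_tokens) for a clip (hypothetical helper).

    `num_samples` is the clip length at `source_rate`. The resampled length is
    rounded to the nearest sample; tokens are counted as whole hops.
    """
    if num_samples < 0 or source_rate <= 0:
        raise ValueError("num_samples must be >= 0 and source_rate > 0")
    resampled = (num_samples * target_rate + source_rate // 2) // source_rate
    return resampled, resampled // hop
```

For example, a 10-second clip recorded at 44.1 kHz has 441000 samples; `tokens_for_clip(441000, 44100)` estimates 240000 samples at 24 kHz and 750 tokens, which is 75 tokens per second of audio.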

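Training hyperparameters

Epochs: 5
Batch size: 4
Scheduler: linear schedule with warmup over the first 4 epochs, then linear decay to zero over the last epoch
Optimizer: AdamW (betas=(0.9, 0.95), weight_decay=0.01)
Learning rate: 1 × 10^-3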
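
The scheduler described above can be sketched as a plain function of the training step. This is an illustration, not the author's training code: it assumes the warmup starts from zero and that the schedule is updated once per optimizer step, neither of which the source states. `learning_rate` is a hypothetical name.

```python
def learning_rate(step, steps_per_epoch, epochs=5, warmup_epochs=4, peak_lr=1e-3):
    """Linear warmup to `peak_lr`, then linear decay to zero (illustrative sketch).

    `step` counts optimizer steps from 0 to `steps_per_epoch * epochs`.
    """
    total_steps = steps_per_epoch * epochs
    warmup_steps = steps_per_epoch * warmup_epochs
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if step < warmup_steps:
        # warmup phase: rise linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    decay_steps = total_steps - warmup_steps
    if decay_steps == 0:
        return peak_lr
    # decay phase: fall linearly from peak_lr to 0
    return peak_lr * (total_steps - step) / decay_steps
```

With 200 steps per epoch, the rate reaches the peak of 1e-3 at step 800 (end of epoch 4) and returns to zero at step 1000. The shape matches what `get_linear_schedule_with_warmup` in the `transformers` library produces, although the source does not say which implementation was used.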

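Hardware

GPU: 1 A100 (Google Colab: 50 hours)

Software

Training framework: PyTorch

Future improvements

Scale up the model size and add more human-annotated or human-reviewed training data.
Wrap the model in an API endpoint.
Add support for local Nigerian languages.
Implement voice cloning.
Potentially expand into a speech-to-speech assistant model.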

📄 License

This project is licensed under the Apache 2.0 license.

📚 Citation

BibTeX

@misc{yarngpt2025,
  author = {Saheed Azeez},
  title = {YarnGPT: Nigerian-Accented English Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/saheedniyi/YarnGPT}
}

APA

Saheed Azeez. (2025). YarnGPT: Nigerian-Accented English Text-to-Speech Model. Hugging Face. Available at: https://huggingface.co/saheedniyi/YarnGPT
