YarnGPT open-source text-to-speech model: free synthesis of Nigerian-accented English for a range of applications


Yarngpt

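Developed by saheedniyi

YarnGPT is a text-to-speech (TTS) model designed to synthesize Nigerian-accented English. It uses pure language modeling to provide high-quality, natural, and culturally relevant speech synthesis for a variety of applications.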

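Speech synthesis

Transformers

English · License: Apache-2.0 · #Nigerian-accent synthesis #pure-language-modeling TTS #culturally relevant speech generation

Downloads: 124

Released: 12/31/2024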

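Model overview

YarnGPT is a language-modeling-based text-to-speech model built specifically to generate Nigerian-accented English speech. It needs no external adapters or complex architectures and supports about 11 voices (6 male, 5 female).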

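Model features

Nigerian accent support

Optimized specifically for generating natural, fluent Nigerian-accented English speech.

Multiple voices

Offers 11 voices (6 male, 5 female) to choose from.

Pure language modeling

Requires no external adapters or complex architectures, which simplifies deployment.

Cultural relevance

Speech output is culturally relevant and suited to local Nigerian use cases.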

Model capabilities

Text-to-speech

Nigerian-accented English synthesis

Multiple voice options

Long-form text synthesis

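Use cases

Media and entertainment

News reading

Generate news-reading audio in a Nigerian accent.

Natural, fluent news delivery.

Education

Reading educational content aloud

Generate localized audio for educational content.

Improve the learning experience for Nigerian students.

Customer service

IVR system voices

Generate voice prompts for customer service systems in Nigeria.

A friendlier, localized customer experience.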

🚀 YarnGPT

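YarnGPT is a text-to-speech (TTS) model that uses pure language modeling, without external adapters or complex architectures, to synthesize Nigerian-accented English speech. It offers high-quality, natural, and culturally relevant speech synthesis for a variety of applications.

🚀 Quick start

How to use the model

The model can generate audio on its own, but it works better when prompted with a voice. About 11 voices are supported by default (6 male, 5 female):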

zainab
jude
tayo
remi
idera (default and best-performing voice)
regina
chinenye
umar
osagie
joke
emma

(These names are not associated with any particular tribe or accent.)
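The voice list can be captured in a small helper that validates a requested speaker and falls back to a random one, mirroring the behaviour the usage notes describe when no speaker is passed. This is only a sketch: `VOICES` and `pick_speaker` are hypothetical names and are not part of the YarnGPT repository.

```python
import random

# Voice names as listed in the model card; the constant itself is illustrative.
VOICES = ["zainab", "jude", "tayo", "remi", "idera", "regina",
          "chinenye", "umar", "osagie", "joke", "emma"]

def pick_speaker(name=None, rng=random):
    """Return a valid voice name; choose one at random if none is given."""
    if name is None:
        return rng.choice(VOICES)
    name = name.strip().lower()
    if name not in VOICES:
        raise ValueError(f"unknown speaker {name!r}; expected one of {VOICES}")
    return name
```

The returned name can then be passed as the second argument of `audio_tokenizer.create_prompt`.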

Prompting YarnGPT

# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git


# install some necessary libraries
!pip install outetts==0.2.3 uroman

#import some important packages 
import os
import re
import json
import torch
import inflect
import random
import uroman as ur
import numpy as np
import torchaudio
import IPython
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizer


# download the wavtokenizer weights and config (to encode and decode the audio)
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

# model path and wavtokenizer weight path (the paths are assumed based on Google colab, a different environment might save the weights to a different location).
hf_path="saheedniyi/YarnGPT"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"

# create the AudioTokenizer object 
audio_tokenizer=AudioTokenizer(
    hf_path,wav_tokenizer_model_path,wav_tokenizer_config_path
)

#load the model weights

model = AutoModelForCausalLM.from_pretrained(hf_path,torch_dtype="auto").to(audio_tokenizer.device)

# your input text
text="Uhm, so, what was the inspiration behind your latest project? Like, was there a specific moment where you were like, 'Yeah, this is it!' Or, you know, did it just kind of, uh, come together naturally over time?"

# create a prompt; `create_prompt` takes an optional `speaker_name` parameter
# possible speakers: "idera","emma","jude","osagie","tayo","zainab","joke","regina","remi","umar","chinenye"
# if no speaker is selected, one is chosen at random
prompt=audio_tokenizer.create_prompt(text,"idera")

# tokenize the prompt
input_ids=audio_tokenizer.tokenize_prompt(prompt)

# generate output from the model, you can tune the `.generate` parameters as you wish
output  = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4000,
        )

# convert the output to "audio codes"
codes=audio_tokenizer.get_codes(output)

# converts the codes to audio 
audio=audio_tokenizer.get_audio(codes)

# play the audio
IPython.display.Audio(audio,rate=24000)

# save the audio 
torchaudio.save("audio.wav", audio, sample_rate=24000)
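The steps above (create a prompt, tokenize it, generate, extract codes, decode to audio) can be wrapped in one function so that different texts and voices are easier to try. This is a minimal sketch: `synthesize` is a hypothetical name, and it assumes `audio_tokenizer` and `model` were created as in the example above. It only calls the methods already used there.

```python
def synthesize(text, audio_tokenizer, model, speaker="idera",
               temperature=0.1, repetition_penalty=1.1, max_length=4000):
    """Run the prompt -> tokens -> codes -> audio steps for one piece of text."""
    # Build and tokenize the prompt for the chosen voice.
    prompt = audio_tokenizer.create_prompt(text, speaker)
    input_ids = audio_tokenizer.tokenize_prompt(prompt)
    # Generate with the same settings as the example; tune as needed.
    output = model.generate(
        input_ids=input_ids,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        max_length=max_length,
    )
    # Convert the generated tokens to audio codes, then to a waveform.
    codes = audio_tokenizer.get_codes(output)
    return audio_tokenizer.get_audio(codes)
```

A call such as `audio = synthesize("Good morning, Lagos!", audio_tokenizer, model, speaker="jude")` would return a waveform that can be played or saved in the same way as above.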

A simple Nigerian-accented news reader

!git clone https://github.com/saheedniyi02/yarngpt.git

# install some necessary libraries
!pip install outetts uroman trafilatura pydub

import os
import re
import json
import torch
import inflect
import random
import requests
import trafilatura
import uroman as ur
import numpy as np
import torchaudio
import IPython
from pydub import AudioSegment
from pydub.effects import normalize
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer


!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

from yarngpt.audiotokenizer import AudioTokenizer

tokenizer_path="saheedniyi/YarnGPT"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"



audio_tokenizer=AudioTokenizer(
    tokenizer_path,wav_tokenizer_model_path,wav_tokenizer_config_path
       )


model = AutoModelForCausalLM.from_pretrained(tokenizer_path,torch_dtype="auto").to(audio_tokenizer.device)


def split_text_into_chunks(text, word_limit=25):
  """
  Split long text into chunks of at most `word_limit` words.
  A "." entry is emitted before each sentence as a pause marker.
  """
  sentences=[sentence.strip() for sentence in text.split('.') if sentence.strip()]
  chunks=[]
  for sentence in sentences:
    # pause marker; the generation loop turns it into silence
    chunks.append(".")
    sentence_splitted=sentence.split(" ")
    num_words=len(sentence_splitted)
    start_index=0
    if num_words>word_limit:
      while start_index<num_words:
        end_index=min(num_words,start_index+word_limit)
        chunks.append(" ".join(sentence_splitted[start_index:end_index]))
        start_index=end_index
    else:
      chunks.append(sentence)
  return chunks

#Extracting the content of a webpage
page=requests.get("https://punchng.com/expensive-feud-how-burna-boy-cubana-chief-priests-fight-led-to-dollar-rain/")
content=trafilatura.extract(page.text)
chunks=split_text_into_chunks(content)

#Loop over the chunks and build one large `all_codes` list
all_codes=[]
for i,chunk in enumerate(chunks):
  print(i)
  print("\n")
  print(chunk)
  if chunk==".":
    #add silence for 0.25 seconds if we encounter a full stop
    all_codes.extend([453]*20)
  else:
    prompt=audio_tokenizer.create_prompt(chunk,"chinenye")
    input_ids=audio_tokenizer.tokenize_prompt(prompt)
    output  = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4000,
        )
    codes=audio_tokenizer.get_codes(output)
    all_codes.extend(codes)


# Converting to audio
audio=audio_tokenizer.get_audio(all_codes)
IPython.display.Audio(audio,rate=24000)
torchaudio.save("news1.wav", audio, sample_rate=24000)
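The news reader inserts 20 copies of code 453 for every full stop and describes this as roughly 0.25 seconds of silence. The sketch below makes that arithmetic explicit. It assumes a rate of 75 audio codes per second, which is inferred from "frame75" in the WavTokenizer config filename rather than stated in the model card; `silence_codes` and `codes_duration` are hypothetical helper names.

```python
SILENCE_CODE = 453        # code the news-reader example uses for silence
TOKENS_PER_SECOND = 75    # assumption, inferred from "frame75" in the config filename

def silence_codes(seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Return a list of silence codes approximating the requested pause."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return [SILENCE_CODE] * round(seconds * tokens_per_second)

def codes_duration(codes, tokens_per_second=TOKENS_PER_SECOND):
    """Estimate how long a list of audio codes lasts, in seconds."""
    return len(codes) / tokens_per_second
```

Under that assumption, the 20 codes used in the loop last about 0.27 seconds, and `all_codes.extend(silence_codes(0.5))` would add a longer pause between sentences.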

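✨ Key features

Built on pure language modeling, with no external adapters or complex architectures.
Synthesizes high-quality, natural, and culturally relevant Nigerian-accented English speech.
Supports 11 voices by default.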

📦 Installation

The code examples include the relevant installation commands:

!git clone https://github.com/saheedniyi02/yarngpt.git
!pip install outetts==0.2.3 uroman
!pip install outetts uroman trafilatura pydub

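📚 Documentation

Model description

Developed by: Saheedniyi
Model type: Text-to-speech
Language(s) (NLP): English --> Nigerian-accented English
Fine-tuned from: HuggingFaceTB/SmolLM2-360M
Repository: YarnGPT GitHub repository (https://github.com/saheedniyi02/yarngpt)
Paper: in progress.
Demos:
1. Prompt YarnGPT notebook
2. Simple news reader

Uses

Experimental generation of Nigerian-accented English speech.

Out-of-scope use

The model is not suitable for generating speech in languages other than English or in other accents.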

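🔧 Technical details

Bias, risks, and limitations

The model may not capture the full diversity of Nigerian accents and may exhibit biases inherited from its training dataset. In addition, much of the text used to train the model was automatically generated, which may affect its performance.

Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Feedback and contributions of diverse training data are encouraged.

Speech samples

Listen to samples generated by YarnGPT: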

Input	Audio	Notes
Uhm, so, what was the inspiration behind your latest project? Like, was there a specific moment where you were like, 'Yeah, this is it!' Or, you know, did it just kind of, uh, come together naturally over time?		(temperature=0.1, repetition_penalty=1.1), voice: idera
Wizkid, Davido, Burna Boy perform at same event in Lagos. This event has sparked many reactions across social media, with fans and critics alike praising the artistes' performances and the rare opportunity to see the three music giants on the same stage.		(temperature=0.1, repetition_penalty=1.1), voice: jude
Since Nigeria became a republic in 1963, 14 individuals have served as head of state of Nigeria under different titles. The incumbent president Bola Tinubu is the nation's 16th head of state.		(temperature=0.1, repetition_penalty=1.1), voice: zainab; the model struggles to pronounce `in 1963`
I visited the President, who has shown great concern for the security of Plateau State, especially considering that just a year ago, our state was in mourning. The President’s commitment to addressing these challenges has been steadfast.		(temperature=0.1, repetition_penalty=1.1), voice: emma
Scientists have discovered a new planet that may be capable of supporting life!		(temperature=0.1, repetition_penalty=1.1)

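Training

Data

Trained on publicly available Nigerian movies and podcasts (using subtitle-audio pairs) and on open-source Nigerian-related audio datasets from Hugging Face.

Preprocessing

Audio files were preprocessed and resampled to 24 kHz, then tokenized using WavTokenizer.

Training hyperparameters

Number of epochs: 5
Batch size: 4
Scheduler: linear schedule with warmup over the first 4 epochs, then linear decay to zero over the final epoch
Optimizer: AdamW (betas=(0.9, 0.95), weight_decay=0.01)
Learning rate: 1 × 10^-3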
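The scheduler description can be read as a piecewise-linear function of training progress. The sketch below is one interpretation, not the author's training code: it assumes the learning rate starts at zero, reaches the stated peak of 1e-3 at the end of epoch 4, decays linearly to zero by the end of epoch 5, and is updated once per optimizer step. `lr_at` is a hypothetical name.

```python
PEAK_LR = 1e-3       # learning rate stated in the model card
WARMUP_EPOCHS = 4    # warmup length stated in the model card
TOTAL_EPOCHS = 5     # total epochs stated in the model card

def lr_at(step, steps_per_epoch):
    """Learning rate at a 0-indexed optimizer step under the assumed schedule."""
    warmup_steps = WARMUP_EPOCHS * steps_per_epoch
    total_steps = TOTAL_EPOCHS * steps_per_epoch
    if step < warmup_steps:
        # Linear warmup from 0 to PEAK_LR (starting at 0 is an assumption).
        return PEAK_LR * step / warmup_steps
    # Linear decay from PEAK_LR to 0 over the remaining steps.
    remaining = max(total_steps - step, 0)
    return PEAK_LR * remaining / (total_steps - warmup_steps)
```

With 100 steps per epoch, for example, the rate is half the peak at step 200 (midway through warmup) and again at step 450 (midway through decay).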

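Hardware

GPU: 1 × A100 (Google Colab, 50 hours)

Software

Training framework: PyTorch

Future improvements

Scale up the model and increase the amount of human-annotated or human-reviewed training data.
Wrap the model in an API endpoint.
Add support for local Nigerian languages.
Add voice cloning.
Potentially extend it into a speech-to-speech assistant model.

📄 License

This project is licensed under the Apache 2.0 license.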

📚 Citation

BibTeX

@misc{yarngpt2025,
  author = {Saheed Azeez},
  title = {YarnGPT: Nigerian-Accented English Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/saheedniyi/YarnGPT}
}

APA

Saheed Azeez. (2025). YarnGPT: Nigerian-Accented English Text-to-Speech Model. Hugging Face. Available at: https://huggingface.co/saheedniyi/YarnGPT
