YarnGPT-local: an open-source text-to-speech model supporting three languages with high-quality speech synthesis

Yarngpt Local

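Developed by saheedniyi

YarnGPT-local is a text-to-speech model designed for Yoruba, Igbo, and Hausa. It uses pure language modeling to produce high-quality, natural, and culturally appropriate speech.

Speech synthesis

Transformers

Other: #Nigerian dialect TTS #Multi-voice synthesis #Pure language modeling

Downloads: 20

Released: 1/15/2025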

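Model Overview

The model focuses on synthesizing speech in Nigeria's major languages (Yoruba, Igbo, and Hausa) without external adapters or complex architectures, and suits a wide range of applications.

Model Features

Multilingual support

Optimized specifically for Nigeria's three major languages (Yoruba, Igbo, and Hausa)

Diverse voice styles

Supports 10 different voices (male and female)

Pure language modeling

High-quality speech synthesis without external adapters or complex architectures

Cultural fit

Generated speech is natural and fits the local cultural context

Model Capabilities

Text-to-speech synthesis

Multilingual speech generation

Voice style control

Use Cases

News reading

Local-language news broadcasting

Converts news text into Yoruba, Igbo, or Hausa speech

Generates natural, fluent news-reading audio

Education

Language-learning aid

Provides reference pronunciation examples for language learners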

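🚀 YarnGPT-local

YarnGPT-local is a text-to-speech (TTS) model that uses pure language modeling, without external adapters or complex architectures, to synthesize Yoruba, Igbo, and Hausa speech. It offers high-quality, natural, and culturally relevant speech synthesis for a variety of applications.

🚀 Quick Start

How to use (on Google Colab)

The model can generate audio on its own, but it is better to prompt it with a voice. About 10 voices are supported by default: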

hausa_female1
hausa_female2
hausa_male1
hausa_male2
igbo_female1
igbo_female2
igbo_male2
yoruba_female1
yoruba_female2
yoruba_male2
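
The voice names above follow a `<language>_<gender><index>` pattern. A minimal sketch of a lookup that lists the voices available for a language, assuming only the ten names listed here; `voices_for` is a hypothetical helper and not part of the YarnGPT repository:

```python
# Default voices as listed in this model card.
DEFAULT_VOICES = [
    "hausa_female1", "hausa_female2", "hausa_male1", "hausa_male2",
    "igbo_female1", "igbo_female2", "igbo_male2",
    "yoruba_female1", "yoruba_female2", "yoruba_male2",
]


def voices_for(language, gender=None):
    """Return the default voices for a language, optionally filtered by gender.

    Hypothetical helper for illustration only.
    """
    prefix = f"{language.lower()}_"
    if gender is not None:
        # "male" is a substring of "female", so match on the full
        # "<language>_<gender>" prefix rather than on a substring.
        prefix += gender.lower()
    matches = [voice for voice in DEFAULT_VOICES if voice.startswith(prefix)]
    if not matches:
        raise ValueError(f"no default voice for language={language!r}, gender={gender!r}")
    return matches


print(voices_for("yoruba"))
print(voices_for("igbo", "male"))
```

Any name returned this way can be passed as the optional `speaker_name` argument of `create_prompt` in the snippet that follows.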

Prompt YarnGPT-local

# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git


# install some necessary libraries
!pip install outetts==0.2.3 uroman

#import some important packages 
import os
import re
import json
import torch
import inflect
import random
import uroman as ur
import numpy as np
import torchaudio
import IPython
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizerForLocal


# download the wavtokenizer weights and config (to encode and decode the audio)
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

# model path and wavtokenizer weight path (the paths are assumed based on Google colab, a different environment might save the weights to a different location).
hf_path="saheedniyi/YarnGPT-local"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"

# create the AudioTokenizer object 
audio_tokenizer=AudioTokenizerForLocal(
    hf_path,wav_tokenizer_model_path,wav_tokenizer_config_path
)

#load the model weights

model = AutoModelForCausalLM.from_pretrained(hf_path,torch_dtype="auto").to(audio_tokenizer.device)

# your input text
text="Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un."

# creating a prompt, when creating a prompt, there is an optional `speaker_name` parameter
prompt=audio_tokenizer.create_prompt(text,"yoruba","yoruba_male2")

# tokenize the prompt
input_ids=audio_tokenizer.tokenize_prompt(prompt)

# generate output from the model, you can tune the `.generate` parameters as you wish
output  = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            num_beams=4,
            max_length=4000,
        )

# convert the output to "audio codes"
codes=audio_tokenizer.get_codes(output)

# converts the codes to audio 
audio=audio_tokenizer.get_audio(codes)

# play the audio
IPython.display.Audio(audio,rate=24000)

# save the audio
torchaudio.save("audio.wav", audio, sample_rate=24000)
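
The snippet above hard-codes paths under Colab's `/content` directory and notes that other environments may save the files elsewhere. A minimal sketch of resolving both paths from a download directory, assuming the two files were saved side by side under the names used above; `resolve_wavtokenizer_paths` and its `must_exist` flag are hypothetical:

```python
from pathlib import Path

# File names taken from the download commands and paths in the snippet above.
CONFIG_NAME = "wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
CHECKPOINT_NAME = "wavtokenizer_large_speech_320_24k.ckpt"


def resolve_wavtokenizer_paths(download_dir=".", must_exist=True):
    """Return (config_path, checkpoint_path) inside download_dir.

    Hypothetical helper: it only builds and checks paths, it downloads nothing.
    """
    base = Path(download_dir)
    config_path = base / CONFIG_NAME
    checkpoint_path = base / CHECKPOINT_NAME
    if must_exist:
        missing = [p.name for p in (config_path, checkpoint_path) if not p.is_file()]
        if missing:
            raise FileNotFoundError(f"missing in {base}: {', '.join(missing)}")
    return str(config_path), str(checkpoint_path)
```

On Colab, `resolve_wavtokenizer_paths("/content")` should give the same two strings that the snippet assigns by hand.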

Simple news reader for local languages

# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git


# install some necessary libraries
!pip install outetts==0.2.3 uroman trafilatura pydub


#import important packages
import os
import re
import json
import torch
import inflect
import random
import requests
import trafilatura
import uroman as ur
import numpy as np
import torchaudio
import IPython
from pydub import AudioSegment
from pydub.effects import normalize
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizer,AudioTokenizerForLocal

# download the `WavTokenizer` files
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

tokenizer_path="saheedniyi/YarnGPT-local"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"


audio_tokenizer=AudioTokenizerForLocal(
    tokenizer_path,wav_tokenizer_model_path,wav_tokenizer_config_path
       )

model = AutoModelForCausalLM.from_pretrained(tokenizer_path,torch_dtype="auto").to(audio_tokenizer.device)

# Split text into chunks of at most `word_limit` words
def split_text_into_chunks(text, word_limit=25):
    sentences = [sentence.strip() for sentence in text.split('.') if sentence.strip()]
    chunks = []
    for sentence in sentences:
        # a lone "." marks a pause; the generation loop turns it into silence
        chunks.append(".")
        words = sentence.split(" ")
        # slice the sentence into consecutive groups of at most `word_limit` words
        for start_index in range(0, len(words), word_limit):
            chunks.append(" ".join(words[start_index:start_index + word_limit]))
    return chunks

# reduce the speed of the audio; output for the local languages is often too fast
def speed_change(sound, speed=0.9):
    # Manually override the frame_rate. This tells the computer how many
    # samples to play per second
    sound_with_altered_frame_rate = sound._spawn(sound.raw_data, overrides={
        "frame_rate": int(sound.frame_rate * speed)
    })
    # convert the sound with altered frame rate to a standard frame rate
    # so that regular playback programs will work right. They often only
    # know how to play audio at standard frame rate (like 44.1k)
    return sound_with_altered_frame_rate.set_frame_rate(sound.frame_rate)


page=requests.get("https://alaroye.org/a-maa-too-fo-ipinle-ogun-mo-omo-egbe-okunkun-meje-lowo-ti-te-bayii-omolola/")
content=trafilatura.extract(page.text)
chunks=split_text_into_chunks(content)


all_codes=[]
for i,chunk in enumerate(chunks):
  print(i)
  print("\n")
  print(chunk)
  if chunk==".":
    #add silence for 0.5 seconds if we encounter a full stop
    all_codes.extend([453]*38)
  else:
    prompt=audio_tokenizer.create_prompt(chunk,lang="yoruba",speaker_name="yoruba_female2")
    input_ids=audio_tokenizer.tokenize_prompt(prompt)
    output  = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4000,
            num_beams=5,
        )
    codes=audio_tokenizer.get_codes(output)
    all_codes.extend(codes)


audio=audio_tokenizer.get_audio(all_codes)

#display the output
IPython.display.Audio(audio,rate=24000)

#save audio
torchaudio.save("news1.wav", audio, sample_rate=24000)

#convert file to an `AudioSegment` object for further processing
audio_dub=AudioSegment.from_file("news1.wav")

# reduce audio speed; note that this also reduces quality
speed_change(audio_dub,0.9)
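
The news-reader loop above inserts 38 copies of code 453 for each full stop to get about half a second of silence. The count appears to follow from the tokenizer's frame rate: the WavTokenizer config name contains `frame75` and `75token`, which suggests 75 audio tokens per second, so 0.5 s × 75 = 37.5, rounded up to 38. A small sketch of that arithmetic, assuming the 75 tokens-per-second rate and reusing code 453 from the snippet; `silence_codes` is a hypothetical helper:

```python
import math

TOKENS_PER_SECOND = 75  # assumption, read off "frame75" in the config file name
SILENCE_CODE = 453      # code used for silence in the news-reader snippet


def silence_codes(seconds):
    """Return a list of silence codes covering at least `seconds` of audio."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    count = math.ceil(seconds * TOKENS_PER_SECOND)
    return [SILENCE_CODE] * count


print(len(silence_codes(0.5)))
```

With this helper, a longer pause between paragraphs could be written as `all_codes.extend(silence_codes(1.0))` instead of a hand-computed count.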

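✨ Key Features

Speech synthesis in three languages: Yoruba, Igbo, and Hausa.
Pure language modeling, with no external adapters or complex architectures.
A selection of voices.

📚 Documentation

Model Description

Developed by: Saheedniyi
Model type: Text-to-speech
Languages: Yoruba, Igbo, Hausa
Fine-tuned from: HuggingFaceTB/SmolLM2-360M
Repository: YarnGPT Github Repository
Paper: in progress.
Demos:
1. Prompt YarnGPT-local notebook
2. Simple news reader: YarnGPT-local

Uses

Experimental generation of Yoruba, Igbo, and Hausa speech.

Out-of-Scope Use

The model is not suitable for generating speech in languages other than Yoruba, Igbo, and Hausa.

Bias, Risks, and Limitations

The model may not capture the full diversity of Nigerian accents and may show biases inherited from its training data.
Generated audio is sometimes very fast and may need post-processing.
The model does not take "intonation" into account, which sometimes leads to mispronounced words.
The model does not respond to some prompts.

Recommendations

Users (both direct and downstream) should be aware of the model's risks, biases, and limitations. Feedback and more diverse training data are encouraged.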

Voice Samples

Listen to samples generated by YarnGPT:

Input	Audio	Notes
Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un		(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: yoruba_male2
Iwadii fihan pe ọkan lara awọn eeyan meji yii lo ṣee si ja sinu tanki epo disu naa lasiko to n ṣiṣẹ lọwọ.		(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: yoruba_female1
Shirun da gwamnati mai ci yanzu ta yi wajen kin bayani a akan halin da ake ciki a game da batun kidayar shi ne ya janyo wannan zargi da jam'iyyar ta Labour ta yi.		(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: hausa_male2
A lokuta da dama yakan fito a matsayin jarumin da ke taimaka wa babban jarumi, kodayake a wasu fina-finan yakan fito a matsayin babban jarumi.		(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: hausa_female1
Amụma ndị ọzọ o buru gụnyere inweta ihe zuru oke, ịmụta ụmụaka nye ndị na-achọ nwa		(temperature=0.1, repetition_penalty=1.1,num_beams=4), voice: igbo_female1

Training

Data

Trained on open-source Yoruba, Igbo, and Hausa datasets.

Preprocessing

Audio files were preprocessed, resampled to 24 kHz, and tokenized with WavTokenizer.
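
To get a feel for how long a tokenized clip is, here is a rough sketch of the sample-to-token arithmetic. It assumes a hop of 320 samples per token at 24 kHz, which is inferred only from the checkpoint file name `wavtokenizer_large_speech_320_24k.ckpt` and is not confirmed by the model card; the exact count produced by WavTokenizer may differ slightly at clip boundaries. `estimate_audio_tokens` is a hypothetical helper:

```python
TARGET_SAMPLE_RATE = 24_000  # resampling target stated in the model card
SAMPLES_PER_TOKEN = 320      # assumption, inferred from the checkpoint file name


def estimate_audio_tokens(num_samples, source_rate):
    """Estimate the number of audio tokens for a clip after resampling to 24 kHz."""
    if num_samples < 0 or source_rate <= 0:
        raise ValueError("num_samples must be >= 0 and source_rate must be > 0")
    # Resampling rescales the sample count by target_rate / source_rate.
    resampled = (num_samples * TARGET_SAMPLE_RATE) // source_rate
    return resampled // SAMPLES_PER_TOKEN


# A 2-second clip recorded at 44.1 kHz:
print(estimate_audio_tokens(2 * 44_100, 44_100))
```

Under this assumption one second of audio is about 75 tokens, so a `max_length` of 4000 in `generate` caps the prompt plus the audio at well under a minute.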

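Training Hyperparameters

Epochs: 5
Batch size: 4
Scheduler: linear schedule, warming up over the first 4 epochs and decaying linearly to zero over the last epoch
Optimizer: AdamW (betas=(0.9, 0.95), weight_decay=0.01)
Learning rate: 1 × 10^-3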
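
The scheduler described above can be sketched as a plain function of the training step. This is one reading of the listed hyperparameters and not the author's training code: the rate rises linearly from zero to the peak over the first 4 epochs, then falls linearly to zero over the fifth. `steps_per_epoch` is a hypothetical parameter, since the dataset size is not given:

```python
def learning_rate(step, steps_per_epoch, peak_lr=1e-3, warmup_epochs=4, total_epochs=5):
    """Linear warmup to peak_lr, then linear decay to zero (illustrative sketch)."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie within the training run")
    if step < warmup_steps:
        # warmup phase: scale linearly from 0 up to peak_lr
        return peak_lr * step / warmup_steps
    # decay phase: scale linearly from peak_lr down to 0 at the final step
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


# Rate at the start, mid-warmup, the peak, mid-decay, and the end (100 steps per epoch)
for s in (0, 200, 400, 450, 500):
    print(s, learning_rate(s, steps_per_epoch=100))
```

In PyTorch the same shape could be passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplier by dividing the result by `peak_lr`.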

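Hardware

GPU: 1 × A100 (Google Colab: 30 hours)

Software

Training framework: PyTorch

Future Improvements

Scale up the model and the training data.
Wrap the model in an API endpoint.
Add voice cloning.
Possibly extend it into a speech-to-speech assistant model.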

Citation (optional)

BibTeX:

@misc{yarngpt2025,
  author = {Saheed Azeez},
  title = {YarnGPT-local: Nigerian languages Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/saheedniyi/YarnGPT-local}
}

APA:

Saheed Azeez. (2025). YarnGPT-local: Nigerian languages Text-to-Speech Model. Hugging Face. Available at: https://huggingface.co/saheedniyi/YarnGPT-local