YarnGPT-local: an open-source text-to-speech model for three Nigerian languages with high-quality speech synthesis

Yarngpt Local

Developed by saheedniyi

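YarnGPT-local is a text-to-speech model built for Yoruba, Igbo, and Hausa. It uses pure language modeling to produce high-quality, natural, and culturally appropriate speech.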

Task: Speech synthesis

Library: Transformers

Tags: #Nigerian-language TTS #Multi-voice synthesis #Pure language modeling

Downloads: 20

Released: 1/15/2025

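Model overview

The model focuses on synthesizing speech in Nigeria's major languages (Yoruba, Igbo, and Hausa) without external adapters or complex architectures, and is suited to a wide range of applications.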

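Model features

Multilingual support

Optimized specifically for Nigeria's three major languages (Yoruba, Igbo, and Hausa)

Multiple voices

Supports 10 different voices (both male and female)

Pure language modeling

Achieves high-quality speech synthesis without external adapters or complex architectures

Cultural fit

Generated speech sounds natural and fits the local cultural context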

Model capabilities

Text-to-speech synthesis

Multilingual speech generation

Voice style control

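Use cases

News reading

Local-language news broadcasts

Converts news text into Yoruba, Igbo, or Hausa speech

Produces natural, fluent news narration

Educational applications

Language-learning aid

Provides standard pronunciation examples for language learners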

🚀 YarnGPT-local

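YarnGPT-local is a text-to-speech (TTS) model that uses pure language modeling, with no external adapters or complex architectures, to synthesize Yoruba, Igbo, and Hausa speech. It offers high-quality, natural, and culturally relevant speech synthesis for a variety of applications.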

🚀 Quick Start

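How to use (on Google Colab)

The model can generate audio on its own, but results are better when it is prompted with a voice. About 10 voices are supported by default: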

hausa_female1
hausa_female2
hausa_male1
hausa_male2
igbo_female1
igbo_female2
igbo_male2
yoruba_female1
yoruba_female2
yoruba_male2
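
Each default voice name starts with the language it belongs to, so a small lookup can catch a mismatched language and voice before a prompt is built. The sketch below is only an illustration: `DEFAULT_VOICES` and `pick_voice` are hypothetical helpers, not part of the YarnGPT repository, and the table simply mirrors the list above.

```python
# Hypothetical helper: groups the default voices listed above by language.
DEFAULT_VOICES = {
    "hausa": ["hausa_female1", "hausa_female2", "hausa_male1", "hausa_male2"],
    "igbo": ["igbo_female1", "igbo_female2", "igbo_male2"],
    "yoruba": ["yoruba_female1", "yoruba_female2", "yoruba_male2"],
}


def pick_voice(lang, speaker_name=None):
    """Return a voice for `lang`, checking that `speaker_name` belongs to it."""
    voices = DEFAULT_VOICES.get(lang)
    if voices is None:
        raise ValueError(f"unsupported language: {lang!r}")
    if speaker_name is None:
        # fall back to the first listed voice for that language
        return voices[0]
    if speaker_name not in voices:
        raise ValueError(f"{speaker_name!r} is not a {lang} voice")
    return speaker_name


print(pick_voice("yoruba", "yoruba_male2"))
print(pick_voice("igbo"))
```

The returned name can then be passed as the `speaker_name` argument when creating a prompt.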

Prompting YarnGPT-local

# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git


# install some necessary libraries
!pip install outetts==0.2.3 uroman

#import some important packages 
import os
import re
import json
import torch
import inflect
import random
import uroman as ur
import numpy as np
import torchaudio
import IPython
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizerForLocal


# download the wavtokenizer weights and config (to encode and decode the audio)
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

# model path and wavtokenizer weight path (the paths are assumed based on Google colab, a different environment might save the weights to a different location).
hf_path="saheedniyi/YarnGPT-local"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"

# create the AudioTokenizer object 
audio_tokenizer=AudioTokenizerForLocal(
    hf_path,wav_tokenizer_model_path,wav_tokenizer_config_path
)

#load the model weights

model = AutoModelForCausalLM.from_pretrained(hf_path,torch_dtype="auto").to(audio_tokenizer.device)

# your input text
text="Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un."

# create a prompt; the `speaker_name` argument is optional
prompt=audio_tokenizer.create_prompt(text,"yoruba","yoruba_male2")

# tokenize the prompt
input_ids=audio_tokenizer.tokenize_prompt(prompt)

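# generate output from the model; tune the `.generate` parameters as you wish
# (note: in transformers, `temperature` only takes effect when `do_sample=True`;
# as written, this call runs beam search)
output = model.generate(
    input_ids=input_ids,
    temperature=0.1,
    repetition_penalty=1.1,
    num_beams=4,
    max_length=4000,
)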

# convert the output to "audio codes"
codes=audio_tokenizer.get_codes(output)

# converts the codes to audio 
audio=audio_tokenizer.get_audio(codes)

# play the audio
IPython.display.Audio(audio,rate=24000)

# save the audio
torchaudio.save("audio.wav", audio, sample_rate=24000)
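
The prompt, tokenize, generate, and decode steps above can be folded into one function so that different texts and voices are easy to try. This is a minimal sketch, not the author's API: `synthesize` is a hypothetical name, and it only chains the calls already shown (`create_prompt`, `tokenize_prompt`, `generate`, `get_codes`, `get_audio`). The tokenizer and model are passed in as arguments, so the function assumes both were loaded as in the code above.

```python
def synthesize(text, lang, speaker_name, audio_tokenizer, model, **generate_kwargs):
    """Hypothetical wrapper around the steps shown above.

    `audio_tokenizer` is expected to behave like `AudioTokenizerForLocal`
    and `model` like the loaded causal language model.
    """
    # build and tokenize the prompt for the chosen language and voice
    prompt = audio_tokenizer.create_prompt(text, lang=lang, speaker_name=speaker_name)
    input_ids = audio_tokenizer.tokenize_prompt(prompt)

    # let the language model continue the prompt with audio tokens
    output = model.generate(input_ids=input_ids, **generate_kwargs)

    # map generated tokens to audio codes, then decode the codes to a waveform
    codes = audio_tokenizer.get_codes(output)
    return audio_tokenizer.get_audio(codes)
```

With the objects from the code above, a call might look like `audio = synthesize(text, "yoruba", "yoruba_male2", audio_tokenizer, model, repetition_penalty=1.1, num_beams=4, max_length=4000)`.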

A simple news reader for local languages

# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git


# install some necessary libraries
!pip install outetts uroman trafilatura pydub


#import important packages
import os
import re
import json
import torch
import inflect
import random
import requests
import trafilatura
import uroman as ur
import numpy as np
import torchaudio
import IPython
from pydub import AudioSegment
from pydub.effects import normalize
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizer,AudioTokenizerForLocal

# download the `WavTokenizer` files
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

tokenizer_path="saheedniyi/YarnGPT-local"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"


audio_tokenizer=AudioTokenizerForLocal(
    tokenizer_path,wav_tokenizer_model_path,wav_tokenizer_config_path
)

model = AutoModelForCausalLM.from_pretrained(tokenizer_path,torch_dtype="auto").to(audio_tokenizer.device)

# Split text into chunks
def split_text_into_chunks(text, word_limit=25):
    # split on full stops and drop empty fragments
    sentences = [sentence.strip() for sentence in text.split(".") if sentence.strip()]
    chunks = []
    for sentence in sentences:
        # a "." entry marks a pause; the generation loop turns it into silence
        chunks.append(".")
        words = sentence.split(" ")
        num_words = len(words)
        if num_words > word_limit:
            # break long sentences into pieces of at most `word_limit` words
            start_index = 0
            while start_index < num_words:
                end_index = min(num_words, start_index + word_limit)
                chunks.append(" ".join(words[start_index:end_index]))
                start_index = end_index
        else:
            chunks.append(sentence)
    return chunks

# reduce the speed of the audio, results from the local languages are always fast
def speed_change(sound, speed=0.9):
    # Manually override the frame_rate. This tells the computer how many
    # samples to play per second
    sound_with_altered_frame_rate = sound._spawn(
        sound.raw_data, overrides={"frame_rate": int(sound.frame_rate * speed)}
    )
    # convert the sound with altered frame rate to a standard frame rate
    # so that regular playback programs will work right. They often only
    # know how to play audio at standard frame rate (like 44.1k)
    return sound_with_altered_frame_rate.set_frame_rate(sound.frame_rate)


page=requests.get("https://alaroye.org/a-maa-too-fo-ipinle-ogun-mo-omo-egbe-okunkun-meje-lowo-ti-te-bayii-omolola/")
content=trafilatura.extract(page.text)
chunks=split_text_into_chunks(content)


all_codes=[]
for i,chunk in enumerate(chunks):
  print(i)
  print("\n")
  print(chunk)
  if chunk==".":
    #add silence for 0.5 seconds if we encounter a full stop
    all_codes.extend([453]*38)
  else:
    prompt=audio_tokenizer.create_prompt(chunk,lang="yoruba",speaker_name="yoruba_female2")
    input_ids=audio_tokenizer.tokenize_prompt(prompt)
    output  = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4000,
            num_beams=5,
        )
    codes=audio_tokenizer.get_codes(output)
    all_codes.extend(codes)


audio=audio_tokenizer.get_audio(all_codes)

#display the output
IPython.display.Audio(audio,rate=24000)

#save audio
torchaudio.save("news1.wav", audio, sample_rate=24000)

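# convert the file to an `AudioSegment` object for further processing
audio_dub=AudioSegment.from_file("news1.wav")

# reduce audio speed (this also reduces quality); keep the returned segment
slowed_audio=speed_change(audio_dub,0.9)

# write the slowed audio to disk (the output file name is illustrative)
slowed_audio.export("news1_slow.wav",format="wav")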
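
To see what the generation loop actually receives, it can help to run the chunking step on a short string. The sketch below restates the same splitting logic in a self-contained form with a deliberately small `word_limit`; the sample sentence is made up for illustration.

```python
def split_text_into_chunks(text, word_limit=25):
    """Same splitting logic as above, restated so this example runs on its own."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks = []
    for sentence in sentences:
        chunks.append(".")  # pause marker placed before every sentence
        words = sentence.split(" ")
        if len(words) > word_limit:
            for start in range(0, len(words), word_limit):
                chunks.append(" ".join(words[start:start + word_limit]))
        else:
            chunks.append(sentence)
    return chunks


sample = "one two three four five. six seven."
chunks = split_text_into_chunks(sample, word_limit=3)
print(chunks)
# ['.', 'one two three', 'four five', '.', 'six seven']

# every "." becomes 38 silence tokens in the loop above
print(chunks.count(".") * 38)
# 76
```

Note that a pause marker is emitted before the first sentence as well, so the generated audio begins with a short silence.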

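✨ Key Features

Speech synthesis in three languages: Yoruba, Igbo, and Hausa.
Pure language modeling, with no external adapters or complex architectures.
A choice of several voices.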

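📚 Documentation

Model description

Developed by: Saheedniyi
Model type: Text-to-speech
Languages: Yoruba, Igbo, Hausa
Fine-tuned from: HuggingFaceTB/SmolLM2-360M
Repository: YarnGPT Github Repository
Paper: in progress.
Demos:
1. Prompting YarnGPT-local notebook
2. Simple news reader: YarnGPT-local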

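Uses

Intended for experimental generation of Yoruba, Igbo, and Hausa speech.

Out-of-scope use

The model is not suitable for generating speech in languages other than Yoruba, Igbo, and Hausa.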

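Bias, risks, and limitations

The model may not capture the full diversity of Nigerian accents and may exhibit biases inherited from its training data.
Generated audio is sometimes very fast and may need post-processing.
The model does not take "intonation" into account, which sometimes causes certain words to be mispronounced.
The model does not respond to some prompts.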
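
For the overly fast audio mentioned above, one simple form of post-processing is to stretch the waveform by resampling it. The sketch below is a dependency-free illustration using linear interpolation on a plain list of samples; `slow_down` is a hypothetical helper, not part of YarnGPT. Like the frame-rate trick in the news reader code, it lowers the pitch along with the speed.

```python
def slow_down(samples, speed=0.9):
    """Stretch `samples` so playback at the same sample rate is `speed` times as fast.

    Hypothetical helper: linear interpolation only, so pitch drops with speed.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    n_out = int(len(samples) / speed)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * speed                  # fractional position in the input
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


stretched = slow_down([0.0, 1.0, 2.0, 3.0], speed=0.5)
print(stretched)
# [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.0]
```

At `speed=0.9`, one second of 24 kHz audio (24,000 samples) becomes roughly 26,666 samples, or about 1.11 seconds.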

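Recommendations

Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Feedback and more diverse training data are encouraged.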

Audio samples

Listen to samples generated by YarnGPT:

Input	Audio	Notes
Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: yoruba_male2
Iwadii fihan pe ọkan lara awọn eeyan meji yii lo ṣee si ja sinu tanki epo disu naa lasiko to n ṣiṣẹ lọwọ.		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: yoruba_female1
Shirun da gwamnati mai ci yanzu ta yi wajen kin bayani a akan halin da ake ciki a game da batun kidayar shi ne ya janyo wannan zargi da jam'iyyar ta Labour ta yi.		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: hausa_male2
A lokuta da dama yakan fito a matsayin jarumin da ke taimaka wa babban jarumi, kodayake a wasu fina-finan yakan fito a matsayin babban jarumi.		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: hausa_female1
Amụma ndị ọzọ o buru gụnyere inweta ihe zuru oke, ịmụta ụmụaka nye ndị na-achọ nwa		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: igbo_female1

Training

Data

Trained on open-source Yoruba, Igbo, and Hausa datasets.

Preprocessing

Audio files were preprocessed, resampled to 24 kHz, and tokenized with WavTokenizer.
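
The relationship between audio length and token count can be worked out with a little arithmetic. The sketch below assumes 75 audio tokens per second; that figure is read from the WavTokenizer config file name (`..._frame75_...`) rather than stated in this card, so treat it as an assumption. The helper names are hypothetical.

```python
import math

SAMPLE_RATE = 24_000       # Hz, from the preprocessing step above
TOKENS_PER_SECOND = 75     # assumption, inferred from the "frame75" config name


def tokens_for(seconds):
    """Number of audio tokens needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


def seconds_for(num_tokens):
    """Duration in seconds represented by `num_tokens` audio tokens."""
    return num_tokens / TOKENS_PER_SECOND


# each token would then cover 24000 / 75 = 320 waveform samples
samples_per_token = SAMPLE_RATE // TOKENS_PER_SECOND

print(samples_per_token)         # 320
print(tokens_for(0.5))           # 38
print(round(seconds_for(4000)))  # 53
```

Under that assumption, the 38 repeated tokens used for a pause in the news reader correspond to about half a second, and `max_length=4000` caps a single generation at roughly 53 seconds of audio, less whatever the prompt tokens consume.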

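Training hyperparameters

Epochs: 5
Batch size: 4
Scheduler: linear schedule, with warmup over the first 4 epochs and linear decay to zero over the final epoch
Optimizer: AdamW (betas=(0.9, 0.95), weight_decay=0.01)
Learning rate: 1 × 10^-3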
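
The scheduler description can be read as a piecewise-linear function of the training step. The sketch below is one plausible interpretation, not the author's training code: it assumes the warmup rises linearly to the peak learning rate, and `steps_per_epoch` is a hypothetical value that depends on dataset size, which the card does not give.

```python
def learning_rate(step, steps_per_epoch, peak_lr=1e-3, warmup_epochs=4, total_epochs=5):
    """Linear warmup over `warmup_epochs`, then linear decay to zero."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        # ramp up: reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    # ramp down: reaches zero at the end of the final epoch
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# with a hypothetical 100 steps per epoch
for step in (0, 199, 399, 450, 500):
    print(step, learning_rate(step, steps_per_epoch=100))
```

With 100 steps per epoch, the rate reaches 1e-3 at step 399, halves to 5e-4 midway through the final epoch, and hits zero at step 500.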

Hardware

GPU: 1 × A100 (Google Colab: 30 hours)

Software

Training framework: PyTorch

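Future improvements

Scale up the model and the training data.
Wrap the model in an API endpoint.
Add voice cloning.
Possibly extend it into a speech-to-speech assistant model.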

Citation (optional)

BibTeX:

@misc{yarngpt2025,
  author = {Saheed Azeez},
  title = {YarnGPT: Nigerian-Accented English Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SaheedAzeez/yarngpt}
}

APA:

Saheed Azeez. (2025). YarnGPT-local: Nigerian languages Text-to-Speech Model. Hugging Face. Available at: https://huggingface.co/saheedniyi/YarnGPT-local