YarnGPT-local: an open-source text-to-speech model for three languages with high-quality speech synthesis


Yarngpt Local

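Developed by saheedniyi

YarnGPT-local is a text-to-speech model designed for Yoruba, Igbo and Hausa. It relies on pure language modeling to deliver high-quality, natural and culturally appropriate speech synthesis.

Speech synthesis

Transformers

Other tags: #NigerianLanguagesTTS #MultiVoiceSynthesis #PureLanguageModeling

Downloads: 20

Release date: January 15, 2025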

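Model overview

This model focuses on speech synthesis for Nigeria's major languages (Yoruba, Igbo and Hausa). It needs no external adapters or complex architectures and suits a range of application scenarios.

Model features

Multilingual support

Optimized for Nigeria's three major languages (Yoruba, Igbo and Hausa)

Diverse voice styles

Supports 10 voice styles, including male and female voices

Pure language modeling

Achieves high-quality speech synthesis without external adapters or complex architectures

Cultural fit

Generated speech is natural and fits the local cultural context

Model capabilities

Text-to-speech synthesis

Multilingual speech generation

Voice style control

Use cases

News reading

Local-language news broadcasts

Converts news text into Yoruba, Igbo or Hausa speech

Generates natural, fluent news-broadcast audio

Educational applications

Language learning support

Provides standard pronunciation examples for language learners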

🚀 YarnGPT-local

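YarnGPT-local is a text-to-speech (TTS) model that synthesizes Yoruba, Igbo and Hausa speech using pure language modeling, without external adapters or complex architectures. It offers high-quality, natural and culturally relevant speech synthesis for diverse applications.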


Usage (on Google Colab)

The model can generate speech on its own, but prompting it with one of the supported voices is recommended. The following 10 voices are supported by default:

hausa_female1
hausa_female2
hausa_male1
hausa_male2
igbo_female1
igbo_female2
igbo_male2
yoruba_female1
yoruba_female2
yoruba_male2
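
The voice names above appear to follow a `<language>_<gender><index>` pattern, so the language passed to `create_prompt` can be checked against the chosen voice before generating. The following is a minimal sketch under that assumption; the names `DEFAULT_VOICES`, `voices_for` and `check_voice` are hypothetical helpers and are not part of the yarngpt package.

```python
# Default voices as listed above.
DEFAULT_VOICES = [
    "hausa_female1", "hausa_female2", "hausa_male1", "hausa_male2",
    "igbo_female1", "igbo_female2", "igbo_male2",
    "yoruba_female1", "yoruba_female2", "yoruba_male2",
]


def voices_for(lang):
    """Return the default voices whose name starts with the given language."""
    prefix = lang.lower() + "_"
    return [voice for voice in DEFAULT_VOICES if voice.startswith(prefix)]


def check_voice(lang, speaker_name):
    """Return speaker_name, or raise ValueError if it is not a default voice for lang."""
    candidates = voices_for(lang)
    if speaker_name not in candidates:
        raise ValueError(
            f"{speaker_name!r} is not a default {lang} voice; choose one of {candidates}"
        )
    return speaker_name


print(voices_for("igbo"))
```

Calling `check_voice("yoruba", "yoruba_male2")` before building a prompt would catch a mismatched language and voice early, instead of after a long generation run.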

Prompting YarnGPT-local

# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git


# install some necessary libraries
!pip install outetts==0.2.3 uroman

# import some important packages
import os
import re
import json
import torch
import inflect
import random
import uroman as ur
import numpy as np
import torchaudio
import IPython
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizerForLocal


# download the wavtokenizer weights and config (to encode and decode the audio)
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

# model path and wavtokenizer weight path (the paths are assumed based on Google colab, a different environment might save the weights to a different location).
hf_path="saheedniyi/YarnGPT-local"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"

# create the AudioTokenizer object 
audio_tokenizer=AudioTokenizerForLocal(
    hf_path,wav_tokenizer_model_path,wav_tokenizer_config_path
)

# load the model weights

model = AutoModelForCausalLM.from_pretrained(hf_path,torch_dtype="auto").to(audio_tokenizer.device)

# your input text
text="Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un."

# creating a prompt, when creating a prompt, there is an optional `speaker_name` parameter
prompt=audio_tokenizer.create_prompt(text,"yoruba","yoruba_male2")

# tokenize the prompt
input_ids=audio_tokenizer.tokenize_prompt(prompt)

# generate output from the model; tune the `.generate` parameters as you wish
# note: transformers only applies `temperature` when `do_sample=True`; with beam search alone it is ignored
output = model.generate(
    input_ids=input_ids,
    temperature=0.1,
    repetition_penalty=1.1,
    num_beams=4,
    max_length=4000,
)

# convert the output to "audio codes"
codes=audio_tokenizer.get_codes(output)

# converts the codes to audio 
audio=audio_tokenizer.get_audio(codes)

# play the audio (in a notebook, run this as the last line of a cell so the player is displayed)
IPython.display.Audio(audio,rate=24000)

# save the audio
torchaudio.save("audio.wav", audio, sample_rate=24000)
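
The prompt, tokenize, generate, decode steps above can be collected into one function so that trying another voice or language is a single call. This is only a sketch: `synthesize` is a hypothetical helper, and it assumes the `audio_tokenizer` and `model` objects behave exactly as in the example above. Only the method calls already shown there are used.

```python
def synthesize(text, lang, speaker_name, audio_tokenizer, model, **generate_kwargs):
    """Run the prompt -> input ids -> generated ids -> audio codes -> audio steps."""
    # Defaults mirror the example above; any of them can be overridden by the caller.
    params = {
        "temperature": 0.1,
        "repetition_penalty": 1.1,
        "num_beams": 4,
        "max_length": 4000,
    }
    params.update(generate_kwargs)

    prompt = audio_tokenizer.create_prompt(text, lang, speaker_name)
    input_ids = audio_tokenizer.tokenize_prompt(prompt)
    output = model.generate(input_ids=input_ids, **params)
    codes = audio_tokenizer.get_codes(output)
    return audio_tokenizer.get_audio(codes)
```

With the objects from the example loaded, `audio = synthesize(text, "yoruba", "yoruba_male2", audio_tokenizer, model)` would reproduce the steps above, and `num_beams=2` could be passed to trade quality for speed.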

Simple news reader for local languages

# clone the YarnGPT repo to get access to the `audiotokenizer`
!git clone https://github.com/saheedniyi02/yarngpt.git


# install some necessary libraries
!pip install outetts uroman trafilatura pydub


# import important packages
import os
import re
import json
import torch
import inflect
import random
import requests
import trafilatura
import uroman as ur
import numpy as np
import torchaudio
import IPython
from pydub import AudioSegment
from pydub.effects import normalize
from transformers import AutoModelForCausalLM, AutoTokenizer
from outetts.wav_tokenizer.decoder import WavTokenizer
from yarngpt.audiotokenizer import AudioTokenizer,AudioTokenizerForLocal

# download the `WavTokenizer` files
!wget https://huggingface.co/novateur/WavTokenizer-medium-speech-75token/resolve/main/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml
!gdown 1-ASeEkrn4HY49yZWHTASgfGFNXdVnLTt

tokenizer_path="saheedniyi/YarnGPT-local"
wav_tokenizer_config_path="/content/wavtokenizer_mediumdata_frame75_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
wav_tokenizer_model_path = "/content/wavtokenizer_large_speech_320_24k.ckpt"


audio_tokenizer=AudioTokenizerForLocal(
    tokenizer_path,wav_tokenizer_model_path,wav_tokenizer_config_path
)

model = AutoModelForCausalLM.from_pretrained(tokenizer_path,torch_dtype="auto").to(audio_tokenizer.device)

# Split text into chunks of at most `word_limit` words; a "." entry marks a pause before each sentence
def split_text_into_chunks(text, word_limit=25):
    sentences = [sentence.strip() for sentence in text.split('.') if sentence.strip()]
    chunks = []
    for sentence in sentences:
        # "." is a pause marker that the generation loop below turns into silence
        chunks.append(".")
        words = sentence.split(" ")
        num_words = len(words)
        if num_words > word_limit:
            # break long sentences into consecutive pieces of `word_limit` words
            start_index = 0
            while start_index < num_words:
                end_index = min(num_words, start_index + word_limit)
                chunks.append(" ".join(words[start_index:end_index]))
                start_index = end_index
        else:
            chunks.append(sentence)
    return chunks

# reduce the speed of the audio; output for the local languages tends to be fast
def speed_change(sound, speed=0.9):
    # Manually override the frame_rate. This tells the computer how many
    # samples to play per second
    sound_with_altered_frame_rate = sound._spawn(sound.raw_data, overrides={
        "frame_rate": int(sound.frame_rate * speed)
    })
    # convert the sound with altered frame rate to a standard frame rate
    # so that regular playback programs will work right. They often only
    # know how to play audio at standard frame rate (like 44.1k)
    return sound_with_altered_frame_rate.set_frame_rate(sound.frame_rate)


# fetch a Yoruba news article; the main text is extracted and split into chunks below
page=requests.get("https://alaroye.org/a-maa-too-fo-ipinle-ogun-mo-omo-egbe-okunkun-meje-lowo-ti-te-bayii-omolola/")
content=trafilatura.extract(page.text)
chunks=split_text_into_chunks(content)


all_codes=[]
for i,chunk in enumerate(chunks):
  print(i)
  print("\n")
  print(chunk)
  if chunk==".":
    #add silence for 0.5 seconds if we encounter a full stop
    all_codes.extend([453]*38)
  else:
    prompt=audio_tokenizer.create_prompt(chunk,lang="yoruba",speaker_name="yoruba_female2")
    input_ids=audio_tokenizer.tokenize_prompt(prompt)
    output  = model.generate(
            input_ids=input_ids,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4000,
            num_beams=5,
        )
    codes=audio_tokenizer.get_codes(output)
    all_codes.extend(codes)


audio=audio_tokenizer.get_audio(all_codes)

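# display the output (in a notebook, run this as the last line of a cell so the player is displayed)
IPython.display.Audio(audio,rate=24000)

# save audio
torchaudio.save("news1.wav", audio, sample_rate=24000)

# convert file to an `AudioSegment` object for further processing
audio_dub=AudioSegment.from_file("news1.wav")

# reduce audio speed: it reduces quality also
# `speed_change` returns a new `AudioSegment`, so keep the result and export it
# (the output file name "news1_slow.wav" is only an example)
slowed_audio=speed_change(audio_dub,0.9)
slowed_audio.export("news1_slow.wav", format="wav")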
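
To see what the chunking step in the news reader produces without loading the model, the splitter can be run on a short string. The block below is a compact restatement of `split_text_into_chunks` that is intended to behave the same way as the version above; the sample sentence and the small `word_limit` are made up for illustration.

```python
def split_text_into_chunks(text, word_limit=25):
    """Split text into sentences, then into pieces of at most word_limit words.

    A "." entry is emitted before every sentence; the news reader replaces it
    with silence codes.
    """
    chunks = []
    for sentence in (part.strip() for part in text.split(".")):
        if not sentence:
            continue
        chunks.append(".")  # pause marker
        words = sentence.split(" ")
        for start in range(0, len(words), word_limit):
            chunks.append(" ".join(words[start:start + word_limit]))
    return chunks


sample = "one two three four five six seven. eight nine."
for chunk in split_text_into_chunks(sample, word_limit=3):
    print(repr(chunk))
```

Each sentence is preceded by a `"."` marker, including the first one, so the generated audio starts with a short pause. Splitting on a fixed word count can cut a phrase in the middle, which may affect prosody at chunk boundaries.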

Model description

Developed by: Saheedniyi
Model type: Text-to-Speech
Language(s) (NLP): Igbo, Yoruba, Hausa → Speech
Finetuned from: HuggingFaceTB/SmolLM2-360M
Repository: YarnGPT Github Repository
Paper: in progress
Demo: 1) Prompt YarnGPT-local notebook 2) Simple news reader: YarnGPT-local

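Uses

Generate Yoruba, Igbo and Hausa speech for experimental purposes.

Out-of-scope use

The model is not suitable for generating speech in languages other than Yoruba, Igbo and Hausa.

Bias, risks, and limitations

The model may not capture the full diversity of Nigerian accents and could exhibit biases based on the training dataset.
The audio generated by the model is sometimes very fast and might need some post-processing.
The model does not take intonation into account, which sometimes leads to mispronunciation of words.
The model does not respond to some prompts.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Feedback and diverse training data contributions are encouraged.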

Audio samples

Listen to samples generated by YarnGPT:

Input	Audio	Notes
Ẹ maa rii pe lati bi ọsẹ meloo kan ni ijiroro ti wa lati ọdọ awọn ileeṣẹ wọnyi wi pe wọn fẹẹ ṣafikun si owo ipe pẹlu ida ọgọrun-un		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: yoruba_male2
Iwadii fihan pe ọkan lara awọn eeyan meji yii lo ṣee si ja sinu tanki epo disu naa lasiko to n ṣiṣẹ lọwọ.		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: yoruba_female1
Shirun da gwamnati mai ci yanzu ta yi wajen kin bayani a akan halin da ake ciki a game da batun kidayar shi ne ya janyo wannan zargi da jam'iyyar ta Labour ta yi.		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: hausa_male2
A lokuta da dama yakan fito a matsayin jarumin da ke taimaka wa babban jarumi, kodayake a wasu fina-finan yakan fito a matsayin babban jarumi.		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: hausa_female1
Amụma ndị ọzọ o buru gụnyere inweta ihe zuru oke, ịmụta ụmụaka nye ndị na-achọ nwa		(temperature=0.1, repetition_penalty=1.1, num_beams=4), voice: igbo_female1

Training

Data

Trained on open-source Yoruba, Igbo and Hausa datasets.

Preprocessing

Audio files were preprocessed, resampled to 24 kHz and tokenized using wavtokenizer.
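
The sample rate and the token rate together determine how much audio a given number of tokens represents. The sketch below assumes a rate of 75 audio tokens per second, which is inferred only from the `75token` and `frame75` parts of the WavTokenizer file names used earlier and is not stated by the model card; the function names are hypothetical.

```python
SAMPLE_RATE = 24_000      # Hz, from the preprocessing note
TOKENS_PER_SECOND = 75    # assumption, inferred from the WavTokenizer file names


def samples_per_token():
    """Number of waveform samples covered by one audio token."""
    return SAMPLE_RATE // TOKENS_PER_SECOND


def tokens_to_seconds(n_tokens):
    """Duration of audio represented by n_tokens audio tokens."""
    return n_tokens / TOKENS_PER_SECOND


print(samples_per_token())                # samples per token
print(round(tokens_to_seconds(38), 3))    # length of the pause used in the news reader
print(round(tokens_to_seconds(4000), 1))  # rough upper bound implied by max_length=4000
```

Under this assumption, the 38 silence codes inserted by the news reader come to roughly half a second, matching the comment in that code. Since `max_length=4000` also counts the prompt tokens, the audio obtainable from one `generate` call is somewhat shorter than the last figure, which is one reason the news reader splits long text into chunks.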

Training hyperparameters

Number of epochs: 5
Batch size: 4
Scheduler: linear schedule with warmup for 4 epochs, then linear decay to zero for the last epoch
Optimizer: AdamW (betas=(0.9, 0.95), weight_decay=0.01)
Learning rate: 1*10^-3
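
One way to read the scheduler line is as a piecewise-linear learning-rate curve: a ramp from zero to the peak rate over the first 4 epochs, then a ramp back down to zero over the final epoch. The sketch below follows that reading only; it is not the author's training code, and `lr_at_step` and `steps_per_epoch` are hypothetical names, since the number of steps per epoch is not given.

```python
def lr_at_step(step, steps_per_epoch, base_lr=1e-3, warmup_epochs=4, total_epochs=5):
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        # linear ramp from 0 up to base_lr
        return base_lr * step / warmup_steps
    # linear ramp from base_lr down to 0 over the remaining steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


# Illustrative value only: 100 steps per epoch.
for step in (0, 200, 400, 450, 500):
    print(step, lr_at_step(step, steps_per_epoch=100))
```

With these values the rate peaks at 1e-3 at step 400, the end of the fourth epoch, and reaches zero at step 500.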

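Hardware

GPU: 1 A100 (Google Colab: 30 hours)

Software

Training framework: PyTorch

Future improvements

Scaling up model size and training data
Wrapping the model in an API endpoint
Voice cloning
Potential expansion into speech-to-speech assistant models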

Citation [optional]

BibTeX:

@misc{yarngpt2025,
  author = {Saheed Azeez},
  title = {YarnGPT-local: Nigerian languages Text-to-Speech Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/saheedniyi/YarnGPT-local}
}

APA:

Saheed Azeez. (2025). YarnGPT-local: Nigerian languages Text-to-Speech Model. Hugging Face. Available at: https://huggingface.co/saheedniyi/YarnGPT-local