speecht5_tts-wolof: a free, open-source model adapted to Wolof for high-quality text-to-speech


Speecht5 Tts Wolof

Developed by bilalfaye

A Wolof text-to-speech (TTS) model fine-tuned from the SpeechT5 architecture, using a custom tokenizer adapted to the characteristics of Wolof.

Text-to-Speech

Safetensors

Open-source license: MIT

#Wolof speech synthesis #Low-resource language TTS #Custom tokenizer

Downloads: 126

Release date: 1/9/2025

Model Overview

This model is a version of Microsoft SpeechT5 fine-tuned for Wolof text-to-speech. It provides Wolof speech synthesis and is intended to capture the nuances of the language.

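Model Features

Wolof-specific tokenizer

Uses a custom tokenizer designed for Wolof, so that text is processed according to the characteristics of the language.

Optimized speech synthesis

Fine-tuning captures the phonetic and syntactic features specific to Wolof.

Efficient generation

Generation parameters such as beam search and temperature are used to tune output quality.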

Model Capabilities

Wolof text-to-speech

Speech synthesis in a variety of styles

Speaker embedding support

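Use Cases

Voice interfaces

Virtual assistants

Provides voice interaction for Wolof-speaking users.

Accessibility services

Converts text content to speech for visually impaired users.

Educational applications

Language learning tools

Gives learners a reference for Wolof pronunciation.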

🚀 speecht5_tts-wolof

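This model is a version of SpeechT5 fine-tuned for text-to-speech (TTS) on a Wolof dataset. It uses a custom tokenizer designed for Wolof, and the baseline model's configuration was adjusted to account for the new vocabulary that the tokenizer introduces. The result is a SpeechT5 variant whose speech synthesis is specialized for Wolof.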

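🚀 Quick Start

The model is based on the SpeechT5 architecture, which unifies speech recognition and speech synthesis in a single framework. It was fine-tuned for text-to-speech (TTS) with a custom tokenizer and an adapted configuration that account for Wolof's distinct vocabulary. Fine-tuning used a dataset containing Wolof text, so that the model can synthesize speech that captures the nuances of the language.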

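✨ Key Features

Text-to-speech: the model converts Wolof text into natural-sounding speech. It can be integrated into voice interfaces and virtual assistants for Wolof-speaking communities, or into any application that needs speech synthesis.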

📦 Installation

To install the required dependencies, run the following command.

!pip install transformers datasets
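
Before running the usage example, it can help to confirm that the packages it imports are actually available. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` tuple are illustrative, and the package list is an assumption taken from the imports in the example code (`torch`, `transformers`, `datasets`).

```python
import importlib.util

# Assumption: these are the top-level packages imported by the usage example.
REQUIRED = ("torch", "transformers", "datasets")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty list means every listed package can be imported; any name that is returned still needs to be installed.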

💻 Usage Example

Basic usage

import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor
from transformers import SpeechT5HifiGan

def load_speech_model(checkpoint="bilalfaye/speecht5_tts-wolof", vocoder_checkpoint="microsoft/speecht5_hifigan"):
    """
    Load the SpeechT5 model, processor, and vocoder for text-to-speech.
    
    Args:
        checkpoint (str): The model checkpoint for SpeechT5 TTS.
        vocoder_checkpoint (str): The checkpoint for the HiFi-GAN vocoder.
    
    Returns:
        processor: The processor for the model.
        model: The loaded SpeechT5 model.
        vocoder: The loaded HiFi-GAN vocoder.
        device: The device (CPU or GPU) the model is loaded on.
    """
    # Check for GPU availability and set device accordingly
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Load the SpeechT5 processor and model
    processor = SpeechT5Processor.from_pretrained(checkpoint)
    model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint).to(device)  # Move model to the correct device

    # Load the HiFi-GAN vocoder
    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_checkpoint).to(device)  # Move vocoder to the correct device

    return processor, model, vocoder, device

# Example usage
processor, model, vocoder, device = load_speech_model()

# Verify the device being used
print(f"Model and vocoder loaded on device: {device}")

from datasets import load_dataset
# Load speaker embeddings (this dataset contains speaker-specific embeddings)
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

from IPython.display import Audio, display

def generate_speech_from_text(text, 
                              speaker_embedding=speaker_embedding,
                              processor=processor,
                              model=model,
                              vocoder=vocoder):            
    """
    Generates speech from a given text using SpeechT5 and HiFi-GAN vocoder.

    Args:
        text (str): The input text to be converted to speech.
        speaker_embedding (torch.Tensor): The speaker embedding tensor.
        processor (SpeechT5Processor): The processor for the model.
        model (SpeechT5ForTextToSpeech): The loaded SpeechT5 model.
        vocoder (SpeechT5HifiGan): The loaded HiFi-GAN vocoder.

    Returns:
        None
    """
    # Parameters for text-to-speech generation
    max_text_positions = model.config.max_text_positions  # Token limit
    max_length = model.config.max_length * 1.2  # Slightly extended max_length
    min_length = max_length // 3  # Adjust based on max_length
    num_beams = 7  # Use beam search for better quality
    temperature = 0.6  # Reduce temperature for stability

    # Tokenize the input text and move input tensor to the correct device
    inputs = processor(text=text, return_tensors="pt", padding=True, truncation=True, max_length=max_text_positions)
    inputs = {key: value.to(model.device) for key, value in inputs.items()}  # Move inputs to device

    # Generate speech
    speech = model.generate(
        inputs["input_ids"],
        speaker_embeddings=speaker_embedding.to(model.device),  # Ensure speaker_embedding is also on the correct device
        vocoder=vocoder,
        max_length=int(max_length),
        min_length=int(min_length),
        num_beams=num_beams,
        temperature=temperature,
        no_repeat_ngram_size=3,
        repetition_penalty=1.5,
        eos_token_id=None,
        use_cache=True
    )

    # Detach the speech from the computation graph and move it to CPU
    speech = speech.detach().cpu().numpy()

    # Play the generated speech using IPython Audio
    display(Audio(speech, rate=16000))


# Example usage
text = "ñu ne ñoom ñooy nattukaay satélite yi"
generate_speech_from_text(text)
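
The example above tokenizes with `truncation=True` and `max_length=max_text_positions`, so any text beyond that limit is cut off before synthesis. One way to avoid losing the tail of a long passage is to split it into smaller pieces and synthesize each piece separately. The sketch below is an illustration, not part of the original model card: `split_into_chunks` is a hypothetical helper, and by default it measures length in characters, which is only a rough proxy for the token count the model actually limits.

```python
import re


def split_into_chunks(text, max_units, count_units=len):
    """Split text into chunks of at most max_units.

    Breaks at sentence boundaries first, then at spaces. A single word
    longer than max_units is kept whole rather than cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if count_units(candidate) <= max_units:
            current = candidate
            continue
        # The sentence does not fit: flush what we have so far.
        if current:
            chunks.append(current)
            current = ""
        # Rebuild the sentence word by word, starting a new chunk when full.
        for word in sentence.split():
            candidate = f"{current} {word}".strip()
            if current and count_units(candidate) > max_units:
                chunks.append(current)
                current = word
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to `generate_speech_from_text`. To measure length in tokens instead of characters, a token-counting callable can be supplied as `count_units`; something like `lambda s: len(processor(text=s)["input_ids"])` may work with the processor loaded above, but that is an untested assumption.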

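📚 Documentation

Intended Uses and Limitations

Intended uses

Text-to-speech: the model can be used to convert Wolof text into natural-sounding speech. It can be integrated into voice interfaces, virtual assistants, or applications for Wolof-speaking communities that need speech synthesis.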
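
The usage example plays audio inline with IPython, which only works in a notebook. An application outside a notebook would typically write the waveform to a file instead. The following is a minimal sketch using Python's standard `wave` module; the helper name `save_wav` is hypothetical, and it assumes the waveform is a one-dimensional sequence of floats in the range [-1.0, 1.0] sampled at 16 kHz, the rate the example passes to `Audio`.

```python
import struct
import wave


def save_wav(samples, path, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    Returns the number of samples written. Values outside the range are clipped.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2
```

To use it with the example code, `generate_speech_from_text` would need to return the `speech` array rather than only displaying it; that change is not part of the original card.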

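Limitations

Limited scope: the model was fine-tuned specifically for Wolof, so performance may degrade on other languages or accents.
Data availability: the model was fine-tuned on a Wolof dataset, but the quality of the generated speech may vary with the complexity of the input text and with the dataset used for training.
Vocabulary and tokenizer constraints: the tokenizer was trained specifically for Wolof, so it may not handle out-of-vocabulary words or unknown characters effectively.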
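
Given the tokenizer constraint above, it can be useful to check how much of an input falls outside the vocabulary before synthesizing it. The sketch below is illustrative only: `unknown_token_report` is a hypothetical helper that works on a plain list of token strings, and the default unknown-token marker `"<unk>"` is an assumption that should be replaced with whatever marker the model's tokenizer actually uses.

```python
def unknown_token_report(tokens, unk_token="<unk>"):
    """Return (positions, ratio) for tokens equal to unk_token.

    positions: indices of unknown tokens in the input list.
    ratio: fraction of tokens that are unknown (0.0 for an empty list).
    """
    positions = [i for i, tok in enumerate(tokens) if tok == unk_token]
    ratio = len(positions) / len(tokens) if tokens else 0.0
    return positions, ratio
```

A high ratio suggests the text contains characters or words the tokenizer was not trained on, in which case the synthesized speech is likely to skip or distort those parts.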

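Training and Evaluation Data

The model was fine-tuned on a custom dataset of Wolof text. The dataset was used to adapt the model so that it generates speech that accurately reflects the phonological and syntactic characteristics of Wolof.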

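Training Procedure

Training hyperparameters

The following hyperparameters were used during training.

Parameter	Value
Learning rate	1e-05
Training batch size	8
Evaluation batch size	2
Seed	42
Gradient accumulation steps	8
Total training batch size	64
Optimizer	Adam (betas=(0.9, 0.999), epsilon=1e-08)
Learning rate scheduler type	Linear
Warmup steps	500
Training steps	255000
Mixed-precision training	Native AMP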
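
Two rows of the table follow from the others. The total training batch size is the per-step batch size times the gradient accumulation steps (8 × 8 = 64), which suggests training on a single device, although the card does not say so. The sketch below also shows how a linear scheduler with warmup would vary the learning rate over the run. The function names are hypothetical, and the schedule shape (linear ramp from 0 over the warmup steps, then linear decay to 0 at the final step) is an assumption about what "Linear" means here.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to each optimizer update.

    num_devices=1 is an assumption; the card does not state the device count.
    """
    return per_device_batch * accumulation_steps * num_devices


def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=255_000):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period.
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Under these assumptions the learning rate peaks at 1e-05 at step 500 and reaches zero at step 255000.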