xm_transformer_s2ut_en-hk: Open-Source Speech Translation Model for Direct English-to-Hokkien (Minnan) Speech Conversion

Developed by facebook

English-to-Hokkien (Taiwanese) speech-to-speech translation model from fairseq, featuring a single-pass decoder architecture that supports direct speech conversion without intermediate text

Speech Synthesis #English-Hokkien Speech Translation #Direct Speech-to-Speech Conversion #TED Domain-Specific

Downloads 31

Release Time: 10/7/2022

Model Overview

This model performs direct speech-to-speech translation from English to Hokkien (Taiwanese). A Transformer-based model predicts discrete speech units from the English audio, and a unit-based vocoder synthesizes the Hokkien speech from those units

Model Features

Direct Speech Conversion

Achieves end-to-end speech-to-speech translation without intermediate text representation

Multi-Data Source Training

Combines supervised data from the TED domain with weakly supervised data from TED and audiobook domains for training

High-Quality Speech Synthesis

Employs the facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS vocoder to synthesize the Hokkien speech output

Model Capabilities

English-to-Hokkien Speech Translation

Cross-Language Speech Conversion (English speech in, Hokkien speech out)

Use Cases

Language Communication

Real-Time Speech Translation

Translates spoken English into spoken Hokkien in conversations between English and Hokkien speakers (one direction only: English to Hokkien)

Enables natural and fluent cross-language communication

Media Content Processing

TED Talk Translation

Automatically translates English TED Talks into Hokkien versions

Expands content audience reach

🚀 xm_transformer_s2ut_en-hk

A speech-to-speech translation model with a single-pass decoder (S2UT) from fairseq, enabling English-to-Hokkien translation.

This model is trained with supervised data in the TED domain and weakly supervised data in the TED and Audiobook domains. It uses facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS for speech synthesis. The original model card links to the detailed training information and to the Project Page.

✨ Features

Language Pair: English - Hokkien
Training Data: Supervised data from the TED domain and weakly supervised data from the TED and Audiobook domains.
Speech Synthesis: Utilizes facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS
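
To make the "no intermediate text" point more concrete: in an S2UT setup the translation model predicts a sequence of discrete speech unit IDs, which the vocoder then turns into a waveform. The sketch below is only an illustration for sanity-checking such a prediction. It assumes the units arrive as a space-separated string of integers and that there are 2500 possible units (inferred from "km2500" in the vocoder name, not confirmed here). The helper names `parse_units` and `summarize_units` and the example string are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Assumption: 2500 unit clusters, inferred from "km2500" in the vocoder name.
ASSUMED_NUM_UNITS = 2500


def parse_units(unit_str: str, num_units: int = ASSUMED_NUM_UNITS) -> List[int]:
    """Parse a space-separated unit string into a list of integer unit IDs."""
    units = [int(tok) for tok in unit_str.split()]
    for u in units:
        if not 0 <= u < num_units:
            raise ValueError(f"unit {u} is outside the assumed range [0, {num_units})")
    return units


def summarize_units(units: List[int]) -> Dict[str, object]:
    """Return simple statistics that help sanity-check a unit sequence."""
    counts = Counter(units)
    return {
        "length": len(units),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Hypothetical unit string standing in for a model prediction.
example = "17 17 2043 512 512 512 9"
units = parse_units(example)
print(summarize_units(units))
```

A very short sequence or one dominated by a single unit may indicate a problem with the input audio rather than with the model.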

💻 Usage Examples

Basic Usage
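
The example in this section notes that the model requires 16000 Hz mono audio. As a minimal sketch, the check below uses only the Python standard library to confirm that a WAV file meets that requirement before it is loaded. It assumes an uncompressed PCM WAV file (the `wave` module does not read other encodings). The helper names `check_wav` and `write_silence` are hypothetical, and `write_silence` exists only to make the demo self-contained.

```python
import os
import tempfile
import wave
from typing import Tuple

EXPECTED_RATE = 16000    # Hz, per the comment in the usage example
EXPECTED_CHANNELS = 1    # mono


def check_wav(path: str) -> Tuple[int, int]:
    """Return (sample_rate, channels); raise ValueError if not 16 kHz mono."""
    with wave.open(path, "rb") as wf:
        rate, channels = wf.getframerate(), wf.getnchannels()
    if rate != EXPECTED_RATE or channels != EXPECTED_CHANNELS:
        raise ValueError(
            f"expected {EXPECTED_RATE} Hz / {EXPECTED_CHANNELS} channel, "
            f"got {rate} Hz / {channels} channel(s)"
        )
    return rate, channels


def write_silence(path: str, rate: int, channels: int, seconds: float = 0.1) -> None:
    """Write a short silent 16-bit PCM WAV file (demo helper only)."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * n_frames * channels)


# Demo on a temporary file so the sketch runs without any real audio.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path, EXPECTED_RATE, EXPECTED_CHANNELS)
    print(check_wav(demo_path))
```

If the check fails, the audio would need to be down-mixed and resampled first, for example with `torchaudio.functional.resample`.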

import json
import os
from pathlib import Path

import IPython.display as ipd
from fairseq import hub_utils
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.speech_to_text.hub_interface import S2THubInterface
from fairseq.models.text_to_speech import CodeHiFiGANVocoder
from fairseq.models.text_to_speech.hub_interface import VocoderHubInterface

from huggingface_hub import snapshot_download
import torchaudio

cache_dir = os.getenv("HUGGINGFACE_HUB_CACHE")

models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/xm_transformer_s2ut_en-hk",
    arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
    cache_dir=cache_dir,
)
model = models[0].cpu()  # use the first model of the loaded ensemble, on CPU
cfg["task"].cpu = True   # mark the task config for CPU inference
generator = task.build_generator([model], cfg)


# requires 16000Hz mono channel audio
audio, _ = torchaudio.load("/path/to/an/audio/file")

sample = S2THubInterface.get_model_input(task, audio)
unit = S2THubInterface.get_prediction(task, model, generator, sample)

# speech synthesis
library_name = "fairseq"
cache_dir = (
    cache_dir or (Path.home() / ".cache" / library_name).as_posix()
)
cache_dir = snapshot_download(
    "facebook/unit_hifigan_HK_layer12.km2500_frame_TAT-TTS",
    cache_dir=cache_dir,
    library_name=library_name,
)

x = hub_utils.from_pretrained(
    cache_dir,
    "model.pt",
    ".",
    archive_map=CodeHiFiGANVocoder.hub_models(),
    config_yaml="config.json",
    fp16=False,
    is_vocoder=True,
)

with open(f"{x['args']['data']}/config.json") as f:
    vocoder_cfg = json.load(f)
assert (
    len(x["args"]["model_path"]) == 1
), "Too many vocoder models in the input"

vocoder = CodeHiFiGANVocoder(x["args"]["model_path"][0], vocoder_cfg)
tts_model = VocoderHubInterface(vocoder_cfg, vocoder)

tts_sample = tts_model.get_model_input(unit)
wav, sr = tts_model.get_prediction(tts_sample)

ipd.Audio(wav, rate=sr)
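
The last line of the example above only plays the result inside a notebook. Outside a notebook it may be more convenient to write the waveform to disk. The following is a minimal sketch using only the standard library, assuming the samples are mono floats in the range [-1, 1]. The helper name `save_wav` is hypothetical, and the sine tone is just stand-in data for the demo.

```python
import math
import os
import struct
import tempfile
import wave
from typing import Sequence


def save_wav(samples: Sequence[float], sample_rate: int, path: str) -> None:
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


# Demo: 0.1 s of a 440 Hz tone standing in for the vocoder output.
demo_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / demo_rate) for n in range(1600)]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "tone.wav")
    save_wav(tone, demo_rate, out_path)
    print(os.path.getsize(out_path), "bytes written")
```

With the variables from the example above, a call such as `save_wav(wav.tolist(), sr, "translation.wav")` should work, assuming `wav` is a one-dimensional float tensor.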

📄 License

This project is licensed under the CC BY-NC 4.0 license.

📦 Information

Property	Details
Library Name	fairseq
Task	audio-to-audio
Tags	fairseq, audio, audio-to-audio, speech-to-speech-translation
Datasets	MuST-C
