🚀 xm_transformer_s2ut_hk-en
A speech-to-speech translation model with a single-pass decoder (S2UT, speech-to-unit translation) from fairseq, enabling Hokkien-English translation.
This model is trained with supervised data in TED, drama, and TAT domains, as well as weakly supervised data in the drama domain. For detailed training information, refer to here. It utilizes facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur for speech synthesis. Check out the Project Page.
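The pipeline runs in two stages: the S2UT model translates Hokkien speech into a sequence of discrete acoustic units, and the unit HiFi-GAN vocoder then synthesizes English speech from those units. As a rough illustration of what a reduced-unit sequence looks like, the helper below parses a space-separated unit string and collapses consecutive duplicates (this helper is hypothetical, not part of the fairseq API; the real unit format is produced by the S2UT generator):

```python
def parse_units(unit_str):
    """Convert a space-separated unit string like '12 12 845 3' into ints,
    collapsing consecutive duplicates as reduced-unit S2UT models do."""
    units = [int(u) for u in unit_str.split()]
    deduped = units[:1]
    for u in units[1:]:
        if u != deduped[-1]:
            deduped.append(u)
    return deduped

parse_units("12 12 845 3")  # -> [12, 845, 3]
```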
| Property | Details |
|---|---|
| Library Name | fairseq |
| Task | audio-to-audio |
| Tags | fairseq, audio, audio-to-audio, speech-to-speech-translation |
| Datasets | MuST-C, TAT, Hokkien dramas |
| License | cc-by-nc-4.0 |
🚀 Quick Start
This section provides a step-by-step guide on how to use the xm_transformer_s2ut_hk-en model for speech-to-speech translation.
💻 Usage Examples
Basic Usage
```python
import json
import os
from pathlib import Path

import IPython.display as ipd
import torchaudio
from fairseq import hub_utils
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.speech_to_text.hub_interface import S2THubInterface
from fairseq.models.text_to_speech import CodeHiFiGANVocoder
from fairseq.models.text_to_speech.hub_interface import VocoderHubInterface
from huggingface_hub import snapshot_download

cache_dir = os.getenv("HUGGINGFACE_HUB_CACHE")

# Load the Hokkien-English S2UT model (speech -> discrete units)
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/xm_transformer_s2ut_hk-en",
    arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
    cache_dir=cache_dir,
)
model = models[0]
generator = task.build_generator([model], cfg)

# Translate an input audio file into a sequence of discrete units
audio, _ = torchaudio.load("/path/to/an/audio/file")
sample = S2THubInterface.get_model_input(task, audio)
unit = S2THubInterface.get_prediction(task, model, generator, sample)

# Download and load the unit HiFi-GAN vocoder for English speech synthesis
library_name = "fairseq"
cache_dir = cache_dir or (Path.home() / ".cache" / library_name).as_posix()
cache_dir = snapshot_download(
    "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur",
    cache_dir=cache_dir,
    library_name=library_name,
)
x = hub_utils.from_pretrained(
    cache_dir,
    "model.pt",
    ".",
    archive_map=CodeHiFiGANVocoder.hub_models(),
    config_yaml="config.json",
    fp16=False,
    is_vocoder=True,
)
with open(f"{x['args']['data']}/config.json") as f:
    vocoder_cfg = json.load(f)
assert len(x["args"]["model_path"]) == 1, "Too many vocoder models in the input"
vocoder = CodeHiFiGANVocoder(x["args"]["model_path"][0], vocoder_cfg)
tts_model = VocoderHubInterface(vocoder_cfg, vocoder)

# Synthesize the English waveform from the predicted units and play it
tts_sample = tts_model.get_model_input(unit)
wav, sr = tts_model.get_prediction(tts_sample)
ipd.Audio(wav, rate=sr)
```
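The speech encoder expects 16 kHz mono input, so audio at other sample rates should be resampled before being passed to `get_model_input` (in practice with `torchaudio.functional.resample`). The pure-Python helper below is only an illustrative sketch of linear-interpolation resampling, not fairseq or torchaudio API:

```python
def resample_linear(samples, src_sr, dst_sr=16_000):
    """Resample a mono signal via linear interpolation (illustrative only)."""
    if src_sr == dst_sr:
        return list(samples)
    n_out = int(len(samples) * dst_sr / src_sr)
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr          # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# 441 samples at 44.1 kHz become 160 samples at 16 kHz
downsampled = resample_linear([0.0] * 441, src_sr=44_100)
```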
> 📄 **License**
>
> This project is licensed under the cc-by-nc-4.0 license.