🚀 xm_transformer_s2ut_hk-en
A speech-to-speech translation model with a single-pass decoder (S2UT, speech-to-unit translation) from fairseq, enabling Hokkien-English translation.
This model is trained with supervised data in TED, drama, and TAT domains, as well as weakly supervised data in the drama domain. For detailed training information, refer to here. It utilizes facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur for speech synthesis. Check out the Project Page.
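The pipeline runs in two stages: the S2UT model translates Hokkien speech into a sequence of discrete acoustic units, and the unit HiFi-GAN vocoder then synthesizes English speech from those units. As a rough illustration of what a reduced-unit sequence looks like, the helper below parses a space-separated unit string and collapses consecutive duplicates (this helper is hypothetical, not part of the fairseq API; the real unit format is produced by the S2UT generator):

```python
def parse_units(unit_str):
    """Convert a space-separated unit string like '12 12 845 3' into ints,
    collapsing consecutive duplicates as reduced-unit S2UT models do."""
    units = [int(u) for u in unit_str.split()]
    deduped = units[:1]
    for u in units[1:]:
        if u != deduped[-1]:
            deduped.append(u)
    return deduped

parse_units("12 12 845 3")  # -> [12, 845, 3]
```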
| Property | Details |
|---|---|
| Library Name | fairseq |
| Task | audio-to-audio |
| Tags | fairseq, audio, audio-to-audio, speech-to-speech-translation |
| Datasets | MuST-C, TAT, Hokkien dramas |
| License | cc-by-nc-4.0 |
🚀 Quick Start
This section provides a step-by-step guide on how to use the xm_transformer_s2ut_hk-en model for speech-to-speech translation.
💻 Usage Examples
Basic Usage
```python
import json
import os
from pathlib import Path

import IPython.display as ipd
import torchaudio
from fairseq import hub_utils
from fairseq.checkpoint_utils import load_model_ensemble_and_task_from_hf_hub
from fairseq.models.speech_to_text.hub_interface import S2THubInterface
from fairseq.models.text_to_speech import CodeHiFiGANVocoder
from fairseq.models.text_to_speech.hub_interface import VocoderHubInterface
from huggingface_hub import snapshot_download

cache_dir = os.getenv("HUGGINGFACE_HUB_CACHE")

# Load the Hokkien-English S2UT model (speech -> discrete units)
models, cfg, task = load_model_ensemble_and_task_from_hf_hub(
    "facebook/xm_transformer_s2ut_hk-en",
    arg_overrides={"config_yaml": "config.yaml", "task": "speech_to_text"},
    cache_dir=cache_dir,
)
model = models[0]
generator = task.build_generator([model], cfg)

# Translate an input audio file into a sequence of discrete units
audio, _ = torchaudio.load("/path/to/an/audio/file")
sample = S2THubInterface.get_model_input(task, audio)
unit = S2THubInterface.get_prediction(task, model, generator, sample)

# Download and load the unit HiFi-GAN vocoder for English speech synthesis
library_name = "fairseq"
cache_dir = cache_dir or (Path.home() / ".cache" / library_name).as_posix()
cache_dir = snapshot_download(
    "facebook/unit_hifigan_mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj_dur",
    cache_dir=cache_dir,
    library_name=library_name,
)
x = hub_utils.from_pretrained(
    cache_dir,
    "model.pt",
    ".",
    archive_map=CodeHiFiGANVocoder.hub_models(),
    config_yaml="config.json",
    fp16=False,
    is_vocoder=True,
)
with open(f"{x['args']['data']}/config.json") as f:
    vocoder_cfg = json.load(f)
assert len(x["args"]["model_path"]) == 1, "Too many vocoder models in the input"
vocoder = CodeHiFiGANVocoder(x["args"]["model_path"][0], vocoder_cfg)
tts_model = VocoderHubInterface(vocoder_cfg, vocoder)

# Synthesize the English waveform from the predicted units and play it
tts_sample = tts_model.get_model_input(unit)
wav, sr = tts_model.get_prediction(tts_sample)
ipd.Audio(wav, rate=sr)
```
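The speech encoder expects 16 kHz mono input, so audio at other sample rates should be resampled before being passed to `get_model_input` (in practice with `torchaudio.functional.resample`). The pure-Python helper below is only an illustrative sketch of linear-interpolation resampling, not fairseq or torchaudio API:

```python
def resample_linear(samples, src_sr, dst_sr=16_000):
    """Resample a mono signal via linear interpolation (illustrative only)."""
    if src_sr == dst_sr:
        return list(samples)
    n_out = int(len(samples) * dst_sr / src_sr)
    out = []
    for i in range(n_out):
        pos = i * src_sr / dst_sr          # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# 441 samples at 44.1 kHz become 160 samples at 16 kHz
downsampled = resample_linear([0.0] * 441, src_sr=44_100)
```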
> 📄 **License**
>
> This project is licensed under the cc-by-nc-4.0 license.