SeamlessM4T Medium
SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
This repository hosts 🤗 Hugging Face's implementation of SeamlessM4T. The original weights, along with a guide on how to run them, are available in the original hub repositories (large and medium checkpoints).
Quick Start
Model Information
| Property | Details |
|---|---|
| Model Type | SeamlessM4T Medium |
| Inference | true |
| Tags | SeamlessM4T, seamless_m4t |
| License | cc-by-nc-4.0 |
| Library Name | transformers |
| Pipeline Tag | text-to-speech |
New Version Notice
SeamlessM4T v2, an enhanced version of this model with a novel architecture, has been released. The new model outperforms SeamlessM4T v1 in both quality and inference speed for speech generation tasks.
SeamlessM4T v2 is also supported by 🤗 Transformers. For more details, refer to the model card of the new version or to the 🤗 Transformers docs.
Model Capabilities
SeamlessM4T Medium is the "medium" variant of the unified model. It handles the following tasks without relying on multiple separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
You can perform all of the above tasks with a single model, `SeamlessM4TModel`, and each task also has its own dedicated sub-model.
Features
- High-quality translation across multiple languages.
- Unified model for multiple translation and recognition tasks.
- Support for both text and speech input/output.
Installation
Installation amounts to setting up the libraries used to load the model and its processor. You need the `transformers` and `datasets` libraries, which you can install with `pip`:
pip install transformers datasets
Usage Examples
Basic Usage
First, load the processor and a checkpoint of the model:
>>> from transformers import AutoProcessor, SeamlessM4TModel
>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
You can use this model on text or audio to generate either translated text or translated audio. Here is how to use the processor to process text and audio:
>>> # load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]
>>> # process the audio
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> # process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
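The processor's feature extractor expects 16 kHz audio, while a dataset such as the one above may be stored at a different rate. A minimal sketch of bringing a waveform to 16 kHz with plain NumPy follows. The helper name `resample_linear` is hypothetical, and linear interpolation is a crude stand-in for a proper band-limited resampler such as `torchaudio.functional.resample`.

```python
import numpy as np

TARGET_RATE = 16_000  # assumption: sampling rate expected by the SeamlessM4T feature extractor


def resample_linear(audio, orig_rate, target_rate=TARGET_RATE):
    """Resample a 1-D waveform by linear interpolation (illustrative only)."""
    audio = np.asarray(audio, dtype=np.float32)
    if orig_rate == target_rate or audio.size == 0:
        return audio
    # Number of samples that cover the same duration at the new rate.
    n_target = int(round(audio.size * target_rate / orig_rate))
    # Positions of the new samples, expressed in original sample indices.
    new_positions = np.arange(n_target) * (orig_rate / target_rate)
    old_positions = np.arange(audio.size)
    return np.interp(new_positions, old_positions, audio).astype(np.float32)
```

With the sample loaded above, this could be used as `processor(audios=resample_linear(audio_sample["array"], audio_sample["sampling_rate"]), return_tensors="pt")`.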
Advanced Usage
Speech
`SeamlessM4TModel` can seamlessly generate text or speech with few or no changes. Let's target Russian voice translation:
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
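To listen to the result, the generated array can be written to a WAV file. The sketch below uses only NumPy and the standard-library `wave` module. `save_waveform` is a hypothetical helper, and the 16 kHz default is an assumption about the model's output rate; in practice it is safer to read the rate from `model.config.sampling_rate`.

```python
import wave

import numpy as np


def save_waveform(path, audio, sampling_rate=16_000):
    """Write a mono float waveform in [-1, 1] to a 16-bit PCM WAV file."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)  # number of samples written
```

For example, `save_waveform("out_from_text.wav", audio_array_from_text, model.config.sampling_rate)` would store the Russian speech generated from the text input.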
Text
Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass `generate_speech=False` to `SeamlessM4TModel.generate`. This time, let's translate to French.
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> # output_tokens[0] holds the token ids for the whole batch; decode the first sequence
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
Tips
1. Use dedicated models
`SeamlessM4TModel` is the top-level model in transformers for generating speech and text. However, you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint.
For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, and the rest of the code remains the same:
>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")
Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task. You only have to remove `generate_speech=False`.
>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")
Feel free to try out `SeamlessM4TForSpeechToText` and `SeamlessM4TForTextToSpeech` as well.
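As a quick reference, the task names listed earlier can be mapped to the dedicated classes named in this tip. The sketch below is only a lookup table. The pairing of ASR with `SeamlessM4TForSpeechToText` is an assumption based on ASR being a speech-in, text-out task, and `dedicated_class_name` is a hypothetical helper.

```python
# Task abbreviation -> name of the dedicated transformers class.
DEDICATED_MODELS = {
    "S2ST": "SeamlessM4TForSpeechToSpeech",
    "S2TT": "SeamlessM4TForSpeechToText",
    "T2ST": "SeamlessM4TForTextToSpeech",
    "T2TT": "SeamlessM4TForTextToText",
    "ASR": "SeamlessM4TForSpeechToText",  # assumption: ASR is speech in, text out
}


def dedicated_class_name(task):
    """Return the dedicated class name for a task abbreviation such as 's2st'."""
    key = task.strip().upper()
    if key not in DEDICATED_MODELS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(DEDICATED_MODELS)}")
    return DEDICATED_MODELS[key]
```

The class itself can then be fetched with `getattr(transformers, dedicated_class_name("t2tt"))` and loaded with `from_pretrained` as shown above.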
2. Change the speaker identity
You can change the speaker used for speech synthesis with the `spkr_id` argument. Some `spkr_id` values work better for certain languages!
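One way to find a voice that suits the target language is to synthesize the same input with several speaker ids and compare the results by ear. The sketch below assumes a loaded `model` and processed `inputs` as in the earlier snippets. `synthesize_with_speakers` and any candidate ids you pass are hypothetical, and the valid range of `spkr_id` depends on the checkpoint.

```python
def synthesize_with_speakers(model, inputs, tgt_lang, speaker_ids):
    """Generate one waveform per speaker id and return them keyed by id."""
    waveforms = {}
    for spkr_id in speaker_ids:
        # With speech generation enabled, the first element of the output is the waveform.
        output = model.generate(**inputs, tgt_lang=tgt_lang, spkr_id=spkr_id)
        waveforms[spkr_id] = output[0].cpu().numpy().squeeze()
    return waveforms
```

For example, `synthesize_with_speakers(model, text_inputs, "rus", [0, 1, 2])` would return three arrays that can be saved and compared.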
3. Change the generation strategy
You can use different generation strategies for speech and text generation. For example, `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)` will perform beam-search decoding on the text model and multinomial sampling on the speech model.
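The `text_` and `speech_` prefixes tell `generate` which sub-model an argument is meant for. The following sketch imitates that routing in plain Python to make the convention concrete. It illustrates the idea and is not the library's actual code; `split_generation_kwargs` is a hypothetical name, and `max_new_tokens=50` is an arbitrary illustrative value. It assumes that unprefixed arguments go to both sub-models and that prefixed ones take priority.

```python
def split_generation_kwargs(kwargs):
    """Split generate() keyword arguments into text-model and speech-model dicts."""
    shared, text_only, speech_only = {}, {}, {}
    for key, value in kwargs.items():
        if key.startswith("text_"):
            text_only[key[len("text_"):]] = value
        elif key.startswith("speech_"):
            speech_only[key[len("speech_"):]] = value
        else:
            shared[key] = value
    # Prefixed arguments override shared ones for their own sub-model.
    return {**shared, **text_only}, {**shared, **speech_only}


text_kwargs, speech_kwargs = split_generation_kwargs(
    {"max_new_tokens": 50, "text_num_beams": 4, "speech_do_sample": True}
)
print(text_kwargs)    # arguments the text model would receive
print(speech_kwargs)  # arguments the speech model would receive
```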
4. Generate speech and text at the same time
Use `return_intermediate_token_ids=True` with `SeamlessM4TModel` to return both speech and text!
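A sketch of reading both results from a single call follows. It assumes the returned object exposes the audio as `waveform` and the intermediate text token ids as `sequences`, each with a leading batch dimension. Treat these attribute names, and the helper `speech_and_text`, as assumptions to check against the Transformers documentation for your installed version.

```python
def speech_and_text(model, processor, inputs, tgt_lang):
    """Run one generate() call and return (waveform, translated_text)."""
    output = model.generate(**inputs, tgt_lang=tgt_lang, return_intermediate_token_ids=True)
    # Assumed attribute names on the returned object (see the note above).
    waveform = output.waveform[0].cpu().numpy().squeeze()
    # Decode the token ids of the first sequence in the batch.
    text = processor.decode(output.sequences[0].tolist(), skip_special_tokens=True)
    return waveform, text
```

For example, `speech_and_text(model, processor, audio_inputs, "fra")` would return the French audio together with its text.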
License
This project is licensed under the cc-by-nc-4.0 license.