# SeamlessM4T Large

SeamlessM4T is a collection of models that offer high-quality translation, enabling people from diverse linguistic backgrounds to communicate effortlessly through speech and text.
## 🚀 Quick Start

SeamlessM4T Large is a powerful unified model that supports multiple translation and recognition tasks. It can handle various input types (speech and text) and generate different output types (speech and text) across multiple languages.
## ✨ Features

- **Multilingual support**:
  - 🎤 101 languages for speech input.
  - ⌨️ 96 languages for text input/output.
  - 🗣️ 35 languages for speech output.
- **Multitask capability**: a single model handles multiple tasks without relying on separate models, including:
  - Speech-to-speech translation (S2ST)
  - Speech-to-text translation (S2TT)
  - Text-to-speech translation (T2ST)
  - Text-to-text translation (T2TT)
  - Automatic speech recognition (ASR)
## 📦 Installation

The model is available through the 🤗 Transformers library. Install it with `pip`:

```bash
pip install transformers
```

The SeamlessM4T tokenizer is SentencePiece-based, so you may also need `pip install sentencepiece` if it is not already present in your environment.
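SeamlessM4T support landed in a relatively recent Transformers release (4.35.0, to the best of my recollection — treat that floor as an assumption), so it can be worth verifying that your installed version is new enough. A minimal sketch of such a check:

```python
def supports_seamless_m4t(version: str, minimum: str = "4.35.0") -> bool:
    """Return True if `version` is at least `minimum` (numeric compare).

    The 4.35.0 floor is an assumption about when SeamlessM4T support
    was added to Transformers; adjust it if the release notes differ.
    """
    def to_tuple(v: str) -> tuple:
        # Keep only leading numeric components ("4.36.1" -> (4, 36, 1)).
        return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

    return to_tuple(version) >= to_tuple(minimum)

# Example usage against the installed library:
# import transformers
# print(supports_seamless_m4t(transformers.__version__))
```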
## 💻 Usage Examples

### Basic Usage

First, load the processor and a checkpoint of the model:

```python
>>> from transformers import AutoProcessor, SeamlessM4TModel

>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-large")
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-large")
```

Here is how to use the processor to process text and audio:

```python
>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]

>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")

>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```
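Note that the feature extractor expects mono audio at 16 kHz; if your sample uses a different rate, resample it before calling the processor (`torchaudio.functional.resample` is the usual tool). As a dependency-free illustration of the idea — with the 16 kHz target taken as the model's assumed input rate — here is a linear-interpolation resampler:

```python
def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a mono waveform to `dst_rate` via linear interpolation.

    The 16 kHz default reflects the input rate the SeamlessM4T feature
    extractor is assumed to expect; for production use, prefer a proper
    resampler such as torchaudio.functional.resample.
    """
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        # Map the output index back to a fractional input position.
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```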
### Advanced Usage

#### Speech Translation

```python
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```
#### Text Translation

```python
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)

>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)
```
## Tips

### 1. Use dedicated models

You can use dedicated models to reduce the memory footprint. For example:

```python
>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-large")
```

Or for text-to-text translation:

```python
>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-large")
```

You can also try out `SeamlessM4TForSpeechToText` and `SeamlessM4TForTextToSpeech`.
### 2. Change the speaker identity

You can change the speaker used for speech synthesis with the `spkr_id` argument. Some `spkr_id` values work better than others for certain languages.
### 3. Change the generation strategy

You can use different generation strategies for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`.
### 4. Generate speech and text at the same time

Use `return_intermediate_token_ids=True` with `SeamlessM4TModel` to return both speech and text.
## 📚 Documentation

- **New version information**:
  - SeamlessM4T v2, an improved version with a novel architecture, has been released. It improves over SeamlessM4T v1 in quality and inference speed for speech generation tasks.
  - SeamlessM4T v2 is also supported by 🤗 Transformers; more information can be found in the model card of the new version or directly in the 🤗 Transformers docs.
## 📄 License

This model is licensed under the `cc-by-nc-4.0` license.