hf-seamless-m4t-medium Open-source Multilingual Translation Model - Free Cross-language Communication for Speech and Text

Hf Seamless M4t Medium

Developed by facebook

SeamlessM4T is a multilingual translation model that supports both speech and text input/output, enabling cross-language communication.

Downloads 14.74k

Release Time: 8/28/2023

Model Overview

SeamlessM4T is a unified translation model that supports speech-to-speech, speech-to-text, text-to-speech, and text-to-text translation tasks across multiple languages.

Model Features

Multilingual Support

Accepts speech input in 101 languages and generates speech output in 35, in addition to text input and output.

Unified Model Architecture

A single model handles multiple translation tasks without relying on separate independent models.

Speech and Text Interconversion

Supports bidirectional conversion between speech and text, enabling seamless communication.

Model Capabilities

Speech-to-Speech Translation

Speech-to-Text Translation

Text-to-Speech Translation

Text-to-Text Translation

Automatic Speech Recognition

Use Cases

Cross-Language Communication

Real-Time Speech Translation

Translates speech in one language into speech or text in another.

Enables barrier-free communication between speakers of different languages.

Multilingual Content Creation

Quickly translates text or speech content into multiple language versions.

Enhances efficiency and multilingual coverage in content creation.

Assistive Tools

Speech Transcription

Automatically transcribes speech content into text.

Improves accessibility and searchability of speech content.

🚀 SeamlessM4T Medium

SeamlessM4T is a collection of models crafted to offer high-quality translation, enabling individuals from diverse linguistic communities to communicate effortlessly via speech and text.

This repository holds 🤗 Hugging Face's implementation of SeamlessM4T. You can access the original weights and a guide on how to run them in the original hub repositories (large and medium checkpoints).

🚀 Quick Start

Model Information

Property	Details
Model Type	SeamlessM4T Medium
Inference	true
Tags	SeamlessM4T, seamless_m4t
License	cc-by-nc-4.0
Library Name	transformers
Pipeline Tag	text-to-speech

New Version Notice

🌟 SeamlessM4T v2, an enhanced version of this model with a novel architecture, has been released. The new model outperforms SeamlessM4T v1 in quality and in inference speed for speech generation tasks.

SeamlessM4T v2 is also supported by 🤗 Transformers. For more details, refer to the model card of the new version or to the 🤗 Transformers docs.

Model Capabilities

SeamlessM4T Medium supports:

📥 101 languages for speech input
⌨️ 96 languages for text input/output
🗣️ 35 languages for speech output

This is the "medium" variant of the unified model, capable of handling multiple tasks without relying on multiple separate models:

Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)

You can perform all of the above tasks with a single model, SeamlessM4TModel; each task also has its own dedicated sub-model.

✨ Features

High-quality translation across multiple languages.
Unified model for multiple translation and recognition tasks.
Support for both text and speech input/output.

📦 Installation

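The examples below need the transformers and datasets libraries, plus PyTorch, since the snippets request return_tensors="pt". Nothing else has to be set up by hand: the model and processor files are downloaded the first time they are loaded. You can install the two libraries with pip: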

pip install transformers datasets

💻 Usage Examples

Basic Usage

First, load the processor and a checkpoint of the model:

>>> from transformers import AutoProcessor, SeamlessM4TModel

>>> processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
>>> model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

You can use this model on text or audio to generate either translated text or translated audio. Here is how to use the processor to process text and audio:

>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]

>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")

>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")

Advanced Usage

Speech

SeamlessM4TModel can seamlessly generate text or speech with few or no changes. Let's target Russian voice translation:

>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
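
The arrays produced above are raw waveforms, so they still have to be written to a file before they can be played. The following is a minimal sketch using only Python's standard library; save_wav is a hypothetical helper, not part of transformers, and the 16 kHz default is an assumption that should be checked against model.config.sampling_rate.

```python
import struct
import wave


def save_wav(path, samples, sample_rate=16000):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file.

    `sample_rate` defaults to 16 kHz as an assumption; pass the value
    reported by the model configuration when it is available.
    Returns the number of frames written.
    """
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        # WAV data is little-endian, hence the "<" in the format string.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)
```

With the variables from the snippet above, a call could look like save_wav("from_text.wav", audio_array_from_text, sample_rate=model.config.sampling_rate).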

Text

Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass generate_speech=False to SeamlessM4TModel.generate. This time, let's translate to French.

>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)

>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)

Tips

1. Use dedicated models

SeamlessM4TModel is the top-level model in transformers for generating speech and text. However, you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task, and the rest of the code remains the same:

>>> from transformers import SeamlessM4TForSpeechToSpeech
>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("facebook/hf-seamless-m4t-medium")

Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task. You only have to remove generate_speech=False.

>>> from transformers import SeamlessM4TForTextToText
>>> model = SeamlessM4TForTextToText.from_pretrained("facebook/hf-seamless-m4t-medium")

Feel free to try out SeamlessM4TForSpeechToText and SeamlessM4TForTextToSpeech as well.
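
To keep the task-to-class correspondence in one place, a small lookup table can help. This is only a sketch: the class names are the four mentioned in this model card, while DEDICATED_MODELS, dedicated_model_name and load_dedicated_model are hypothetical helpers. Routing ASR to the speech-to-text class is an assumption.

```python
# Task abbreviation -> name of the dedicated transformers class.
# The table is illustrative; transformers does not provide it.
DEDICATED_MODELS = {
    "S2ST": "SeamlessM4TForSpeechToSpeech",
    "S2TT": "SeamlessM4TForSpeechToText",
    "T2ST": "SeamlessM4TForTextToSpeech",
    "T2TT": "SeamlessM4TForTextToText",
    # Assumption: ASR reuses the speech-to-text model.
    "ASR": "SeamlessM4TForSpeechToText",
}


def dedicated_model_name(task):
    """Return the dedicated class name for a task abbreviation (case-insensitive)."""
    try:
        return DEDICATED_MODELS[task.upper()]
    except KeyError:
        expected = ", ".join(sorted(DEDICATED_MODELS))
        raise ValueError(f"Unknown task {task!r}; expected one of: {expected}") from None


def load_dedicated_model(task, checkpoint="facebook/hf-seamless-m4t-medium"):
    """Load the dedicated model for a task from the given checkpoint."""
    import transformers  # imported lazily so the lookup works without it

    model_class = getattr(transformers, dedicated_model_name(task))
    return model_class.from_pretrained(checkpoint)
```

For instance, load_dedicated_model("T2TT") would load the text-to-text model from the same checkpoint used throughout this card.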

2. Change the speaker identity

You can change the speaker used for speech synthesis with the spkr_id argument. Some spkr_id values work better for certain languages!
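
Since the best speaker depends on the language, one way to choose is to synthesize the same input with a few candidates and listen to each. The sketch below assumes only the spkr_id argument named above; generate_for_speakers is a hypothetical helper, and the speaker ids passed to it are arbitrary examples, not a documented range.

```python
def generate_for_speakers(model, inputs, tgt_lang, speaker_ids):
    """Return {speaker_id: waveform}, calling model.generate once per speaker.

    `model` is expected to behave like SeamlessM4TModel and `inputs` like the
    processor output used in the earlier snippets.
    """
    waveforms = {}
    for speaker_id in speaker_ids:
        output = model.generate(**inputs, tgt_lang=tgt_lang, spkr_id=speaker_id)
        # The first element of the output is the waveform, as in the
        # speech-generation snippets earlier in this card.
        waveforms[speaker_id] = output[0]
    return waveforms
```

A call such as generate_for_speakers(model, text_inputs, "rus", [0, 1, 2]) would return one waveform per candidate speaker.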

3. Change the generation strategy

You can use different generation strategies for speech and text generation. For example, .generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True) will perform beam-search decoding on the text model and multinomial sampling on the speech model.
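
The text_ and speech_ prefixes in that call can be assembled programmatically when the two strategies are configured separately. A minimal sketch follows; build_generate_kwargs is a hypothetical helper. It also assumes, based on my reading of the generate documentation, that unprefixed options are forwarded to both sub-models.

```python
def build_generate_kwargs(shared=None, text=None, speech=None):
    """Merge generation options into one kwargs dict for model.generate.

    shared: options passed without a prefix (assumed to reach both sub-models)
    text:   options for the text model, emitted with a "text_" prefix
    speech: options for the speech model, emitted with a "speech_" prefix
    """
    kwargs = dict(shared or {})
    for prefix, options in (("text_", text), ("speech_", speech)):
        for name, value in (options or {}).items():
            kwargs[prefix + name] = value
    return kwargs
```

The example in the paragraph above then corresponds to model.generate(input_ids=input_ids, **build_generate_kwargs(text={"num_beams": 4}, speech={"do_sample": True})).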

4. Generate speech and text at the same time

Use return_intermediate_token_ids=True with SeamlessM4TModel to return both speech and text!
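
A sketch of how the combined output might be unpacked is shown below. It assumes the returned object exposes waveform and sequences attributes, which is my understanding of the generation output in transformers and should be verified against the installed version. generate_speech_and_text is a hypothetical helper.

```python
def generate_speech_and_text(model, processor, inputs, tgt_lang):
    """Return (waveform, text) for the first item of the batch.

    Assumes the output of model.generate has `waveform` and `sequences`
    attributes when return_intermediate_token_ids=True is passed.
    """
    output = model.generate(
        **inputs, tgt_lang=tgt_lang, return_intermediate_token_ids=True
    )
    waveform = output.waveform[0]  # first waveform in the batch
    token_ids = output.sequences[0].tolist()  # intermediate text tokens
    text = processor.decode(token_ids, skip_special_tokens=True)
    return waveform, text
```

For example, waveform, text = generate_speech_and_text(model, processor, text_inputs, "rus") would yield the Russian speech together with the Russian text it was synthesized from.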

📄 License

This project is licensed under the cc-by-nc-4.0 license.
