🚀 Massively Multilingual Speech (MMS) - Finetuned ASR - FL102
This checkpoint is a fine-tuned model for multilingual Automatic Speech Recognition (ASR) and is part of Facebook's Massively Multilingual Speech project. It can transcribe over 100 languages, offering efficient and accurate speech recognition.
🚀 Quick Start
This MMS checkpoint can be used with Transformers to transcribe audio in 102 different languages. Let's look at a simple example.
Basic Usage
First, we install `transformers` and some other libraries:

```
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```
Note: In order to use MMS you need to have at least `transformers >= 4.30` installed. If the 4.30 version is not yet available on PyPI, make sure to install `transformers` from source:

```
pip install git+https://github.com/huggingface/transformers.git
```
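You can quickly confirm that the installed version is recent enough:

```python
import transformers

# MMS support requires transformers 4.30 or newer
print(transformers.__version__)
```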
Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz:
```python
from datasets import load_dataset, Audio

# English sample
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# French sample
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
fr_sample = next(iter(stream_data))["audio"]["array"]
```
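If your audio comes from local files rather than a `datasets` stream, it still needs to be resampled to 16,000 Hz first. Below is a minimal sketch using `torchaudio` (installed above); the file name `my_audio.wav` is a hypothetical placeholder:

```python
import torchaudio

# Load a local file and resample it to the 16 kHz the model expects
waveform, sample_rate = torchaudio.load("my_audio.wav")  # hypothetical file
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
local_sample = waveform[0].numpy()  # first channel as a 1-D array
```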
Next, we load the model and processor:

```python
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch

model_id = "facebook/mms-1b-fl102"

processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
```
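The checkpoint has roughly 1 billion parameters, so inference benefits from a GPU when one is available. A small sketch for device placement (nothing MMS-specific, just standard PyTorch):

```python
# Move the model to a GPU if available; inputs must be moved the same way
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```

If you do this, remember to also call `inputs.to(device)` on the processed audio before the forward pass.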
Now we process the audio data, pass it to the model, and decode the model output into a transcription, just as we usually do for Wav2Vec2 models such as `facebook/wav2vec2-base-960h`:
```python
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
```
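To transcribe several clips of the same language at once, the processor can also pad them into a single batch. A minimal sketch, assuming a fresh English stream; `padding=True` and `batch_decode` are standard Wav2Vec2 processor features:

```python
from itertools import islice
from datasets import load_dataset, Audio

# Take two English clips from a fresh stream for one padded batch
stream = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream = stream.cast_column("audio", Audio(sampling_rate=16000))
batch = [ex["audio"]["array"] for ex in islice(stream, 2)]

inputs = processor(batch, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

transcriptions = processor.batch_decode(torch.argmax(logits, dim=-1))
```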
Advanced Usage
We can now keep the same model in memory and simply switch out the language adapters by calling the convenient `load_adapter()` function for the model and `set_target_lang()` for the tokenizer. We pass the target language as an input: "fra" for French.
```python
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
```
In the same way, the language can be switched out for all other supported languages. To list them, please have a look at:

```python
processor.tokenizer.vocab.keys()
```
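Putting the two steps together, a small convenience wrapper (a sketch, not part of the MMS API) makes switching languages a one-liner:

```python
def transcribe(audio_array, lang):
    """Hypothetical helper: swap in the adapter for `lang` and transcribe one 16 kHz clip."""
    processor.tokenizer.set_target_lang(lang)
    model.load_adapter(lang)
    inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return processor.decode(torch.argmax(logits, dim=-1)[0])

print(transcribe(fr_sample, "fra"))
print(transcribe(en_sample, "eng"))
```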
For more details, please have a look at the official docs.
✨ Features
- Multilingual Support: This model supports 102 languages, providing speech recognition across a wide range of languages.
- Fine-tuned Model: Based on the Wav2Vec2 architecture, it uses adapter models to achieve high-quality speech recognition (see the sketch after this list).
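If you already know the target language up front, the adapter can also be selected directly when loading the model. This sketch uses the `target_lang` argument documented for MMS checkpoints; `ignore_mismatched_sizes=True` is needed because each language has its own output vocabulary size:

```python
from transformers import Wav2Vec2ForCTC, AutoProcessor

model_id = "facebook/mms-1b-fl102"

# Load the French adapter at load time instead of calling load_adapter() later
processor = AutoProcessor.from_pretrained(model_id, target_lang="fra")
model = Wav2Vec2ForCTC.from_pretrained(model_id, target_lang="fra", ignore_mismatched_sizes=True)
```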
📦 Installation
```
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```

If the 4.30 version of `transformers` is not yet available on PyPI, install it from source:

```
pip install git+https://github.com/huggingface/transformers.git
```
📚 Documentation
Supported Languages
This model supports 102 languages. Click the following to toggle all supported languages of this checkpoint, listed by their ISO 639-3 codes. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview. A programmatic check is sketched after the list.
- afr
- amh
- ara
- asm
- ast
- azj-script_latin
- bel
- ben
- bos
- bul
- cat
- ceb
- ces
- ckb
- cmn-script_simplified
- cym
- dan
- deu
- ell
- eng
- est
- fas
- fin
- fra
- ful
- gle
- glg
- guj
- hau
- heb
- hin
- hrv
- hun
- hye
- ibo
- ind
- isl
- ita
- jav
- jpn
- kam
- kan
- kat
- kaz
- kea
- khm
- kir
- kor
- lao
- lav
- lin
- lit
- ltz
- lug
- luo
- mal
- mar
- mkd
- mlt
- mon
- mri
- mya
- nld
- nob
- npi
- nso
- nya
- oci
- orm
- ory
- pan
- pol
- por
- pus
- ron
- rus
- slk
- slv
- sna
- snd
- som
- spa
- srp-script_latin
- swe
- swh
- tam
- tel
- tgk
- tgl
- tha
- tur
- ukr
- umb
- urd-script_arabic
- uzb-script_latin
- vie
- wol
- xho
- yor
- yue-script_traditional
- zlm
- zul
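Programmatically, a code from this list can be checked against the tokenizer's vocabulary (the same `processor.tokenizer.vocab` shown in the Quick Start) before loading its adapter:

```python
# True if the checkpoint ships a vocabulary (and adapter) for this language
print("fra" in processor.tokenizer.vocab)
```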
Model details
| Property | Details |
|----------|---------|
| Developed by | Vineel Pratap et al. |
| Model Type | Multilingual Automatic Speech Recognition model |
| Language(s) | 100+ languages, see supported languages |
| License | CC BY-NC 4.0 |
| Num parameters | 1 billion |
| Audio sampling rate | 16,000 Hz |
| Cite as | `@article{pratap2023mms, title={Scaling Speech Technology to 1,000+ Languages}, author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli}, journal={arXiv}, year={2023}}` |
📄 License
This model is released under the CC BY-NC 4.0 license.