🚀 Massively Multilingual Speech (MMS) - Finetuned ASR - FL102
This checkpoint is a fine-tuned model for multilingual Automatic Speech Recognition (ASR) and is part of Facebook's Massively Multilingual Speech project. It can transcribe over 100 languages, offering efficient and accurate speech recognition.
🚀 Quick Start
This MMS checkpoint can be used with Transformers to transcribe audio in 102 different languages. Let's look at a simple example.
Basic Usage
First, we install `transformers` and some other libraries:

```
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```
Note: In order to use MMS you need to have at least `transformers >= 4.30` installed. If the 4.30 version is not yet available on PyPI, make sure to install `transformers` from source:

```
pip install git+https://github.com/huggingface/transformers.git
```
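You can quickly confirm that the installed version is recent enough:

```python
import transformers

# MMS support requires transformers 4.30 or newer
print(transformers.__version__)
```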
Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz:
```python
from datasets import load_dataset, Audio

# English sample
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# French sample
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
fr_sample = next(iter(stream_data))["audio"]["array"]
```
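If your audio comes from local files rather than a `datasets` stream, it still needs to be resampled to 16,000 Hz first. Below is a minimal sketch using `torchaudio` (installed above); the file name `my_audio.wav` is a hypothetical placeholder:

```python
import torchaudio

# Load a local file and resample it to the 16 kHz the model expects
waveform, sample_rate = torchaudio.load("my_audio.wav")  # hypothetical file
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
local_sample = waveform[0].numpy()  # first channel as a 1-D array
```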
Next, we load the model and processor:

```python
from transformers import Wav2Vec2ForCTC, AutoProcessor
import torch

model_id = "facebook/mms-1b-fl102"

processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)
```
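The checkpoint has roughly 1 billion parameters, so inference benefits from a GPU when one is available. A small sketch for device placement (nothing MMS-specific, just standard PyTorch):

```python
# Move the model to a GPU if available; inputs must be moved the same way
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```

If you do this, remember to also call `inputs.to(device)` on the processed audio before the forward pass.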
Now we process the audio data, pass it to the model, and decode the model output into a transcription, just as we usually do for Wav2Vec2 models such as `facebook/wav2vec2-base-960h`:
```python
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
```
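To transcribe several clips of the same language at once, the processor can also pad them into a single batch. A minimal sketch, assuming a fresh English stream; `padding=True` and `batch_decode` are standard Wav2Vec2 processor features:

```python
from itertools import islice
from datasets import load_dataset, Audio

# Take two English clips from a fresh stream for one padded batch
stream = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream = stream.cast_column("audio", Audio(sampling_rate=16000))
batch = [ex["audio"]["array"] for ex in islice(stream, 2)]

inputs = processor(batch, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

transcriptions = processor.batch_decode(torch.argmax(logits, dim=-1))
```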
Advanced Usage
We can now keep the same model in memory and simply switch out the language adapters by calling the convenient `load_adapter()` function for the model and `set_target_lang()` for the tokenizer. We pass the target language as an input: "fra" for French.
```python
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

ids = torch.argmax(outputs, dim=-1)[0]
transcription = processor.decode(ids)
```
In the same way, the language can be switched out for all other supported languages. To list them, please have a look at:

```python
processor.tokenizer.vocab.keys()
```
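Putting the two steps together, a small convenience wrapper (a sketch, not part of the MMS API) makes switching languages a one-liner:

```python
def transcribe(audio_array, lang):
    """Hypothetical helper: swap in the adapter for `lang` and transcribe one 16 kHz clip."""
    processor.tokenizer.set_target_lang(lang)
    model.load_adapter(lang)
    inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return processor.decode(torch.argmax(logits, dim=-1)[0])

print(transcribe(fr_sample, "fra"))
print(transcribe(en_sample, "eng"))
```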
For more details, please have a look at the official docs.
✨ Features
- Multilingual Support: This model supports 102 languages, providing speech recognition across a wide range of languages.
- Fine-tuned Model: Based on the Wav2Vec2 architecture, it uses adapter models to achieve high-quality speech recognition (see the sketch after this list).
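If you already know the target language up front, the adapter can also be selected directly when loading the model. This sketch uses the `target_lang` argument documented for MMS checkpoints; `ignore_mismatched_sizes=True` is needed because each language has its own output vocabulary size:

```python
from transformers import Wav2Vec2ForCTC, AutoProcessor

model_id = "facebook/mms-1b-fl102"

# Load the French adapter at load time instead of calling load_adapter() later
processor = AutoProcessor.from_pretrained(model_id, target_lang="fra")
model = Wav2Vec2ForCTC.from_pretrained(model_id, target_lang="fra", ignore_mismatched_sizes=True)
```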
📦 Installation
```
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```

If the 4.30 version of `transformers` is not yet available on PyPI, install it from source:

```
pip install git+https://github.com/huggingface/transformers.git
```
📚 Documentation
Supported Languages
This model supports 102 languages. Click the following to toggle all supported languages of this checkpoint, listed by their ISO 639-3 codes. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview. A programmatic check is sketched after the list.
- afr
- amh
- ara
- asm
- ast
- azj-script_latin
- bel
- ben
- bos
- bul
- cat
- ceb
- ces
- ckb
- cmn-script_simplified
- cym
- dan
- deu
- ell
- eng
- est
- fas
- fin
- fra
- ful
- gle
- glg
- guj
- hau
- heb
- hin
- hrv
- hun
- hye
- ibo
- ind
- isl
- ita
- jav
- jpn
- kam
- kan
- kat
- kaz
- kea
- khm
- kir
- kor
- lao
- lav
- lin
- lit
- ltz
- lug
- luo
- mal
- mar
- mkd
- mlt
- mon
- mri
- mya
- nld
- nob
- npi
- nso
- nya
- oci
- orm
- ory
- pan
- pol
- por
- pus
- ron
- rus
- slk
- slv
- sna
- snd
- som
- spa
- srp-script_latin
- swe
- swh
- tam
- tel
- tgk
- tgl
- tha
- tur
- ukr
- umb
- urd-script_arabic
- uzb-script_latin
- vie
- wol
- xho
- yor
- yue-script_traditional
- zlm
- zul
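Programmatically, a code from this list can be checked against the tokenizer's vocabulary (the same `processor.tokenizer.vocab` shown in the Quick Start) before loading its adapter:

```python
# True if the checkpoint ships a vocabulary (and adapter) for this language
print("fra" in processor.tokenizer.vocab)
```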
Model details
| Property | Details |
|----------|---------|
| Developed by | Vineel Pratap et al. |
| Model Type | Multilingual Automatic Speech Recognition model |
| Language(s) | 100+ languages, see supported languages |
| License | CC BY-NC 4.0 |
| Num parameters | 1 billion |
| Audio sampling rate | 16,000 Hz |
| Cite as | `@article{pratap2023mms, title={Scaling Speech Technology to 1,000+ Languages}, author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli}, journal={arXiv}, year={2023}}` |
📄 License
This model is released under the CC BY-NC 4.0 license.