Massively Multilingual Speech (MMS) - Fine-tuned LID
This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's Massively Multilingual Speech (MMS) project. It classifies raw audio input into a probability distribution over 512 output classes, each of which represents a language.
Quick Start
This MMS checkpoint can be used with Transformers to identify the spoken language of an audio clip. It can recognize any of the 512 languages listed in the Supported Languages section below.
Let's look at a simple example.
First, we install transformers and some other libraries:
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
Note: In order to use MMS you need to have at least transformers >= 4.30 installed. If the 4.30 version is not yet available on PyPI, make sure to install transformers from source:
pip install git+https://github.com/huggingface/transformers.git
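If you are unsure which version you ended up with, a quick sanity check (a minimal sketch) is to print the installed version:

import transformers
print(transformers.__version__)  # should be 4.30.0 or higher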
Next, we load a couple of audio samples via datasets. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).
from datasets import load_dataset, Audio
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
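If your audio comes from a local file instead of a datasets stream, you can load and resample it yourself. The sketch below uses torchaudio and a placeholder file name; any loader that produces a 16 kHz mono float array will work:

import torchaudio

# "my_audio.wav" is a placeholder path for illustration
waveform, sample_rate = torchaudio.load("my_audio.wav")
# resample to the 16 kHz rate the model expects
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
# collapse to mono and convert to a 1-D array
sample = waveform.mean(dim=0).numpy()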
Next, we load the model and processor
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
model_id = "facebook/mms-lid-512"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
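Inference also runs on a GPU if one is available. A minimal sketch, assuming a CUDA device, is to move the model (and later the processed inputs) to that device:

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# after processing: inputs = inputs.to(device)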
Now we process the audio data and pass it to the model, which classifies it into one of the language classes, just as for other Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition:
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
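Besides the single most likely label, you can turn the logits into a probability distribution and inspect the top candidates. A minimal sketch, reusing the outputs from the last forward pass above:

# convert logits to probabilities over the 512 language classes
probs = torch.softmax(outputs, dim=-1)[0]
top5 = torch.topk(probs, k=5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.3f}")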
To see all the languages a checkpoint supports, you can print out the language labels as follows:
model.config.id2label.values()
For more details about the architecture, please have a look at the official docs.
Features
- Multilingual Support: the model covers 512 languages, providing wide-ranging spoken language recognition.
- Fine-tuned for LID: the checkpoint is fine-tuned specifically for speech language identification (classifying which language is being spoken).
Usage Examples
Basic Usage
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]
model_id = "facebook/mms-lid-512"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
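Since the clip above comes from the English split of Common Voice, the detected label should be the ISO 639-3 code for English:

print(detected_lang)  # expected to be 'eng' for an English-language clip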
Advanced Usage
# reuses the imports, processor and model from the Basic Usage example above
languages = ["en", "ar"]

for lang in languages:
    stream_data = load_dataset("mozilla-foundation/common_voice_13_0", lang, split="test", streaming=True)
    stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
    sample = next(iter(stream_data))["audio"]["array"]

    inputs = processor(sample, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs).logits

    lang_id = torch.argmax(outputs, dim=-1)[0].item()
    detected_lang = model.config.id2label[lang_id]
    print(f"Detected language for {lang}: {detected_lang}")
Documentation
Supported Languages
This model supports 512 languages. Click below to toggle the full list of languages supported by this checkpoint, given as ISO 639-3 codes. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview. A short snippet for checking a specific code programmatically follows the list.
- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- kmr
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kik
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- cak
- fao
- glg
- ltz
- xog
- lao
- mlt
- sin
- aka
- sna
- che
- mam
- ita
- quc
- srp
- mri
- tuv
- nno
- pus
- eus
- kbp
- ory
- lug
- bre
- luo
- nhx
- slk
- ewe
- fin
- rif
- dan
- yid
- yao
- mos
- quh
- hne
- xon
- new
- quy
- est
- dyu
- ttq
- bam
- pse
- uig
- sck
- ngl
- tso
- mup
- dga
- seh
- lis
- wal
- ctg
- bfz
- bxk
- ceb
- kru
- war
- khg
- bbc
- thl
- vmw
- zne
- sid
- tpi
- nym
- bgq
- bfy
- hlb
- teo
- fon
- kfx
- bfa
- mag
- ayr
- any
- mnk
- adx
- ava
- hyw
- san
- kek
- chv
- kri
- btx
- nhy
- dnj
- lon
- men
- ium
- nga
- nsu
- prk
- kir
- bom
- run
- hwc
- mnw
- ubl
- kin
- rkt
- xmm
- iba
- gux
- ses
- wsg
- tir
- gbm
- mai
- nyy
- nan
- nyn
- gog
- ngu
- hoc
- nyf
- sus
- bcc
- hak
- grt
- suk
- nij
- kaa
- bem
- rmy
- nus
- ach
- awa
- dip
- rim
- nhe
- pcm
- kde
- tem
- quz
- bba
- kbr
- taj
- dik
- dgo
- bgc
- xnr
- kac
- laj
- dag
- ktb
- mgh
- shn
- oci
- zyb
- alz
- wol
- guw
- nia
- bci
- sba
- kab
- nnb
- ilo
- mfe
- xpe
- bcl
- haw
- mad
- ljp
- gmv
- nyo
- kxm
- nod
- sag
- sas
- myx
- sgw
- mak
- kfy
- jam
- lgg
- nhi
- mey
- sgj
- hay
- pam
- heh
- nhw
- yua
- shi
- mrw
- hil
- pag
- cce
- npl
- ace
- kam
- min
- pko
- toi
- ncj
- umb
- hno
- ban
- syl
- bxg
- nse
- xho
- mkw
- nch
- mas
- bum
- mww
- epo
- tzm
- zul
- lrc
- ibo
- abk
- azz
- guz
- ksw
- lus
- ckb
- mer
- pov
- rhg
- knc
- tum
- nso
- bho
- ndc
- ijc
- qug
- lub
- srr
- mni
- zza
- dje
- tiv
- gle
- lua
- swk
- ada
- lic
- skr
- mfa
- bto
- unr
- hdy
- kea
- glk
- ast
- nup
- sat
- ktu
- bhb
- sgc
- dks
- ncl
- emk
- urh
- tsc
- idu
- igb
- its
- kng
- kmb
- tsn
- bin
- gom
- ven
- sef
- sco
- trp
- glv
- haq
- kha
- rmn
- sot
- sou
- gno
- igl
- efi
- nde
- rki
- kjg
- fan
- wci
- bjn
- pmy
- bqi
- ina
- hni
- the
- nuz
- ajg
- ymm
- fmu
- nyk
- snk
- esg
- thq
- pht
- wes
- pnb
- phr
- mui
- tkt
- bug
- mrr
- kas
- zgb
- lir
- vah
- ssw
- iii
- brx
- rwr
- kmc
- dib
- pcc
- zyn
- hea
- hms
- thr
- wbr
- bfb
- wtm
- blk
- dhd
- swv
- zzj
- niq
- mtr
- gju
- kjp
- haz
- shy
- nbl
- aii
- sjp
- bns
- brh
- msi
- tsg
- tcy
- kbl
- noe
- tyz
- ahr
- aar
- wuu
- kbd
- bca
- pwr
- hsn
- kua
- tdd
- bgp
- abs
- zlj
- ebo
- bra
- nhp
- tts
- zyj
- lmn
- cqd
- dcc
- cjk
- bfr
- bew
- arg
- drs
- chw
- bej
- bjj
- ibb
- tig
- nut
- jax
- tdg
- nlv
- pch
- fvr
- mlq
- kfr
- nhn
- tji
- hoj
- cpx
- cdo
- bgn
- btm
- trf
- daq
- max
- nba
- mut
- hnd
- ryu
- abr
- sop
- odk
- nap
- gbr
- czh
- vls
- gdx
- yaf
- sdh
- anw
- ttj
- nhg
- cgg
- ifm
- mdh
- scn
- lki
- luz
- stv
- kmz
- nds
- mtq
- knn
- mnp
- bar
- mzn
- gsw
- fry
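To check programmatically whether a particular ISO 639-3 code is covered by this checkpoint, you can look it up in the model's label mapping (a minimal sketch, reusing the model loaded above):

supported = set(model.config.id2label.values())
print(len(supported))        # 512
print("swe" in supported)    # True: Swedish appears in the list above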
Model details
| Property | Details |
|----------|---------|
| Developed by | Vineel Pratap et al. |
| Model type | Multilingual speech language identification (LID) model |
| Language(s) | 512 languages, see supported languages |
| License | CC-BY-NC 4.0 |
| Num parameters | 1 billion |
| Audio sampling rate | 16,000 Hz (16 kHz) |
| Cite as | @article{pratap2023mms,<br> title={Scaling Speech Technology to 1,000+ Languages},<br> author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},<br> journal={arXiv},<br> year={2023}<br>} |
Additional Links
License
This model is released under the CC-BY-NC 4.0 license.