Massively Multilingual Speech (MMS) - Finetuned LID
This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's Massively Multilingual Speech project. It maps raw audio input to a probability distribution over 1024 output classes, each representing a language.
Quick Start
This MMS checkpoint can be used with Transformers to identify the spoken language of an audio clip. It can recognize 1024 languages, listed under Supported Languages.
First, install transformers and some other libraries:
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
Note: To use MMS you need transformers >= 4.30 installed. If version 4.30 is not yet available on PyPI, install transformers from source:
pip install git+https://github.com/huggingface/transformers.git
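Before loading the model, it may be worth confirming that the installed version meets the 4.30 requirement. The helper below is only a sketch and is not part of the MMS instructions: `parse_version`, `meets_minimum`, and `installed_transformers_version` are hypothetical names, and the parser assumes plain numeric release strings such as `4.30.2`, stopping at the first non-numeric part (so a suffix like `.dev0` is ignored).

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '4.31.0.dev0' into (4, 31, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "4.30") -> bool:
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad to equal length so (4, 30) and (4, 30, 0) compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_transformers_version():
    """Return the installed transformers version string, or None if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    else:
        print(found, "OK" if meets_minimum(found) else "too old for MMS")
```

If the check reports a version that is too old, upgrading with the pip commands above (or installing from source) should resolve it.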
Next, load a couple of audio samples via datasets. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).
from datasets import load_dataset, Audio
# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]
# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
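Since the model expects 16 kHz input, it can help to verify each decoded sample before inference. The following is a minimal sketch, assuming the decoded `audio` entry is a dict with `array` and `sampling_rate` keys, as the `datasets` `Audio` feature returns. `check_audio` and `EXPECTED_SAMPLING_RATE` are hypothetical names introduced here for illustration.

```python
EXPECTED_SAMPLING_RATE = 16_000  # Hz; the rate this checkpoint expects


def check_audio(audio: dict) -> float:
    """Validate the sampling rate and return the clip duration in seconds."""
    rate = audio["sampling_rate"]
    if rate != EXPECTED_SAMPLING_RATE:
        raise ValueError(
            f"expected {EXPECTED_SAMPLING_RATE} Hz audio, got {rate} Hz; "
            "resample first, e.g. with cast_column as shown above"
        )
    # Number of samples divided by samples per second gives seconds.
    return len(audio["array"]) / rate
```

With the streaming dataset above, this could be called as `check_audio(next(iter(stream_data))["audio"])` before extracting the `array`.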
Then, load the model and processor:
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
model_id = "facebook/mms-lid-1024"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
Now process the audio data and pass it to the model to classify the language, just as for other Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition:
# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'
# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
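The `argmax` above keeps only the single best class. To inspect the probability distribution itself, the logits can be passed through a softmax and ranked. The sketch below does this in plain Python, assuming the logits have first been converted to a list of floats (for example with `outputs[0].tolist()`) and that `id2label` maps integer class indices to language codes, as `model.config.id2label` does. `top_k_languages` is a hypothetical helper name, not part of Transformers.

```python
import math


def top_k_languages(logits: list, id2label: dict, k: int = 5) -> list:
    """Apply a softmax to raw logits and return the k most probable languages."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]
```

With the variables from the example above, `top_k_languages(outputs[0].tolist(), model.config.id2label, k=3)` would return the three most likely languages with their probabilities, which is useful for spotting low-confidence predictions.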
To see all the supported languages of a checkpoint, you can print out the language ids as follows:
model.config.id2label.values()
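Building on that mapping, one might check up front whether particular ISO 639-3 codes are covered by the checkpoint. This is a small sketch, assuming `id2label` is a dict from class index to language code like `model.config.id2label`. `unsupported_codes` is a hypothetical helper, and the tiny dict in the example stands in for the real 1024-entry mapping.

```python
def unsupported_codes(id2label: dict, codes: list) -> list:
    """Return the requested codes that the label mapping does not cover."""
    supported = set(id2label.values())
    # Normalize case and whitespace so 'ENG ' is treated like 'eng'.
    return [code for code in codes if code.strip().lower() not in supported]


if __name__ == "__main__":
    # Stand-in mapping; with the real model pass model.config.id2label.
    demo_labels = {0: "ara", 1: "cmn", 2: "eng"}
    print(unsupported_codes(demo_labels, ["eng", "ara", "xyz"]))  # ['xyz']
```

An empty result means every requested language is among the model's output classes.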
For more details about the architecture, please have a look at the official docs.
Usage Examples
Basic Usage
# The following is the complete example code for language identification
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
# Load audio samples
# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]
# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
# Load the model and processor
model_id = "facebook/mms-lid-1024"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
# Process English audio data
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
print(f"Detected English language: {detected_lang}")
# Process Arabic audio data
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
print(f"Detected Arabic language: {detected_lang}")
Advanced Usage
# If you want to process multiple audio files in a loop, you can use the following code
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
model_id = "facebook/mms-lid-1024"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
languages = ["en", "ar", "fr"] # List of languages to test
for lang in languages:
    # Stream one test sample for this language, resampled to 16 kHz
    stream_data = load_dataset("mozilla-foundation/common_voice_13_0", lang, split="test", streaming=True)
    stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
    sample = next(iter(stream_data))["audio"]["array"]
    inputs = processor(sample, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs).logits
    lang_id = torch.argmax(outputs, dim=-1)[0].item()
    detected_lang = model.config.id2label[lang_id]
    print(f"Detected {lang} language: {detected_lang}")
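Note that Common Voice configuration names such as `en` are two-letter codes, while the model returns ISO 639-3 codes such as `eng`, so the printed pairs will not match literally. One way to compare them is a small lookup table. The sketch below is an assumption for illustration: `CV_TO_ISO639_3` and `is_correct` are hypothetical names, and the table covers only the three languages used in the loop.

```python
# Common Voice config name -> ISO 639-3 code returned by the model.
# Only the three languages from the loop above; extend as needed.
CV_TO_ISO639_3 = {"en": "eng", "ar": "ara", "fr": "fra"}


def is_correct(cv_code: str, detected_lang: str) -> bool:
    """True if the detected ISO 639-3 code matches the Common Voice language."""
    expected = CV_TO_ISO639_3.get(cv_code)
    if expected is None:
        raise KeyError(f"no ISO 639-3 mapping known for {cv_code!r}")
    return detected_lang == expected
```

Inside the loop, `is_correct(lang, detected_lang)` could then be printed alongside the detected code to flag misclassified samples.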
Features
- Multilingual Support: This model supports 1024 languages, enabling speech language identification across a wide range of languages.
- Based on Wav2Vec2 Architecture: Utilizes the powerful Wav2Vec2 architecture for efficient audio processing and classification.
Installation
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
Note: To use MMS you need transformers >= 4.30 installed. If version 4.30 is not yet available on PyPI, install transformers from source:
pip install git+https://github.com/huggingface/transformers.git
Documentation
For more details about the architecture, please refer to the official docs.
Technical Details
This checkpoint is a fine-tuned model for speech language identification (LID). It is based on the Wav2Vec2 architecture and maps raw audio input to a probability distribution over 1024 output classes, with each class representing a language. The model has been fine-tuned from facebook/mms-1b on 1024 languages.
License
This model is released under the CC-BY-NC 4.0 license.
Supported Languages
This model supports 1024 languages. The list below gives every supported language of this checkpoint as an ISO 639-3 code. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview.
- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- kmr
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kik
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- cak
- fao
- glg
- ltz
- xog
- lao
- mlt
- sin
- aka
- sna
- che
- mam
- ita
- quc
- aiw
- srp
- mri
- tuv
- nno
- pus
- eus
- kbp
- gur
- ory
- lug
- crh
- bre
- luo
- nhx
- slk
- ewe
- xsm
- fin
- rif
- dan
- saq
- yid
- yao
- mos
- quh
- hne
- xon
- new
- dtp
- quy
- est
- ddn
- dyu
- ttq
- bam
- pse
- uig
- sck
- ngl
- tso
- mup
- dga
- seh
- lis
- wal
- ctg
- mip
- bfz
- bxk
- ceb
- kru
- war
- khg
- bbc
- thl
- nzi
- vmw
- mzi
- ycl
- zne
- sid
- asa
- tpi
- bmq
- box
- zpu
- gof
- nym
- cla
- bgq
- bfy
- hlb
- qxl
- teo
- fon
- sda
- kfx
- bfa
- mag
- tzh
- pil
- maj
- maa
- kdt
- ksb
- lns
- btd
- rej
- pap
- ayr
- any
- mnk
- adx
- gud
- krc
- onb
- xal
- ctd
- nxq
- ava
- blt
- lbw
- hyw
- udm
- zar
- tzo
- kpv
- san
- xnj
- kek
- chv
- kcg
- kri
- ati
- bgw
- mxt
- ybb
- btx
- dgi
- nhy
- dnj
- zpz
- yba
- lon
- smo
- men
- ium
- mgd
- taq
- nga
- nsu
- zaj
- tly
- prk
- zpt
- akb
- mhr
- mxb
- nuj
- obo
- kir
- bom
- run
- zpg
- hwc
- mnw
- ubl
- kin
- xtm
- hnj
- mpm
- rkt
- miy
- luc
- mih
- kne
- mib
- flr
- myv
- xmm
- knk
- iba
- gux
- pis
- zmz
- ses
- dav
- lif
- qxr
- dig
- kdj
- wsg
- tir
- gbm
- mai
- zpc
- kus
- nyy
- mim
- nan
- nyn
- gog
- ngu
- tbz
- hoc
- nyf
- sus
- guk
- gwr
- yaz
- bcc
- sbd
- spp
- hak
- grt
- kno
- oss
- suk
- spy
- nij
- lsm
- kaa
- bem
- rmy
- kqn
- nim
- ztq
- nus
- bib
- xtd
- ach
- mil
- keo
- mpg
- gjn
- zaq
- kdh
- dug
- sah
- awa
- kff
- dip
- rim
- nhe
- pcm
- kde
- tem
- quz
- mfq
- las
- bba
- kbr
- taj
- dyo
- zao
- lom
- shk
- dik
- dgo
- zpo
- fij
- bgc
- xnr
- bud
- kac
- laj
- mev
- maw
- quw
- kao
- dag
- ktb
- lhu
- zab
- mgh
- shn
- otq
- lob
- pbb
- oci
- zyb
- bsq
- mhi
- dzo
- zas
- guc
- alz
- ctu
- wol
- guw
- mnb
- nia
- zaw
- mxv
- bci
- sba
- kab
- dwr
- nnb
- ilo
- mfe
- srx
- ruf
- srn
- zad
- xpe
- pce
- ahk
- bcl
- myk
- haw
- mad
- ljp
- bky
- gmv
- nag
- nav
- nyo
- kxm
- nod
- sag
- zpl
- sas
- myx
- sgw
- old
- irk
- acf
- mak
- kfy
- zai
- mie
- zpm
- zpi
- ote
- jam
- kpz
- lgg
- lia
- nhi
- mzm
- bdq
- xtn
- mey
- mjl
- sgj
- kdi
- kxc
- miz
- adh
- tap
- hay
- kss
- pam
- gor
- heh
- nhw
- ziw
- gej
- yua
- itv
- shi
- qvw
- mrw
- hil
- mbt
- pag
- vmy
- lwo
- cce
- kum
- klu
- ann
- mbb
- npl
- zca
- pww
- toc
- ace
- mio
- izz
- kam
- zaa
- krj
- bts
- eza
- zty
- hns
- kki
- min
- led
- alw
- tll
- rng
- pko
- toi
- iqw
- ncj
- toh
- umb
- mog
- hno
- wob
- gxx
- hig
- nyu
- kby
- ban
- syl
- bxg
- nse
- xho
- zae
- mkw
- nch
- ibg
- mas
- qvz
- bum
- bgd
- mww
- epo
- tzm
- zul
- bcq
- lrc
- xdy
- tyv
- ibo
- loz
- mza
- abk
- azz
- guz
- arn
- ksw
- lus
- tos
- gvr
- top
- ckb
- mer
- pov
- lun
- rhg
- knc
- sfw
- bev
- tum
- lag
- nso
- bho
- ndc
- maf
- gkp
- bax
- awn
- ijc
- qug
- lub
- srr
- mni
- zza
- ige
- dje
- mkn
- bft
- tiv
- otn
- kck
- kqs
- gle
- lua
- pdt
- swk
- mgw
- ebu
- ada
- lic
- skr
- gaa
- mfa
- vmk
- mcn
- bto
- lol
- bwr
- unr
- dzg
- hdy
- kea
- bhi
- glk
- mua
- ast
- nup
- sat
- ktu
- bhb
- zpq
- coh
- bkm
- gya
- sgc
- dks
- ncl
- tui
- emk
- urh
- ego
- ogo
- tsc
- idu
- igb
- ijn
- njz
- ngb
- tod
- jra
- mrt
- zav
- tke
- its
- ady
- bzw
- kng
- kmb
- lue
- jmx
- tsn
- bin
- ble
- gom
- ven
- sef
- sco
- her
- iso
- trp
- glv
- haq
- toq
- okr
- kha
- wof
- rmn
- sot
- kaj
- bbj
- sou
- mjt
- trd
- gno
- mwn
- igl
- rag
- eyo
- div
- efi
- nde
- mfv
- mix
- rki
- kjg
- fan
- khw
- wci
- bjn
- pmy
- bqi
- ina
- hni
- mjx
- kuj
- aoz
- the
- tog
- tet
- nuz
- ajg
- ccp
- mau
- ymm
- fmu
- tcz
- xmc
- nyk
- ztg
- knx
- snk
- zac
- esg
- srb
- thq
- pht
- wes
- rah
- pnb
- ssy
- zpv
- kpo
- phr
- atd
- eto
- xta
- mxx
- mui
- uki
- tkt
- mgp
- xsq
- enq
- nnh
- qxp
- zam
- bug
- bxr
- maq
- tdt
- khb
- mrr
- kas
- zgb
- kmw
- lir
- vah
- dar
- ssw
- hmd
- jab
- iii
- peg
- shr
- brx
- rwr
- bmb
- kmc
- mji
- dib
- pcc
- nbe
- mrd
- ish
- kai
- yom
- zyn
- hea
- ewo
- bas
- hms
- twh
- kfq
- thr
- xtl
- wbr
- bfb
- wtm
- mjc
- blk
- lot
- dhd
- swv
- wbm
- zzj
- kge
- mgm
- niq
- zpj
- bwx
- bde
- mtr
- gju
- kjp
- mbz
- haz
- lpo
- yig
- qud
- shy
- gjk
- ztp
- nbl
- aii
- kun
- say
- mde
- sjp
- bns
- brh
- ywq
- msi
- anr
- mrg
- mjg
- tan
- tsg
- tcy
- kbl
- mdr
- mks
- noe
- tyz
- zpa
- ahr
- aar
- wuu
- khr
- kbd
- kex
- bca
- nku
- pwr
- hsn
- ort
- ott
- swi
- kua
- tdd
- msm
- bgp
- nbm
- mxy
- abs
- zlj
- ebo
- lea
- dub
- sce
- xkb
- vav
- bra
- ssb
- sss
- nhp
- kad
- kvx
- lch
- tts
- zyj
- kxp
- lmn
- qvi
- lez
- scl
- cqd
- ayb
- xbr
- nqg
- dcc
- cjk
- bfr
- zyg
- mse
- gru
- mdv
- bew
- wti
- arg
- dso
- zdj
- pll
- mig
- qxs
- bol
- drs
- anp
- chw
- bej
- vmc
- otx
- xty
- bjj
- vmz
- ibb
- gby
- twx
- tig
- thz
- tku
- hmz
- pbm
- mfn
- nut
- cyo
- mjw
- cjm
- tlp
- naq
- rnd
- stj
- sym
- jax
- btg
- tdg
- sng
- nlv
- kvr
- pch
- fvr
- mxs
- wni
- mlq
- kfr
- mdj
- osi
- nhn
- ukw
- tji
- qvj
- nih
- bcy
- hbb
- zpx
- hoj
- cpx
- ogc
- cdo
- bgn
- bfs
- vmx
- tvn
- ior
- mxa
- btm
- anc
- jit
- mfb
- mls
- ets
- goa
- bet
- ikw
- pem
- trf
- daq
- max
- rad
- njo
- bnx
- mxl
- mbi
- nba
- zpn
- zts
- mut
- hnd
- mta
- hav
- hac
- ryu
- abr
- yer
- cld
- zag
- ndo
- sop
- vmm
- gcf
- chr
- cbk
- sbk
- bhp
- odk
- mbd
- nap
- gbr
- mii
- czh
- xti
- vls
- gdx
- sxw
- zaf
- wem
- mqh
- ank
- yaf
- vmp
- otm
- sdh
- anw
- src
- mne
- wss
- meh
- kzc
- tma
- ttj
- ots
- ilp
- zpr
- saz
- ogb
- akl
- nhg
- pbv
- rcf
- cgg
- mku
- bez
- mwe
- mtb
- gul
- ifm
- mdh
- scn
- lki
- xmf
- sgd
- aba
- cos
- luz
- zpy
- stv
- kjt
- mbf
- kmz
- nds
- mtq
- tkq
- aee
- knn
- mbs
- mnp
- ema
- bar
- unx
- plk
- psi
- mzn
- cja
- sro
- mdw
- ndh
- vmj
- zpw
- kfu
- bgx
- gsw
- fry
- zpe
- zpd
- bta
- psh
- zat
Model details
| Property | Details |
|---|---|
| Developed by | Vineel Pratap et al. |
| Model Type | Multi-Lingual Automatic Speech Recognition model |
| Language(s) | 1024 languages, see supported languages |
| License | CC-BY-NC 4.0 license |
| Num parameters | 1 billion |
| Audio sampling rate | 16,000 Hz (16 kHz) |
| Cite as | @article{pratap2023mms, title={Scaling Speech Technology to 1,000+ Languages}, author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli}, journal={arXiv}, year={2023} } |