# 🌍 Massively Multilingual Speech (MMS) - Fine-tuned LID
This project offers a fine-tuned model for spoken language identification (LID), released as part of Facebook's Massively Multilingual Speech (MMS) project. It classifies raw audio input into one of 256 languages, providing a powerful tool for multilingual speech processing.
## 🚀 Quick Start
This MMS checkpoint can be used with Transformers to identify the spoken language of an audio sample. It can recognize any of the 256 languages listed under supported languages below.
Let's look at a simple example.
### Basic Usage
First, we install transformers and a few other libraries:

```bash
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```

Note: In order to use MMS you need at least transformers >= 4.30 installed. If the 4.30 version is not yet available on PyPI, make sure to install transformers from source:

```bash
pip install git+https://github.com/huggingface/transformers.git
```
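If you want to check programmatically that the installed version is new enough, here is a minimal sketch (the 4.30 threshold comes from the note above; packaging ships as a transformers dependency):

```python
import transformers
from packaging import version

# MMS support requires transformers >= 4.30; fail early with a clear message otherwise.
if version.parse(transformers.__version__) < version.parse("4.30.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for MMS; please upgrade."
    )
```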
Next, we load a couple of audio samples via datasets. Make sure that the audio data is sampled at 16,000 Hz.
```python
from datasets import load_dataset, Audio

# English sample
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Arabic sample
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
```
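To classify your own recordings instead, resample them to 16 kHz first, for example with torchaudio (installed above). A minimal sketch, where "my_audio.wav" is a placeholder path for your own file:

```python
import torchaudio

# Load a local file and resample it to the 16 kHz the model expects.
waveform, sr = torchaudio.load("my_audio.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)
my_sample = waveform.mean(dim=0).numpy()  # mono 1-D array, like the datasets samples above
```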
Next, we load the model and processor:

```python
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-256"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
```
Now we process the audio data and pass it to the model for classification into a language, just like we usually do for Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition:
```python
# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'

# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
```
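Because the logits cover all 256 classes, you can also turn them into a probability distribution to inspect the model's confidence. A minimal sketch reusing outputs from the snippet above:

```python
# Softmax over the 256 language classes yields one probability per language.
probs = torch.softmax(outputs, dim=-1)[0]
top_probs, top_ids = torch.topk(probs, k=5)
for p, i in zip(top_probs, top_ids):
    print(f"{model.config.id2label[i.item()]}: {p.item():.3f}")
```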
To see all the supported languages of a checkpoint, you can print out the language ids as follows:

```python
model.config.id2label.values()
```
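For example, to check whether a particular ISO 639-3 code is covered by this checkpoint:

```python
# True if the checkpoint was trained to recognize the language.
print("eng" in model.config.id2label.values())
```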
For more details about the architecture, please have a look at the official docs.
## ✨ Features
- Multilingual Support: This model classifies raw audio input into a probability distribution over 256 output classes, each representing a language (see the pipeline sketch after this list).
- Based on Wav2Vec2: It is built on the Wav2Vec2 architecture, which provides a solid foundation for speech processing.
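If you only need the top-scoring labels, the generic audio-classification pipeline in transformers wraps the same pre- and post-processing. A minimal sketch, assuming en_sample from the Quick Start (a file path also works):

```python
from transformers import pipeline

# The checkpoint is a Wav2Vec2ForSequenceClassification model, so the
# audio-classification pipeline can serve it directly.
classifier = pipeline("audio-classification", model="facebook/mms-lid-256")
print(classifier(en_sample, top_k=3))  # e.g. [{'label': 'eng', 'score': ...}, ...]
```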
## 📦 Installation

```bash
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```

If the 4.30 version of transformers is not yet available on PyPI, install it from source:

```bash
pip install git+https://github.com/huggingface/transformers.git
```
## 📚 Documentation

### Supported Languages

This model supports 256 languages. The list below shows all supported languages of this checkpoint in ISO 639-3 codes. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview.
- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- kmr
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kik
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- fao
- glg
- ltz
- xog
- lao
- mlt
- sin
- aka
- sna
- ita
- srp
- mri
- nno
- pus
- eus
- ory
- lug
- bre
- luo
- slk
- ewe
- fin
- rif
- dan
- yid
- yao
- mos
- hne
- est
- dyu
- bam
- uig
- sck
- tso
- mup
- ctg
- ceb
- war
- bbc
- vmw
- sid
- tpi
- mag
- san
- kri
- lon
- kir
- run
- ubl
- kin
- rkt
- xmm
- tir
- mai
- nan
- nyn
- bcc
- hak
- suk
- bem
- rmy
- awa
- pcm
- bgc
- shn
- oci
- wol
- bci
- kab
- ilo
- bcl
- haw
- mad
- nod
- sag
- sas
- jam
- mey
- shi
- hil
- ace
- kam
- min
- umb
- hno
- ban
- syl
- bxg
- xho
- mww
- epo
- tzm
- zul
- ibo
- abk
- guz
- ckb
- knc
- nso
- bho
- dje
- tiv
- gle
- lua
- skr
- bto
- kea
- glk
- ast
- sat
- ktu
- bhb
- emk
- kng
- kmb
- tsn
- gom
- ven
- sco
- glv
- sot
- sou
- gno
- nde
- bjn
- ina
- fmu
- esg
- wes
- pnb
- phr
- mui
- bug
- mrr
- kas
- lir
- vah
- ssw
- rwr
- pcc
- hms
- wbr
- swv
- mtr
- haz
- aii
- bns
- msi
- wuu
- hsn
- bgp
- tts
- lmn
- dcc
- bew
- bjj
- ibb
- tji
- hoj
- cpx
- cdo
- daq
- mut
- nap
- czh
- gdx
- sdh
- scn
- mnp
- bar
- mzn
- gsw
### Model details

| Property | Details |
|----------|---------|
| Developed by | Vineel Pratap et al. |
| Model type | Multi-lingual Automatic Speech Recognition model |
| Language(s) | 256 languages, see supported languages |
| License | CC BY-NC 4.0 license |
| Num parameters | 1 billion |
| Audio sampling rate | 16,000 Hz |
| Cite as | `@article{pratap2023mms, title={Scaling Speech Technology to 1,000+ Languages}, author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli}, journal={arXiv}, year={2023}}` |
## 📄 License

This project is licensed under the CC BY-NC 4.0 license.