# Massively Multilingual Speech (MMS) - Fine-tuned LID
This repository provides a fine-tuned model for speech language identification (LID). It is part of Facebook's Massively Multilingual Speech (MMS) project and is based on the Wav2Vec2 architecture. The model has roughly 1 billion parameters, is fine-tuned from facebook/mms-1b, and classifies raw audio input into a probability distribution over 126 languages.
## Quick Start
This MMS checkpoint can be used with Transformers to identify the spoken language of an audio sample. It can recognize 126 languages.
### Installation
First, we need to install some necessary libraries:
```bash
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```
### ⚠️ Important Note
In order to use MMS you need to have at least `transformers >= 4.30` installed. If the 4.30 version is not yet available on PyPI, make sure to install `transformers` from source:

```bash
pip install git+https://github.com/huggingface/transformers.git
```
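If you want to verify programmatically that the requirement is met, a minimal sanity check (not part of the original instructions; it uses the packaging library, which ships as a transformers dependency) looks like this:

```python
import transformers
from packaging import version

# MMS support landed in transformers 4.30, so fail early on older installs.
assert version.parse(transformers.__version__) >= version.parse("4.30.0"), (
    f"transformers {transformers.__version__} is too old for MMS; please upgrade."
)
```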
### Usage Examples
#### Basic Usage
Next, we load a couple of audio samples, the model, and the processor, and then classify each sample's language.
```python
from datasets import load_dataset, Audio

# English sample from Common Voice 13
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Arabic sample from Common Voice 13
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
```
```python
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-126"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

# Classify the English sample
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# expected: 'eng'

# Classify the Arabic sample
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# expected: 'ara'
```
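Since the classification head produces logits over all 126 languages, you can also look at the full probability distribution instead of just the argmax. The following is a small illustrative sketch (not from the original card) that applies a softmax to the logits of the last processed sample and prints the five most likely languages:

```python
# Convert the logits of the last sample into probabilities
# and list the five most likely languages with their scores.
probs = torch.softmax(outputs, dim=-1)[0]
top_probs, top_ids = torch.topk(probs, k=5)

for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```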
To see all the languages supported by a checkpoint, you can print out the language ids stored in the model config (note that the labels live on the model config, not on the feature extractor):

```python
model.config.id2label.values()
```
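For example, to count the labels and print them in alphabetical order (a small illustrative snippet):

```python
langs = sorted(model.config.id2label.values())
print(len(langs))  # 126
print(langs[:5])   # first few ISO 639-3 codes, e.g. ['abk', 'afr', 'amh', ...]
```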
For more details about the architecture, please refer to the official docs.
## ✨ Features

- **Multilingual Support**: The model supports 126 languages, providing wide-ranging language identification capabilities.
- **Based on Wav2Vec2**: It leverages the Wav2Vec2 architecture for accurate audio classification.
## Supported Languages
This model supports 126 languages, listed below by their ISO 639-3 codes. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview.
- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- fao
- glg
- ltz
- lao
- mlt
- sin
- sna
- ita
- srp
- mri
- nno
- pus
- eus
- ory
- lug
- bre
- luo
- slk
- fin
- dan
- yid
- est
- ceb
- war
- san
- kir
- oci
- wol
- haw
- kam
- umb
- xho
- epo
- zul
- ibo
- abk
- ckb
- nso
- gle
- kea
- ast
- sco
- glv
- ina
## Model details
| Property | Details |
|----------|---------|
| Developed by | Vineel Pratap et al. |
| Model type | Speech language identification (LID) model, based on Wav2Vec2 |
| Language(s) | 126 languages, see supported languages |
| License | CC-BY-NC 4.0 |
| Num parameters | 1 billion |
| Audio sampling rate | 16 kHz (16,000 Hz) |
| Cite as | @article{pratap2023mms, title={Scaling Speech Technology to 1,000+ Languages}, author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli}, journal={arXiv}, year={2023}} |
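As the table notes, the model expects audio sampled at 16 kHz. If you want to run LID on a local file at a different sampling rate, a hedged sketch using torchaudio (installed in the Quick Start) might look like the following; `my_audio.wav` is a placeholder path, and `processor` and `model` are the objects loaded in the usage example above:

```python
import torch
import torchaudio

# Load a local file ("my_audio.wav" is a placeholder) and resample to 16 kHz.
waveform, sr = torchaudio.load("my_audio.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)

# Downmix to mono, then classify exactly as in the usage example above.
sample = waveform.mean(dim=0).numpy()
inputs = processor(sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```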
## Additional Links