Massively Multilingual Speech (MMS) - Fine-tuned LID
This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's Massively Multilingual Speech (MMS) project. It classifies raw audio input into a probability distribution over 512 output classes, each of which represents a language.
Quick Start
This MMS checkpoint can be used with Transformers to identify the spoken language of an audio clip. It can recognize any of the 512 languages listed in the Supported Languages section below.
Let's look at a simple example.
First, we install transformers and some other libraries:
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
Note: In order to use MMS you need to have at least transformers >= 4.30 installed. If the 4.30 version is not yet available on PyPI, make sure to install transformers from source:
pip install git+https://github.com/huggingface/transformers.git
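If you are unsure which version you ended up with, a quick sanity check (a minimal sketch) is to print the installed version:

import transformers
print(transformers.__version__)  # should be 4.30.0 or higher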
Next, we load a couple of audio samples via datasets. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).
from datasets import load_dataset, Audio
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
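If your audio comes from a local file instead of a datasets stream, you can load and resample it yourself. The sketch below uses torchaudio and a placeholder file name; any loader that produces a 16 kHz mono float array will work:

import torchaudio

# "my_audio.wav" is a placeholder path for illustration
waveform, sample_rate = torchaudio.load("my_audio.wav")
# resample to the 16 kHz rate the model expects
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
# collapse to mono and convert to a 1-D array
sample = waveform.mean(dim=0).numpy()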
Next, we load the model and processor
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
model_id = "facebook/mms-lid-512"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
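Inference also runs on a GPU if one is available. A minimal sketch, assuming a CUDA device, is to move the model (and later the processed inputs) to that device:

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# after processing: inputs = inputs.to(device)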
Now we process the audio data and pass it to the model, which classifies it into one of the language classes, just as for other Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition:
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
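Besides the single most likely label, you can turn the logits into a probability distribution and inspect the top candidates. A minimal sketch, reusing the outputs from the last forward pass above:

# convert logits to probabilities over the 512 language classes
probs = torch.softmax(outputs, dim=-1)[0]
top5 = torch.topk(probs, k=5)
for score, idx in zip(top5.values, top5.indices):
    print(f"{model.config.id2label[idx.item()]}: {score.item():.3f}")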
To see all the languages a checkpoint supports, you can print out the language labels as follows:
model.config.id2label.values()
For more details about the architecture, please have a look at the official docs.
Features
- Multilingual Support: the model covers 512 languages, providing wide-ranging spoken language recognition.
- Fine-tuned for LID: the checkpoint is fine-tuned specifically for speech language identification (classifying which language is being spoken).
Usage Examples
Basic Usage
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]
model_id = "facebook/mms-lid-512"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
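Since the clip above comes from the English split of Common Voice, the detected label should be the ISO 639-3 code for English:

print(detected_lang)  # expected to be 'eng' for an English-language clip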
Advanced Usage
# reuses the imports, processor and model from the Basic Usage example above
languages = ["en", "ar"]

for lang in languages:
    stream_data = load_dataset("mozilla-foundation/common_voice_13_0", lang, split="test", streaming=True)
    stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
    sample = next(iter(stream_data))["audio"]["array"]

    inputs = processor(sample, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs).logits

    lang_id = torch.argmax(outputs, dim=-1)[0].item()
    detected_lang = model.config.id2label[lang_id]
    print(f"Detected language for {lang}: {detected_lang}")
Documentation
Supported Languages
This model supports 512 languages. Click below to toggle the full list of languages supported by this checkpoint, given as ISO 639-3 codes. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview. A short snippet for checking a specific code programmatically follows the list.
- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- kmr
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kik
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- cak
- fao
- glg
- ltz
- xog
- lao
- mlt
- sin
- aka
- sna
- che
- mam
- ita
- quc
- srp
- mri
- tuv
- nno
- pus
- eus
- kbp
- ory
- lug
- bre
- luo
- nhx
- slk
- ewe
- fin
- rif
- dan
- yid
- yao
- mos
- quh
- hne
- xon
- new
- quy
- est
- dyu
- ttq
- bam
- pse
- uig
- sck
- ngl
- tso
- mup
- dga
- seh
- lis
- wal
- ctg
- bfz
- bxk
- ceb
- kru
- war
- khg
- bbc
- thl
- vmw
- zne
- sid
- tpi
- nym
- bgq
- bfy
- hlb
- teo
- fon
- kfx
- bfa
- mag
- ayr
- any
- mnk
- adx
- ava
- hyw
- san
- kek
- chv
- kri
- btx
- nhy
- dnj
- lon
- men
- ium
- nga
- nsu
- prk
- kir
- bom
- run
- hwc
- mnw
- ubl
- kin
- rkt
- xmm
- iba
- gux
- ses
- wsg
- tir
- gbm
- mai
- nyy
- nan
- nyn
- gog
- ngu
- hoc
- nyf
- sus
- bcc
- hak
- grt
- suk
- nij
- kaa
- bem
- rmy
- nus
- ach
- awa
- dip
- rim
- nhe
- pcm
- kde
- tem
- quz
- bba
- kbr
- taj
- dik
- dgo
- bgc
- xnr
- kac
- laj
- dag
- ktb
- mgh
- shn
- oci
- zyb
- alz
- wol
- guw
- nia
- bci
- sba
- kab
- nnb
- ilo
- mfe
- xpe
- bcl
- haw
- mad
- ljp
- gmv
- nyo
- kxm
- nod
- sag
- sas
- myx
- sgw
- mak
- kfy
- jam
- lgg
- nhi
- mey
- sgj
- hay
- pam
- heh
- nhw
- yua
- shi
- mrw
- hil
- pag
- cce
- npl
- ace
- kam
- min
- pko
- toi
- ncj
- umb
- hno
- ban
- syl
- bxg
- nse
- xho
- mkw
- nch
- mas
- bum
- mww
- epo
- tzm
- zul
- lrc
- ibo
- abk
- azz
- guz
- ksw
- lus
- ckb
- mer
- pov
- rhg
- knc
- tum
- nso
- bho
- ndc
- ijc
- qug
- lub
- srr
- mni
- zza
- dje
- tiv
- gle
- lua
- swk
- ada
- lic
- skr
- mfa
- bto
- unr
- hdy
- kea
- glk
- ast
- nup
- sat
- ktu
- bhb
- sgc
- dks
- ncl
- emk
- urh
- tsc
- idu
- igb
- its
- kng
- kmb
- tsn
- bin
- gom
- ven
- sef
- sco
- trp
- glv
- haq
- kha
- rmn
- sot
- sou
- gno
- igl
- efi
- nde
- rki
- kjg
- fan
- wci
- bjn
- pmy
- bqi
- ina
- hni
- the
- nuz
- ajg
- ymm
- fmu
- nyk
- snk
- esg
- thq
- pht
- wes
- pnb
- phr
- mui
- tkt
- bug
- mrr
- kas
- zgb
- lir
- vah
- ssw
- iii
- brx
- rwr
- kmc
- dib
- pcc
- zyn
- hea
- hms
- thr
- wbr
- bfb
- wtm
- blk
- dhd
- swv
- zzj
- niq
- mtr
- gju
- kjp
- haz
- shy
- nbl
- aii
- sjp
- bns
- brh
- msi
- tsg
- tcy
- kbl
- noe
- tyz
- ahr
- aar
- wuu
- kbd
- bca
- pwr
- hsn
- kua
- tdd
- bgp
- abs
- zlj
- ebo
- bra
- nhp
- tts
- zyj
- lmn
- cqd
- dcc
- cjk
- bfr
- bew
- arg
- drs
- chw
- bej
- bjj
- ibb
- tig
- nut
- jax
- tdg
- nlv
- pch
- fvr
- mlq
- kfr
- nhn
- tji
- hoj
- cpx
- cdo
- bgn
- btm
- trf
- daq
- max
- nba
- mut
- hnd
- ryu
- abr
- sop
- odk
- nap
- gbr
- czh
- vls
- gdx
- yaf
- sdh
- anw
- ttj
- nhg
- cgg
- ifm
- mdh
- scn
- lki
- luz
- stv
- kmz
- nds
- mtq
- knn
- mnp
- bar
- mzn
- gsw
- fry
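To check programmatically whether a particular ISO 639-3 code is covered by this checkpoint, you can look it up in the model's label mapping (a minimal sketch, reusing the model loaded above):

supported = set(model.config.id2label.values())
print(len(supported))        # 512
print("swe" in supported)    # True: Swedish appears in the list above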
Model details
| Property | Details |
|----------|---------|
| Developed by | Vineel Pratap et al. |
| Model type | Multilingual speech language identification (LID) model |
| Language(s) | 512 languages, see supported languages |
| License | CC-BY-NC 4.0 |
| Num parameters | 1 billion |
| Audio sampling rate | 16,000 Hz (16 kHz) |
| Cite as | @article{pratap2023mms,<br> title={Scaling Speech Technology to 1,000+ Languages},<br> author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},<br> journal={arXiv},<br> year={2023}<br>} |
Additional Links
License
This model is released under the CC-BY-NC 4.0 license.