The open-source speech language identification model mms-lid-4017 can be deployed for free and recognizes 4017 languages.

MMS-LID-4017

Developed by Facebook

This is a speech language identification model based on the Wav2Vec2 architecture, capable of recognizing 4017 languages, and is part of Facebook's Massively Multilingual Speech project.

Audio Classification

Transformers

Supports Multiple Languages #4017 language identification #1 billion parameter speech model #Multimodal speech processing

Downloads 3,721

Release Time : 6/13/2023

Model Overview

This model is used for speech language identification tasks, classifying raw audio input into probability distributions across 4017 languages.

Model Features

Extensive language support

Capable of identifying 4017 different languages, covering a large share of the world's languages

Large-scale pretraining

Fine-tuned from facebook/mms-1b, a 1-billion-parameter checkpoint based on the Wav2Vec2 architecture

High accuracy

Excellent performance on various language identification tasks

Model Capabilities

Speech language identification

Multilingual audio classification

Real-time language detection

Use Cases

Speech technology

Multilingual voice assistants

Used to identify the language of user speech to switch the voice assistant's language mode

Improves voice assistant adaptability in multilingual environments
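As a rough illustration of the language-switching idea, the sketch below picks an assistant language from a set of per-language probabilities. Everything here is an assumption made for the example: the function name `choose_assistant_language`, the 0.6 confidence threshold, and the hard-coded probabilities, which in practice would come from the model's output.

```python
def choose_assistant_language(probs: dict[str, float],
                              available: set[str],
                              current: str,
                              threshold: float = 0.6) -> str:
    """Switch language only when a supported language is detected confidently."""
    # Ignore languages the assistant cannot speak.
    candidates = {code: p for code, p in probs.items() if code in available}
    if not candidates:
        return current
    best = max(candidates, key=candidates.get)
    # Keep the current language when the detection is not confident enough.
    return best if candidates[best] >= threshold else current


# Hypothetical probabilities keyed by ISO 639-3 code.
confident = {"eng": 0.10, "ara": 0.85, "fra": 0.05}
uncertain = {"eng": 0.40, "ara": 0.45, "fra": 0.15}

print(choose_assistant_language(confident, {"eng", "ara"}, current="eng"))
print(choose_assistant_language(uncertain, {"eng", "ara"}, current="eng"))
```

Keeping the current language on low-confidence input avoids flipping modes on short or noisy utterances; the right threshold depends on the application.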

Content classification

Language classification of audio content

Helps content platforms automatically categorize multilingual audio content

Research applications

Linguistic research

Used for analyzing language distribution and language identification research

Supports large-scale linguistic research projects

🚀 Massively Multilingual Speech (MMS) - Finetuned LID

This model is fine-tuned for speech language identification (LID), enabling accurate recognition of spoken languages across a wide range of linguistic diversity.

🚀 Quick Start

This MMS checkpoint can be used with Transformers to identify the spoken language of an audio clip. It can recognize 4017 languages.

First, we install transformers and some other libraries:

pip install torch accelerate torchaudio datasets
pip install --upgrade transformers

Note: To use MMS you need transformers >= 4.30. If version 4.30 is not yet available on PyPI, install transformers from source:

pip install git+https://github.com/huggingface/transformers.git
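To check the version requirement programmatically, a small helper along these lines could be used. The name `meets_minimum` is made up for this example, and the parsing is deliberately simple: it only compares the leading major and minor numbers, so it is a sketch rather than a full version parser.

```python
import re


def meets_minimum(installed: str, minimum: str = "4.30") -> bool:
    """Return True if `installed` is at least `minimum` (major.minor only)."""
    def major_minor(v: str) -> tuple[int, int]:
        # Take the first two integer groups, e.g. "4.31.0.dev0" -> (4, 31).
        parts = re.findall(r"\d+", v)
        major = int(parts[0])
        minor = int(parts[1]) if len(parts) > 1 else 0
        return (major, minor)

    return major_minor(installed) >= major_minor(minimum)


# In a real environment the installed version string is available as
# transformers.__version__; fixed strings are used here for illustration.
print(meets_minimum("4.29.2"))       # older than the requirement
print(meets_minimum("4.31.0.dev0"))  # satisfies the requirement
```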

Next, we load a couple of audio samples via datasets. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).

from datasets import load_dataset, Audio

# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
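In the snippet above, casting the column with `Audio(sampling_rate=16000)` takes care of resampling. For audio that comes from elsewhere, the idea can be sketched in plain Python as below. This is an illustrative linear-interpolation resampler with no anti-aliasing filter, and `resample_linear` is a name invented for the example; a library routine such as `torchaudio.functional.resample` is the more usual choice in practice.

```python
def resample_linear(samples: list[float], orig_sr: int,
                    target_sr: int = 16_000) -> list[float]:
    """Resample a mono signal by linear interpolation (no anti-aliasing)."""
    if orig_sr == target_sr or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Blend the two neighbouring input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# One second of a 48 kHz ramp becomes one second at 16 kHz.
ramp_48k = [i / 48_000 for i in range(48_000)]
ramp_16k = resample_linear(ramp_48k, orig_sr=48_000)
print(len(ramp_16k))  # 16000
```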

Next, we load the model and processor:

from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-4017"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

Now we process the audio data and pass it to the model to classify the language, just as we would for other Wav2Vec2 audio classification models such as ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition:

# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'

# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
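The argmax above returns only the single most likely language. To inspect the probability distribution instead, the logits can be passed through a softmax and ranked. The sketch below does this in plain Python with made-up logits and a toy four-entry label mapping (both are assumptions for illustration); with real tensors the same step would typically use `torch.softmax(outputs, dim=-1)`.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy stand-ins: real logits have 4017 entries and the mapping comes from
# model.config.id2label.
toy_id2label = {0: "ara", 1: "eng", 2: "fra", 3: "deu"}
toy_logits = [0.2, 4.1, 1.3, 2.0]

for label, prob in top_k_languages(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

Looking at the runner-up probabilities is a simple way to spot uncertain predictions, for example on very short clips.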

To see all the supported languages of a checkpoint, you can print out the language ids as follows:

model.config.id2label.values()
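Given such an id-to-label mapping, checking whether particular languages are covered is a plain dictionary lookup. The helper names and the three-entry mapping below are hypothetical stand-ins; the real mapping has 4017 entries.

```python
def supported_codes(id2label: dict[int, str]) -> list[str]:
    """Return the sorted language codes found in an id-to-label mapping."""
    return sorted(set(id2label.values()))


def unsupported(requested: list[str], id2label: dict[int, str]) -> list[str]:
    """Return the requested codes that the mapping does not contain."""
    known = set(id2label.values())
    return [code for code in requested if code not in known]


# Toy mapping for illustration only.
toy_id2label = {0: "ara", 1: "eng", 2: "fra"}

print(supported_codes(toy_id2label))
print(unsupported(["eng", "xyz"], toy_id2label))
```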

For more details about the architecture, please have a look at the official docs.

✨ Features

Multilingual Support: This model supports 4017 languages, allowing for extensive language recognition across the globe.
Fine-Tuned for LID: It is fine-tuned for speech language identification, providing accurate results for spoken language classification.

📦 Installation

pip install torch accelerate torchaudio datasets
pip install --upgrade transformers

If the 4.30 version of transformers is not available on PyPI, install it from source:

pip install git+https://github.com/huggingface/transformers.git

💻 Usage Examples

Basic Usage

# First, install the necessary libraries (run these in a shell, not in Python):
#   pip install torch accelerate torchaudio datasets
#   pip install --upgrade transformers

# Load audio samples
from datasets import load_dataset, Audio
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Load model and processor
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch
model_id = "facebook/mms-lid-4017"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

# Process audio and classify language
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs).logits
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]

Advanced Usage

# Loop over several languages and classify one audio sample from each
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-4017"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)

languages = ["en", "ar"]
for lang in languages:
    stream_data = load_dataset("mozilla-foundation/common_voice_13_0", lang, split="test", streaming=True)
    stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
    sample = next(iter(stream_data))["audio"]["array"]
    inputs = processor(sample, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs).logits
    lang_id = torch.argmax(outputs, dim=-1)[0].item()
    detected_lang = model.config.id2label[lang_id]
    print(f"Detected language for {lang}: {detected_lang}")
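The loop above runs one clip per forward pass. To classify several clips in a single pass, clips of different lengths first have to be padded to a common length, with a mask marking which positions hold real audio. A Wav2Vec2 feature extractor can do this when it is called with a list of arrays and `padding=True`; the plain-Python sketch below only illustrates the idea, and `pad_batch` is a name invented for the example.

```python
def pad_batch(clips: list[list[float]], pad_value: float = 0.0):
    """Right-pad clips to the longest length; return values and a 0/1 mask."""
    max_len = max(len(c) for c in clips)
    padded, mask = [], []
    for clip in clips:
        n_pad = max_len - len(clip)
        padded.append(list(clip) + [pad_value] * n_pad)
        # 1 marks real samples, 0 marks padding.
        mask.append([1] * len(clip) + [0] * n_pad)
    return padded, mask


values, attention_mask = pad_batch([[0.1, 0.2, 0.3], [0.5]])
print(values)          # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0]]
print(attention_mask)  # [[1, 1, 1], [1, 0, 0]]
```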

📚 Documentation

This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's Massively Multilingual Speech project. It is based on the Wav2Vec2 architecture and classifies raw audio input into a probability distribution over 4017 output classes (each class representing a language). The checkpoint consists of 1 billion parameters and has been fine-tuned from facebook/mms-1b on 4017 languages.

🔧 Technical Details

Model Architecture: Based on the Wav2Vec2 architecture.
Parameters: The checkpoint consists of 1 billion parameters.
Fine-Tuning: Fine-tuned from facebook/mms-1b on 4017 languages.

📄 License

This model is licensed under cc-by-nc-4.0.

📋 Additional Information

Supported Languages

This model supports 4017 languages, each identified by its ISO 639-3 code. You can find more details about the languages and their ISO 639-3 codes in the MMS Language Coverage Overview.

Datasets

google/fleurs

Metrics
