🚀 M-CTC-T
A massively multilingual speech recognizer from Meta AI, providing high-performance speech recognition across a wide range of languages.
✨ Features
- Multilingual Support: Capable of recognizing speech in multiple languages.
- Model Architecture: A 1B-param transformer encoder with a CTC head over 8065 character labels and a language identification head over 60 language ID labels.
- Training Data: Trained on Common Voice (version 6.1, December 2020 release) and VoxPopuli, then on Common Voice only. The labels are unnormalized character-level transcripts.
- Input Requirement: Takes Mel filterbank features from a 16 kHz audio signal as input (a feature-extraction sketch follows this list).
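
To illustrate the input format, here is a minimal sketch of extracting Mel filterbank features with `torchaudio`. The 80 Mel bins and the file name `sample.wav` are assumptions for illustration, not values taken from this document; in practice, the `MCTCTFeatureExtractor` inside the processor handles this step for you.

```python
import torchaudio

# Load a waveform; `sample.wav` is a hypothetical mono recording.
waveform, sample_rate = torchaudio.load("sample.wav")
if sample_rate != 16_000:
    # The model expects 16 kHz input, so resample if needed.
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Kaldi-style log Mel filterbank features; 80 bins is an assumed value.
features = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, sample_frequency=16_000
)
print(features.shape)  # (num_frames, 80)
```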

The original Flashlight code, model checkpoints, and Colab notebook can be found at https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl.
📦 Installation
No model-specific installation is required: the model is used through the Hugging Face Transformers library, and `pip install transformers datasets` is enough to run the usage example below.
💻 Usage Examples
Basic Usage
```python
import torch
from datasets import load_dataset
from transformers import MCTCTForCTC, MCTCTProcessor

model = MCTCTForCTC.from_pretrained("speechbrain/m-ctc-t-large")
processor = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")

# load a dummy LibriSpeech split and read the first audio sample
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# extract Mel filterbank features from the 16 kHz waveform
input_features = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_features

# forward pass: logits over the 8065 character labels
with torch.no_grad():
    logits = model(input_features).logits

# greedy CTC decoding: take the argmax and map ids back to characters
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
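
`transcription[0]` then holds the predicted character-level transcript for the first sample.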
Advanced Usage
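As a hedged sketch, the checkpoint can in principle be driven through the Transformers `automatic-speech-recognition` pipeline, which bundles feature extraction, inference, and CTC decoding into one call. That this particular checkpoint is wired up for the pipeline is an assumption, and `sample.flac` is a hypothetical input file.

```python
from transformers import pipeline

# assumption: the M-CTC-T checkpoint is compatible with the ASR pipeline
asr = pipeline("automatic-speech-recognition", model="speechbrain/m-ctc-t-large")

# `sample.flac` is a hypothetical 16 kHz recording; long files can be
# transcribed in windows via the `chunk_length_s` argument
result = asr("sample.flac", chunk_length_s=30)
print(result["text"])
```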
Results on Common Voice, averaged over all languages, are reported as character error rate (CER) in the paper.
📚 Documentation
For more information on how the model was trained, please take a look at the official paper, [Pseudo-Labeling for Massively Multilingual Speech Recognition](https://arxiv.org/abs/2111.00161).
🔧 Technical Details
The model is a 1B-param transformer encoder with a CTC head over 8065 character labels and a language identification head over 60 language ID labels. It was trained on Common Voice (version 6.1, December 2020 release) and VoxPopuli, and takes Mel filterbank features from a 16 kHz audio signal as input. A configuration sketch follows below.
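
To inspect these architecture numbers programmatically, one can load the model configuration. This is a minimal sketch, assuming the checkpoint name used above; the printed attributes follow standard Transformers config conventions.

```python
from transformers import MCTCTConfig

# load the configuration of the (assumed) public checkpoint
config = MCTCTConfig.from_pretrained("speechbrain/m-ctc-t-large")

# vocab_size should correspond to the 8065 character labels
print(config.vocab_size)
# encoder width and depth of the ~1B-parameter model
print(config.hidden_size, config.num_hidden_layers)
```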
📄 License
This project is licensed under the Apache-2.0 license.
📚 Citation
Paper: Pseudo-Labeling for Massively Multilingual Speech Recognition
Authors: Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

```bibtex
@article{lugosch2021pseudo,
  title={Pseudo-Labeling for Massively Multilingual Speech Recognition},
  author={Lugosch, Loren and Likhomanenko, Tatiana and Synnaeve, Gabriel and Collobert, Ronan},
  journal={ICASSP},
  year={2022}
}
```
Additional thanks to Chan Woo Kim and Patrick von Platen for porting the model from Flashlight to PyTorch.
| Property | Details |
|----------|---------|
| Model Type | 1B-param transformer encoder with a CTC head over 8065 character labels and a language identification head over 60 language ID labels |
| Training Data | Common Voice (version 6.1, December 2020 release) and VoxPopuli; later trained on Common Voice only |