m-ctc-t-large Open-source Speech Recognition Model - Free Support for Speech Recognition and Transcription in 60 Languages

M-CTC-T Large

Published on the Hugging Face Hub by speechbrain

A large-scale multilingual speech recognition model introduced by Meta AI, supporting 60 languages, based on a 1-billion-parameter Transformer encoder architecture.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Multilingual Speech Recognition #Character-level Transcription #Large-scale Transformer

Downloads 88

Release Time: 5/27/2022

Model Overview

M-CTC-T is a multilingual speech recognition model capable of converting speech to text, supporting multiple languages while preserving punctuation and capitalization.

Model Features

Multilingual Support

Supports speech recognition for 60 languages, covering a wide range of linguistic needs.

Large-scale Training Data

Trained on the Common Voice (version 6.1) and VoxPopuli corpora.

Character-level Transcription

Uses unnormalized character-level transcription text, preserving punctuation and capitalization.

Model Capabilities

Speech Recognition

Multilingual Transcription

Character-level Text Generation

Use Cases

Speech Transcription

Multilingual Speech-to-Text

Converts speech in multiple languages to text, suitable for international application scenarios.

Character error rate (CER) on Common Voice, averaged over all languages: 21.4 on the validation set and 23.3 on the test set

🚀 M-CTC-T

A massively multilingual speech recognizer from Meta AI that transcribes speech in 60 languages.

🚀 Quick Start

M-CTC-T is a massively multilingual speech recognizer developed by Meta AI. The model is a 1B-param transformer encoder with a CTC head over 8065 character labels and a language identification head over 60 language ID labels. It is trained on Common Voice (version 6.1, December 2020 release) and VoxPopuli: after initial training on both corpora, it is trained further on Common Voice only. The labels are unnormalized character-level transcripts, meaning punctuation and capitalization are retained. The model takes Mel filterbank features computed from a 16 kHz audio signal as input.
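
To make the labeling convention concrete, here is a minimal sketch of how unnormalized character-level targets differ from normalized ones. The helper names and the normalization rule (lowercasing and dropping ASCII punctuation) are illustrative assumptions, not the tokenizer shipped with the model.

```python
import string
from typing import List


def char_labels(transcript: str) -> List[str]:
    """Split a transcript into character labels without any normalization."""
    return list(transcript)


def normalized_char_labels(transcript: str) -> List[str]:
    """Hypothetical normalization, shown only for contrast:
    lowercase the text and drop ASCII punctuation."""
    return [c for c in transcript.lower() if c not in string.punctuation]


text = "Hello, World!"
raw = char_labels(text)              # keeps "H", ",", "W" and "!"
norm = normalized_char_labels(text)  # loses case and punctuation

print(raw)
print(norm)
```

With unnormalized labels, the recognizer has to predict capital letters and punctuation marks as ordinary output characters, which is why they appear in its transcripts.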

The original Flashlight code, model checkpoints, and Colab notebook can be found at https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl.

✨ Features

Multilingual Support: Supports 60 languages, trained on Common Voice and VoxPopuli.
Transformer Encoder: Uses a 1B-param transformer encoder.
Dual Heads: Comes with a CTC head for character-level transcription and a language identification head.

📦 Installation

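The original model card gives no installation steps. The usage example below imports torch, torchaudio, datasets, and transformers, so those packages need to be available.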

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import MCTCTForCTC, MCTCTProcessor

model = MCTCTForCTC.from_pretrained("speechbrain/m-ctc-t-large")
processor = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# feature extraction
input_features = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt").input_features

# retrieve logits
with torch.no_grad():
    logits = model(input_features).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
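
The argmax-and-decode step above is greedy CTC decoding: pick the most likely label for each frame, merge consecutive repeats, then drop the blank label. The sketch below shows that collapse on plain Python lists. The blank index of 0 and the four-entry toy vocabulary are assumptions for illustration and do not reflect the model's actual 8065-label vocabulary; in practice `processor.batch_decode` is expected to take care of this step.

```python
from itertools import groupby
from typing import Dict, List


def ctc_greedy_collapse(frame_ids: List[int], blank_id: int = 0) -> List[int]:
    """Merge consecutive repeated ids, then remove the blank id."""
    merged = [label for label, _ in groupby(frame_ids)]
    return [label for label in merged if label != blank_id]


# Toy vocabulary (assumption): index 0 plays the role of the CTC blank.
vocab: Dict[int, str] = {0: "<blank>", 1: "H", 2: "i", 3: "!"}

# Per-frame argmax ids for one utterance, standing in for one row of
# torch.argmax(logits, dim=-1).
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3]

collapsed = ctc_greedy_collapse(frame_ids)
text = "".join(vocab[i] for i in collapsed)
print(text)
```

Note that a blank between two identical labels keeps them separate, so doubled letters survive the collapse.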

Results

Character error rate (CER) on Common Voice, averaged over all languages:

Valid	Test
21.4	23.3
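
For reference, character error rate is the character-level edit distance between a hypothesis and its reference transcript, divided by the reference length, and it is often reported as a percentage. The following is a minimal sketch of that computation. The function names are illustrative, and this is not the evaluation script used to produce the numbers above.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "Hello, World!"
hypothesis = "hello world"
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because the model's labels are unnormalized, case and punctuation mistakes count as errors in this kind of metric, as the toy pair above illustrates.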

📚 Documentation

Training Method

For more information on how the model was trained, see the paper cited below (Lugosch et al., 2022).

Citation

Paper

Authors: Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

@article{lugosch2021pseudo,
  title={Pseudo-Labeling for Massively Multilingual Speech Recognition},
  author={Lugosch, Loren and Likhomanenko, Tatiana and Synnaeve, Gabriel and Collobert, Ronan},
  journal={ICASSP},
  year={2022}
}

Contribution

A huge thanks to Chan Woo Kim for porting the model from Flashlight C++ to PyTorch.

Questions & Help

If you have questions regarding this model or need help, please consider opening a discussion or pull request on this repo and tagging @lorenlugosch, @cwkeam, or @patrickvonplaten.

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Model Type	Massively multilingual speech recognizer (1B-param transformer encoder)
Training Data	Common Voice (version 6.1, December 2020 release), VoxPopuli
