MahaDhwani Pretrained Conformer
A self-supervised, pre-trained Conformer encoder model trained on the MahaDhwani dataset, providing speech embeddings for downstream audio processing tasks.
Quick Start
To load, train, fine-tune, or experiment with the model, you will need to install AI4Bharat NeMo. We recommend installing it with the command below:
git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
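Once the installation finishes, a quick import check confirms the toolkit is available. This is a minimal sketch that only assumes the standard NeMo package layout:

# Sketch: verify the NeMo installation
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)  # prints the installed NeMo version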
Features
Language
The model is trained on data from the 22 scheduled languages of India.
Input
This model accepts 16 kHz (16000 Hz), mono-channel audio (WAV files) as input.
Output
This model produces Conformer encoder embeddings as output for a given audio sample.
Documentation
Model Architecture
The encoder is a Conformer-Large model with roughly 120M parameters, consisting of 17 Conformer blocks with a model dimension of 512.
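As a sanity check, the parameter count can be inspected once the model is loaded. This is a minimal sketch that assumes the loaded model exposes its encoder as model.encoder, as NeMo encoder-decoder models typically do; the pretrained-model name is taken from the usage example below:

# Sketch: load the checkpoint and count encoder parameters
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/MahaDhwani_pretrained_conformer")
num_params = sum(p.numel() for p in model.encoder.parameters())
print(f"Encoder parameters: {num_params / 1e6:.1f}M")  # expected to be on the order of 120M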
Usage Examples
Basic Usage
Download and load the model from Hugging Face.
import pydub
import numpy as np
import torch
import nemo.collections.asr as nemo_asr

# Download the pretrained encoder from Hugging Face
model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/MahaDhwani_pretrained_conformer")

# Use a GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Freeze the weights for inference and move the model to the chosen device
model.freeze()
model = model.to(device)
Prepare an audio file by running the command below in your terminal. This converts the audio to 16000 Hz, mono-channel WAV.
ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
Advanced Usage
# Load the audio, resample to 16 kHz, and downmix to a single channel
wavpath = 'sample_audio_infer_ready.wav'
wav = pydub.AudioSegment.from_file(wavpath).set_frame_rate(16000).set_channels(1)
sarray = wav.get_array_of_samples()

# Convert the samples to a (1, num_samples) float tensor and move it to the model's device
fp_arr = np.array(sarray).astype(np.float64).reshape((1, -1))
feature = torch.from_numpy(fp_arr).float().to(device)
length = torch.tensor([fp_arr.shape[1]]).to(device)

# Forward pass through the self-supervised model to obtain encoder embeddings
spectrograms, spec_masks, encoded, encoded_len = model(input_signal=feature, input_signal_length=length)
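The encoded tensor holds the frame-level Conformer embeddings. The follow-up below is a minimal sketch that assumes the usual NeMo Conformer layout of (batch, d_model, num_frames); it masks padded frames and mean-pools over time to get one utterance-level vector:

# Sketch: encoded is assumed to have shape (batch, d_model, num_frames)
print(encoded.shape, encoded_len)

# Mean-pool the valid frames of the first utterance into a single embedding
num_valid = int(encoded_len[0])
utt_embedding = encoded[0, :, :num_valid].mean(dim=-1)  # shape: (d_model,)
print(utt_embedding.shape)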
License
This project is licensed under the MIT license.