MahaDhwani Pretrained Conformer
A self-supervised, pre-trained Conformer encoder model trained on the MahaDhwani dataset, providing speech embeddings for downstream audio processing tasks.
Quick Start
To load, train, fine-tune, or experiment with the model, you will need to install AI4Bharat NeMo. We recommend installing it with the command below:
git clone https://github.com/AI4Bharat/NeMo.git && cd NeMo && git checkout nemo-v2 && bash reinstall.sh
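Once the installation finishes, a quick import check confirms the toolkit is available. This is a minimal sketch that only assumes the standard NeMo package layout:

# Sketch: verify the NeMo installation
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)  # prints the installed NeMo version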
Features
Language
The model is trained on data from the 22 scheduled languages of India.
Input
This model accepts 16 kHz (16000 Hz), mono-channel audio (WAV files) as input.
Output
This model produces Conformer encoder embeddings as output for a given audio sample.
Documentation
Model Architecture
The encoder is a Conformer-Large model with roughly 120M parameters, consisting of 17 Conformer blocks with a model dimension of 512.
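As a sanity check, the parameter count can be inspected once the model is loaded. This is a minimal sketch that assumes the loaded model exposes its encoder as model.encoder, as NeMo encoder-decoder models typically do; the pretrained-model name is taken from the usage example below:

# Sketch: load the checkpoint and count encoder parameters
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/MahaDhwani_pretrained_conformer")
num_params = sum(p.numel() for p in model.encoder.parameters())
print(f"Encoder parameters: {num_params / 1e6:.1f}M")  # expected to be on the order of 120M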
Usage Examples
Basic Usage
Download and load the model from Hugging Face.
import pydub
import numpy as np
import torch
import nemo.collections.asr as nemo_asr

# Download the pretrained encoder from Hugging Face
model = nemo_asr.models.ASRModel.from_pretrained("ai4bharat/MahaDhwani_pretrained_conformer")

# Use a GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Freeze the weights for inference and move the model to the chosen device
model.freeze()
model = model.to(device)
Prepare an audio file by running the command below in your terminal. This converts the audio to 16000 Hz, mono-channel WAV.
ffmpeg -i sample_audio.wav -ac 1 -ar 16000 sample_audio_infer_ready.wav
Advanced Usage
# Load the audio, resample to 16 kHz, and downmix to a single channel
wavpath = 'sample_audio_infer_ready.wav'
wav = pydub.AudioSegment.from_file(wavpath).set_frame_rate(16000).set_channels(1)
sarray = wav.get_array_of_samples()

# Convert the samples to a (1, num_samples) float tensor and move it to the model's device
fp_arr = np.array(sarray).astype(np.float64).reshape((1, -1))
feature = torch.from_numpy(fp_arr).float().to(device)
length = torch.tensor([fp_arr.shape[1]]).to(device)

# Forward pass through the self-supervised model to obtain encoder embeddings
spectrograms, spec_masks, encoded, encoded_len = model(input_signal=feature, input_signal_length=length)
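The encoded tensor holds the frame-level Conformer embeddings. The follow-up below is a minimal sketch that assumes the usual NeMo Conformer layout of (batch, d_model, num_frames); it masks padded frames and mean-pools over time to get one utterance-level vector:

# Sketch: encoded is assumed to have shape (batch, d_model, num_frames)
print(encoded.shape, encoded_len)

# Mean-pool the valid frames of the first utterance into a single embedding
num_valid = int(encoded_len[0])
utt_embedding = encoded[0, :, :num_valid].mean(dim=-1)  # shape: (d_model,)
print(utt_embedding.shape)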
License
This project is licensed under the MIT license.