Wav2Vec2-Large-XLSR-Bengali
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Bengali. It was trained on a subset of 40,000 utterances drawn from the Bengali ASR training data set, which contains ~196K utterances. The Word Error Rate (WER) is measured on ~4,200 utterances held out from training.
Quick Start
When using this model, make sure that your speech input is sampled at 16 kHz. The training script can be found at train.py.
- Data Preparation Notebook: Link
- Inference Notebook: Link
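The 16 kHz requirement above can be checked before any audio reaches the model. The following is a minimal sketch, assuming the input is an uncompressed PCM WAV file. It uses only Python's standard `wave` module, and the helper names (`wav_sample_rate`, `needs_resampling`, `write_silent_wav`) are hypothetical, not part of this model's code.

```python
import os
import tempfile
import wave

TARGET_SR = 16_000  # sampling rate this model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_sr=TARGET_SR):
    """Return True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_sr


def write_silent_wav(path, sample_rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "demo_8k.wav")
        write_silent_wav(demo_path, 8_000)
        # An 8 kHz file must be resampled before it is passed to the model.
        print(wav_sample_rate(demo_path), needs_resampling(demo_path))
```

If `needs_resampling` returns True, the audio should be resampled to 16 kHz first, for example with `torchaudio.transforms.Resample` as in the usage example.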
Features
- Language: Bengali
- Datasets: OpenSLR
- Metrics: Word Error Rate (WER)
- Tags: bn, audio, automatic-speech-recognition, speech
- License: CC BY-SA 4.0
Installation
No specific installation steps are provided. The usage example imports torch, torchaudio, and transformers, so these packages need to be available in your environment.
Usage Examples
Basic Usage
The model can be used directly (without a language model) as follows:
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the fine-tuned processor (feature extractor + tokenizer) and model.
processor = Wav2Vec2Processor.from_pretrained("arijitx/wav2vec2-large-xlsr-bengali")
model = Wav2Vec2ForCTC.from_pretrained("arijitx/wav2vec2-large-xlsr-bengali")

def speech_file_to_array_fn(path):
    # Load the audio and resample it from its native rate to 16 kHz.
    speech_array, sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    return resampler(speech_array).squeeze().numpy()

speech_array = speech_file_to_array_fn("test_file.wav")
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy decoding: take the most likely token at every frame.
predicted_ids = torch.argmax(logits, dim=-1)
preds = processor.batch_decode(predicted_ids)[0]
print(preds.replace("[PAD]", ""))
Test result: WER of 32.45% on ~4,200 held-out utterances.
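Using the model "without a language model" means the transcript comes from greedy CTC decoding: take the argmax token for each frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates that rule on a toy example. The vocabulary, the frame ids, and the function name `ctc_greedy_decode` are hypothetical, and it assumes `[PAD]` acts as the CTC blank, as the `replace("[PAD]", "")` call in the usage example suggests. In practice `processor.batch_decode` handles this step.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id):
    """Collapse consecutive repeated ids, drop blanks, and join the tokens."""
    tokens = []
    previous_id = None
    for current_id in frame_ids:
        # Emit a token only when the id changes and is not the blank.
        if current_id != previous_id and current_id != blank_id:
            tokens.append(id_to_token[current_id])
        previous_id = current_id
    return "".join(tokens)


# Hypothetical toy vocabulary; the real model uses its own Bengali vocabulary.
toy_vocab = {0: "[PAD]", 1: "a", 2: "b", 3: " "}

# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would produce.
toy_frames = [1, 1, 0, 1, 2, 2, 0, 0, 3, 2]

print(ctc_greedy_decode(toy_frames, toy_vocab, blank_id=0))  # prints "aab b"
```

The blank between the two runs of `1` is what allows the repeated letter "a" to survive collapsing.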
Documentation
The model is fine-tuned on a subset of Bengali ASR training data. The test WER is calculated using held-out data. When using the model, ensure the input speech is sampled at 16 kHz.
Technical Details
The model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Bengali. It uses a subset of 40,000 utterances from the Bengali ASR training data set. The test WER is 32.45% on ~4,200 held-out utterances.
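For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the predicted transcript, divided by the number of reference words. The following is a minimal sketch of that definition for a single utterance. The function name `word_error_rate` is hypothetical, and this is not necessarily the exact evaluation code behind the 32.45% figure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# One substitution ("b" -> "x") and one deletion ("d") over 4 reference words.
print(word_error_rate("a b c d", "a x c"))  # prints 0.5
```

A corpus-level WER is usually computed by summing the edits over all utterances and dividing by the total number of reference words, rather than averaging per-utterance scores.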
License
This model is released under the CC BY-SA 4.0 license.
Model Index
| Property | Details |
|---|---|
| Model Name | XLSR Wav2Vec2 Bengali by Arijit |
| Task | Speech Recognition (automatic-speech-recognition) |
| Dataset | OpenSLR (ben) |
| Metrics | Test WER: 32.45% |