Malaysian Finetune Whisper Base
This project focuses on fine-tuning the Whisper Base model on a Malaysian dataset. It aims to enhance the model's performance in transcribing Malaysian languages, including Malay and English, with various accents and dialects.
Quick Start
Installation
Ensure you have the necessary libraries installed. You can install them using pip:
pip install transformers datasets requests
Usage Examples
Basic Usage
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from datasets import Audio
import requests
# Whisper expects 16 kHz audio, so decode the file at that sampling rate.
sr = 16000
audio = Audio(sampling_rate=sr)
processor = AutoProcessor.from_pretrained("mesolitica/malaysian-whisper-base")
model = AutoModelForSpeechSeq2Seq.from_pretrained("mesolitica/malaysian-whisper-base")
# Download the test clip and decode it into a waveform array.
response = requests.get('https://huggingface.co/datasets/huseinzol05/malaya-speech-stt-test-set/resolve/main/test.mp3')
y = audio.decode_example(audio.encode_example(response.content))['array']
# Convert the waveform into input features for the model.
inputs = processor([y], return_tensors='pt')
# Generate token IDs with Malay as the forced language, then decode them to text.
r = model.generate(inputs['input_features'], language='ms', return_timestamps=True)
processor.tokenizer.decode(r[0])
The output for Malay language prediction:
'<|startoftranscript|><|ms|><|transcribe|> Zamily On Aging di Vener Australia, Australia yang telah diadakan pada tahun 1982 dan berasaskan unjuran tersebut maka jabatan perangkaan Malaysia menganggarkan menjelang tahun 2005 sejumlah 15% penduduk kita adalah daripada kalangan warga emas. Untuk makluman Tuan Yang Pertua dan juga Alian Bohon, pembangunan sistem pendafiran warga emas ataupun kita sebutkan event adalah usaha kerajaan ke arah merealisasikan objektif yang telah digangkatkan<|endoftext|>'
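The decoded string above still contains Whisper's special tokens such as `<|startoftranscript|>` and `<|endoftext|>`. The tokenizer can drop them itself if `skip_special_tokens=True` is passed to `decode`. As a minimal sketch for text that has already been decoded, the helper below removes them with a regular expression; `strip_special_tokens` is a hypothetical name introduced here for illustration and is not part of the model or of `transformers`.

```python
import re

# Matches Whisper-style special tokens such as <|startoftranscript|>, <|ms|> or <|0.00|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(decoded: str) -> str:
    """Remove <|...|> tokens from an already decoded transcript (illustrative helper)."""
    return SPECIAL_TOKEN.sub("", decoded).strip()


sample = "<|startoftranscript|><|ms|><|transcribe|> selamat pagi<|endoftext|>"
print(strip_special_tokens(sample))  # selamat pagi
```

Because the pattern also matches timestamp tokens, this sketch discards the timing information produced by `return_timestamps=True`.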
Advanced Usage (Predicting in English)
r = model.generate(inputs['input_features'], language='en', return_timestamps=True)
processor.tokenizer.decode(r[0])
The output for English language prediction:
<|startoftranscript|><|en|><|transcribe|> Assembly on Aging, Divina Australia, Australia, which has been provided in 1982 and the operation of the transportation of Malaysia's implementation to prevent the tourism of the 25th, 15% of our population is from the market. For the information of the President and also the respected, the development of the market system or we have made an event.<|endoftext|>
Predicting Longer Audio
Important Note
Whisper processes at most 30 seconds of audio per pass, so you need to split longer audio into 30-second chunks and predict each chunk separately.
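The chunking step could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `chunk_audio` and `transcribe_long` are hypothetical helper names, the split uses fixed, non-overlapping 30-second windows, and `processor` and `model` are assumed to be the objects loaded in Basic Usage.

```python
def chunk_audio(samples, sample_rate=16000, chunk_seconds=30):
    """Split a 1-D sample sequence into consecutive chunks of at most chunk_seconds."""
    if sample_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sample_rate and chunk_seconds must be positive")
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size < 1:
        raise ValueError("chunk must contain at least one sample")
    # Slicing works for Python lists and NumPy arrays alike; the last chunk may be shorter.
    return [samples[start:start + chunk_size]
            for start in range(0, len(samples), chunk_size)]


def transcribe_long(samples, processor, model, language="ms", sample_rate=16000):
    """Transcribe each chunk independently and join the text (hypothetical helper)."""
    texts = []
    for chunk in chunk_audio(samples, sample_rate):
        inputs = processor([chunk], sampling_rate=sample_rate, return_tensors="pt")
        tokens = model.generate(inputs["input_features"], language=language)
        text = processor.tokenizer.decode(tokens[0], skip_special_tokens=True)
        texts.append(text.strip())
    return " ".join(texts)
```

With the objects from Basic Usage, `transcribe_long(y, processor, model, language='ms', sample_rate=sr)` would return one joined transcript. Fixed windows can cut a word at a chunk boundary, so splitting on silence or overlapping the chunks may give cleaner results.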
Documentation
Datasets Used
The model is fine-tuned on the following datasets:
1. IMDA STT, https://huggingface.co/datasets/mesolitica/IMDA-STT
2. Pseudolabel Malaysian YouTube videos, https://huggingface.co/datasets/mesolitica/pseudolabel-malaysian-youtube-whisper-large-v3
3. Malay Conversational Speech Corpus, https://huggingface.co/datasets/malaysia-ai/malay-conversational-speech-corpus
4. Haqkiem TTS Dataset (private, request access from https://www.linkedin.com/in/haqkiem-daim/)
5. Pseudolabel Nusantara audiobooks, https://huggingface.co/datasets/mesolitica/nusantara-audiobook
Languages Finetuned
- ms, Malay: covers both standard Malay and local Malay.
- en, English: covers both standard English and Manglish.
Project Links
- Script: https://github.com/mesolitica/malaya-speech/tree/malaysian-speech/session/whisper
- Wandb: https://wandb.ai/huseinzol05/malaysian-whisper-base?workspace=user-huseinzol05
- Wandb report: https://wandb.ai/huseinzol05/malaysian-whisper-base/reports/Finetune-Whisper--Vmlldzo2Mzg2NDgx