Multilingual Speech Recognition for Indonesian Languages
This project focuses on multilingual speech recognition for Indonesian languages. The model built here addresses the challenge of accurately transcribing speech in multiple Indonesian languages. It offers a practical solution for applications requiring speech-to-text conversion in these languages, enhancing accessibility and communication.
Quick Start
This is the model built for the project Multilingual Speech Recognition for Indonesian Languages. It is a facebook/wav2vec2-large-xlsr-53 model fine-tuned on the Indonesian Common Voice dataset, High-quality TTS data for Javanese - SLR41, and High-quality TTS data for Sundanese - SLR44 datasets.
We also provide a live demo to test the model.
When using this model, make sure that your speech input is sampled at 16 kHz.
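As a quick sanity check before transcription, the sampling rate of a WAV file can be read from its header. The sketch below uses only the Python standard library and handles uncompressed WAV files only. The helper names `needs_resampling` and `write_silence` and the file name `example_48k.wav` are hypothetical and not part of this project.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return (rate, flag) where flag is True if the file is not at target_rate."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
    return rate, rate != target_rate


def write_silence(path, rate, seconds=1):
    """Write mono 16-bit silence so the example needs no external audio."""
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        out.setframerate(rate)
        out.writeframes(b"\x00\x00" * (rate * seconds))


# Create a temporary 48 kHz file and check it against the 16 kHz target.
demo_path = os.path.join(tempfile.mkdtemp(), "example_48k.wav")
write_silence(demo_path, 48_000)
print(needs_resampling(demo_path))  # (48000, True)
```

A file that reports a rate other than 16 kHz can then be resampled, for instance with `torchaudio.transforms.Resample` as in the usage examples.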
Features
- Multilingual Support: Supports languages such as Indonesian (id), Javanese (jv), and Sundanese (sun).
- Fine-Tuned Model: Based on the pre-trained facebook/wav2vec2-large-xlsr-53 model, fine-tuned on specific Indonesian language datasets.
- Live Demo: Allows users to quickly test the model's performance.
Installation
No specific installation steps are provided. The usage examples below rely on the torch, torchaudio, datasets, and transformers Python packages.
Usage Examples
Basic Usage
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Indonesian Common Voice test split
test_dataset = load_dataset("common_voice", "id", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
# The resampler assumes 48 kHz source audio; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Read each audio file and store it as a resampled array
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# Pick the most likely token per frame, then decode to text
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
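The `torch.argmax` and `processor.batch_decode` steps above amount to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The following is a simplified pure-Python illustration of that collapse rule, not the `transformers` implementation. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_collapse` are assumptions made for the example.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove the blank id."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Hypothetical vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "a", 2: "p", 3: "i"}

# One id per audio frame, as argmax over the logits would produce.
frame_ids = [0, 1, 1, 0, 2, 2, 0, 3, 3, 3, 0]
collapsed = ctc_greedy_collapse(frame_ids)
print("".join(toy_vocab[i] for i in collapsed))  # api
```

Because repeats are merged before blanks are removed, a blank between two identical ids is what lets a genuinely doubled letter survive: `[1, 0, 1]` collapses to `[1, 1]`, while `[1, 1, 1]` collapses to `[1]`.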
Advanced Usage
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re
test_dataset = load_dataset("common_voice", "id", split="test")
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model.to("cuda")
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\â\%\â\'\â\īŋŊ]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)
def speech_file_to_array_fn(batch):
batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
speech_array, sampling_rate = torchaudio.load(batch["path"])
batch["speech"] = resampler(speech_array).squeeze().numpy()
return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
def evaluate(batch):
inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
pred_ids = torch.argmax(logits, dim=-1)
batch["pred_strings"] = processor.batch_decode(pred_ids)
return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
Evaluation Results
Test Result: 11.57 %
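For reference, word error rate is the word-level edit distance (substitutions, deletions, insertions) between prediction and reference, divided by the number of reference words. The sketch below is a minimal pure-Python version for illustration only. It is not the code behind the `wer` metric loaded above, and the function names and sample sentences are made up for the example.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists, one row at a time."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Total word errors divided by total reference words."""
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    return errors / total_words


# Hypothetical sentences: one dropped word out of six reference words.
references = ["saya suka makan nasi", "selamat pagi"]
predictions = ["saya suka makan", "selamat pagi"]
print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))  # WER: 16.67
```

This version pools errors over the whole set rather than averaging per-sentence rates, so long sentences weigh more than short ones.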
Documentation
Model Information
| Property | Details |
|---|---|
| Languages | id, jv, sun |
| Datasets | mozilla-foundation/common_voice_7_0, openslr, magic_data, titml |
| Metrics | wer |
| Tags | audio, automatic-speech-recognition, hf-asr-leaderboard, id, jv, robust-speech-event, speech, su |
| License | apache-2.0 |
Model Index
- Name: Wav2Vec2 Indonesian Javanese and Sundanese by Indonesian NLP
- Results:
  - Task: Automatic Speech Recognition
    - Dataset: Common Voice 6.1
      - Test WER: 4.056
      - Test CER: 1.472
    - Dataset: Common Voice 7
      - Test WER: 4.492
      - Test CER: 1.577
    - Dataset: Robust Speech Event - Dev Data
    - Dataset: Robust Speech Event - Test Data
Technical Details
The Common Voice train, validation, and ... datasets were used for training as well as ... and ... # TODO
The script used for training can be found here (will be available soon)
License
This project is licensed under the apache-2.0 license.