wav2vec2-large-xlsr-indonesian-artificial Open Source Model - Accurately Achieve Indonesian Speech Recognition

Wav2vec2 Large Xlsr Indonesian Artificial

Developed by cahya

This is an Indonesian speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Common Voice Indonesian dataset.

Speech Recognition, Other
Open Source License: Apache-2.0
Tags: #Indonesian Speech Recognition #XLSR Fine-tuning #No Language Model Dependency

Downloads 22

Release Time : 3/2/2022

Model Overview

This model performs automatic speech recognition (ASR) for Indonesian: it converts Indonesian speech into text.

Model Features

Fine-tuned on XLSR-53

The model is fine-tuned from the facebook/wav2vec2-large-xlsr-53 checkpoint and inherits the speech representations that checkpoint learned during pretraining.

Indonesian Language Support

Trained specifically for Indonesian speech recognition.

16kHz Sampling Rate Support

The model expects audio input sampled at 16 kHz. Audio recorded at any other rate must be resampled before inference.
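
A quick way to catch a mismatched input is to read the sampling rate from the file header before running inference. The sketch below assumes the audio is an uncompressed WAV file and uses only Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of the model's API.

```python
import wave

# Sampling rate the model expects, as stated in the model card.
TARGET_RATE = 16_000


def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != target_rate
```

For example, a 48 kHz Common Voice clip converted to WAV would make `needs_resampling` return `True`, which signals that a resampling step is required first.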

Model Capabilities

Indonesian Speech Recognition

Speech-to-Text

Use Cases

Speech Transcription

Voice Memo Transcription

Convert Indonesian voice memos into searchable text content.

Voice Assistants

Indonesian Voice Command Recognition

Provide speech recognition capabilities for Indonesian voice assistants.

🚀 Wav2Vec2-Large-XLSR-Indonesian

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53, trained on the Indonesian Artificial Common Voice dataset for automatic speech recognition. When using it, make sure your speech input is sampled at 16 kHz.

Property	Details
Model Type	XLSR Wav2Vec2 Indonesian with Artificial Voice by Cahya
Training Data	Artificial Common Voice `train`, `validation`, etc. datasets
Metrics	Word Error Rate (WER)
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0

🚀 Quick Start

This fine-tuned model is based on facebook/wav2vec2-large-xlsr-53 and trained on the Indonesian Artificial Common Voice dataset. Resample your speech input to 16 kHz before passing it to the model.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "id", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-indonesian")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-indonesian")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read each audio file as an array and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
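
The `torch.argmax` followed by `batch_decode` in the example above is greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates only that collapsing step in plain Python. The vocabulary, the blank id, and the function names are made up for illustration; the real mapping is defined by the processor's tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only.
BLANK_ID = 0
ID_TO_CHAR = {1: "a", 2: "p", 3: "i", 4: " "}


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge runs of identical ids, then remove the blank id."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


def decode(frame_ids):
    """Map collapsed ids to characters using the toy vocabulary."""
    return "".join(ID_TO_CHAR[t] for t in ctc_greedy_collapse(frame_ids))


# Per-frame argmax ids for a short utterance (made-up values).
print(decode([1, 1, 0, 2, 2, 2, 0, 3, 3]))  # prints "api"
```

A blank between two identical ids keeps them as separate characters, so `[2, 2, 0, 2]` collapses to two `p` tokens rather than one.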

📚 Documentation

Evaluation

The model can be evaluated on the Indonesian test data of Common Voice as follows:

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "id", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-indonesian")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-indonesian") 
model.to("cuda")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\'\”\�]'

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Strip punctuation from the reference text, lowercase it, and read each audio file as an array.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions
# with greedy (argmax) decoding.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 51.69 % WER
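
To make the reported number concrete: word error rate is the word-level edit distance (substitutions, insertions, and deletions) between prediction and reference, divided by the number of reference words. The sketch below is a plain-Python illustration, not the implementation behind the `wer` metric. It assumes a corpus-level score, where errors and words are summed over all sentence pairs. The `normalize` helper uses a simplified version of the punctuation pattern from the evaluation script, and all function names are hypothetical.

```python
import re

# Simplified punctuation pattern (an assumption, not the exact one above).
PUNCTUATION = re.compile(r"[,?.!\-;:\"“%‘'”]")


def normalize(text):
    """Strip punctuation and lowercase, as the evaluation script does."""
    return PUNCTUATION.sub("", text).lower()


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (row-by-row DP)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        total_errors += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words


refs = [normalize("Saya suka nasi goreng."), normalize("Selamat pagi!")]
hyps = ["saya suka nasi", "selamat malam"]
print("WER: {:.2f}".format(100 * corpus_wer(refs, hyps)))  # prints "WER: 33.33"
```

In this toy example there is one deletion and one substitution across six reference words, so the score is 2 / 6. A result of 51.69 % therefore means roughly one word-level error for every two reference words.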

Training

The Artificial Common Voice train, validation, and other datasets were used for training. The training script will be available soon.

📄 License

This project is licensed under the Apache-2.0 license.
