Wav2vec2 Large Xlsr Javanese
A Javanese automatic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, trained on high-quality Javanese TTS data from OpenSLR.
Downloads 659
Release Time : 3/2/2022
Model Overview
This is an optimized automatic speech recognition model for Javanese, capable of converting Javanese speech into text.
Model Features
High-quality Javanese recognition
A speech recognition model specifically optimized for Javanese, achieving a WER of 17.61% on the OpenSLR dataset.
Based on XLSR pre-trained model
Fine-tuned from facebook/wav2vec2-large-xlsr-53, leveraging large-scale cross-lingual speech representation learning.
No language model required
Can be used directly without additional language model support.
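Running without a language model usually means greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, then drop the blank token. The sketch below shows that rule on plain Python lists. The toy vocabulary, the blank id of 0, and the name `ctc_greedy_decode` are assumptions for illustration and are not taken from this model's processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real model ships its own vocabulary file.
VOCAB = {0: "<pad>", 1: "a", 2: "k", 3: "u", 4: " "}
BLANK_ID = 0  # assumed: the CTC blank shares the padding token id


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, then remove blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(vocab[token_id] for token_id in collapsed if token_id != blank_id)


# Made-up per-frame argmax ids for a short utterance.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
print(ctc_greedy_decode(frames))  # aku
```

A blank between two identical tokens keeps them separate, so `[1, 0, 1]` decodes to `"aa"` rather than `"a"`.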
Model Capabilities
Javanese speech recognition
Automatic speech-to-text
Use Cases
Speech transcription
Javanese speech transcription
Convert Javanese speech content into text format
Achieved a word error rate of 17.61% on the test set
Voice assistants
Javanese voice interaction
Used to develop voice assistant applications supporting Javanese
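The word error rate quoted above is the word-level edit distance between the prediction and the reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition, not the metric implementation used for the reported score; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of five reference words.
print(word_error_rate("aku arep lunga menyang pasar", "aku arep lunga pasar"))  # 0.2
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.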
Wav2Vec2-Large-XLSR-Javanese
This is a fine-tuned model based on facebook/wav2vec2-large-xlsr-53, which can be used for Javanese automatic speech recognition.
Metadata
Property | Details |
---|---|
Language | Javanese |
Datasets | OpenSLR |
Metrics | WER |
Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
License | Apache-2.0 |
Model Name | XLSR Wav2Vec2 Javanese by cahya |
Task | Speech Recognition (automatic-speech-recognition) |
Dataset Name | OpenSLR High quality TTS data for Javanese |
Dataset Type | OpenSLR |
Dataset Args | jv |
Metric Name | Test WER |
Metric Type | wer |
Metric Value | 17.61 |
Quick Start
The model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on the OpenSLR High quality TTS data for Javanese. When using this model, ensure that your speech input is sampled at 16kHz.
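Because the model expects 16 kHz input, it can help to check a recording's sampling rate before transcription. The sketch below uses only the standard-library `wave` module, so it assumes uncompressed PCM WAV files; the helper names and the silent demo clip are hypothetical.

```python
import os
import tempfile
import wave

TARGET_RATE = 16_000  # sampling rate the model expects


def wav_sample_rate(wav_path):
    """Read the sampling rate from a PCM WAV header."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(wav_path, target_rate=TARGET_RATE):
    """Return True when the file's sampling rate differs from target_rate."""
    return wav_sample_rate(wav_path) != target_rate


def write_silent_wav(wav_path, sample_rate, num_frames=160):
    """Write a short mono 16-bit silent clip, used here only as demo input."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_frames)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_48k.wav")
    write_silent_wav(demo_path, 48_000)
    print(wav_sample_rate(demo_path), needs_resampling(demo_path))
```

When `needs_resampling` returns True, the audio should be resampled to 16 kHz from its actual source rate, for instance with `torchaudio.transforms.Resample` as in the usage example.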
Usage Examples
Basic Usage
from pathlib import Path

import pandas as pd
import torch
import torchaudio
from datasets import Dataset
from datasets.utils.download_manager import DownloadManager
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


def load_dataset_javanese():
    # Download and extract the female and male speaker archives from OpenSLR.
    urls = [
        "https://www.openslr.org/resources/41/jv_id_female.zip",
        "https://www.openslr.org/resources/41/jv_id_male.zip"
    ]
    dm = DownloadManager()
    download_dirs = dm.download_and_extract(urls)
    data_dirs = [
        Path(download_dirs[0])/"jv_id_female/wavs",
        Path(download_dirs[1])/"jv_id_male/wavs",
    ]
    filenames = [
        Path(download_dirs[0])/"jv_id_female/line_index.tsv",
        Path(download_dirs[1])/"jv_id_male/line_index.tsv",
    ]

    dfs = []
    dfs.append(pd.read_csv(filenames[0], sep='\t', names=["path", "sentence"]))
    dfs.append(pd.read_csv(filenames[1], sep='\t', names=["path", "client_id", "sentence"]))
    dfs[1] = dfs[1].drop(["client_id"], axis=1)

    # Turn each bare file id into the full path of its .wav file.
    for i, data_dir in enumerate(data_dirs):
        dfs[i]["path"] = str(data_dir) + "/" + dfs[i]["path"] + ".wav"
    df = pd.concat(dfs)
    # df = df.sample(frac=1, random_state=1).reset_index(drop=True)
    dataset = Dataset.from_pandas(df)
    dataset = dataset.remove_columns('__index_level_0__')

    return dataset.train_test_split(test_size=0.1, seed=1)


dataset = load_dataset_javanese()
test_dataset = dataset['test']

processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-javanese")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-javanese")

# The OpenSLR recordings are resampled from 48 kHz to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)


# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch


test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token for every frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
Documentation
Evaluation
The model can be evaluated as follows or using this notebook
import torch
import torchaudio
from datasets import load_dataset, load_metric, Dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re
from datasets.utils.download_manager import DownloadManager
from pathlib import Path
import pandas as pd
def load_dataset_javanese():
urls = [
"https://www.openslr.org/resources/41/jv_id_female.zip",
"https://www.openslr.org/resources/41/jv_id_male.zip"
]
dm = DownloadManager()
download_dirs = dm.download_and_extract(urls)
data_dirs = [
Path(download_dirs[0])/"jv_id_female/wavs",
Path(download_dirs[1])/"jv_id_male/wavs",
]
filenames = [
Path(download_dirs[0])/"jv_id_female/line_index.tsv",
Path(download_dirs[1])/"jv_id_male/line_index.tsv",
]
dfs = []
dfs.append(pd.read_csv(filenames[0], sep='\t', names=["path", "sentence"]))
dfs.append(pd.read_csv(filenames[1], sep='\t', names=["path", "client_id", "sentence"]))
dfs[1] = dfs[1].drop(["client_id"], axis=1)
for i, dir in enumerate(data_dirs):
dfs[i]["path"] = dfs[i].apply(lambda row: str(data_dirs[i]) + "/" + row + ".wav", axis=1)
df = pd.concat(dfs)
# df = df.sample(frac=1, random_state=1).reset_index(drop=True)
dataset = Dataset.from_pandas(df)
dataset = dataset.remove_columns('__index_level_0__')
return dataset.train_test_split(test_size=0.1, seed=1)
dataset = load_dataset_javanese()
test_dataset = dataset['test']
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-javanese")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-javanese")
model.to("cuda")
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\â\%\â\'\â_\īŋŊ]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocessing the datasets.
# We need to read the aduio files as arrays
def speech_file_to_array_fn(batch):
batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
speech_array, sampling_rate = torchaudio.load(batch["path"])
batch["speech"] = resampler(speech_array).squeeze().numpy()
return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
# Preprocessing the datasets.
# We need to read the aduio files as arrays
def evaluate(batch):
inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
pred_ids = torch.argmax(logits, dim=-1)
batch["pred_strings"] = processor.batch_decode(pred_ids)
return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
Test Result: 17.61 %
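The evaluation script strips punctuation from the reference sentences and lowercases them before scoring, so that references and predictions are compared in the same form. A small standalone sketch of that normalization step follows. It assumes an ASCII-only subset of the ignored characters, and `normalize_transcript` is a hypothetical helper name.

```python
import re

# Assumption: an ASCII-only subset of the punctuation ignored during evaluation.
CHARS_TO_IGNORE = re.compile(r"[,?.!\-;:\"%'_]")


def normalize_transcript(sentence: str) -> str:
    """Remove ignored punctuation and lowercase the sentence."""
    return CHARS_TO_IGNORE.sub("", sentence).lower()


print(normalize_transcript('Sugeng enjing, "Pak"!'))  # sugeng enjing pak
```

Ignored characters are deleted rather than replaced by a space, so a hyphenated pair such as `a-b` becomes the single token `ab`.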
Training
OpenSLR High quality TTS data for Javanese was used for training. The scripts used for training and for evaluation are linked from the original model card.
License
This project is licensed under the Apache-2.0 license.