Wav2vec2 Large Xlsr Sundanese
A Sundanese speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, trained on high-quality TTS data from OpenSLR
Downloads 339
Release Time : 3/2/2022
Model Overview
This is an automatic speech recognition (ASR) model for Sundanese, fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint.
Model Features
High Accuracy
Achieves a 6.19% Word Error Rate (WER) on a held-out 10% test split of the OpenSLR Sundanese data
No Language Model Required
Can be used directly without additional language model support
16 kHz Sampling Rate
Expects speech input sampled at 16 kHz; audio recorded at another rate must be resampled first
Model Capabilities
Sundanese speech recognition
Audio to text conversion
Speech processing
Use Cases
Speech Transcription
Sundanese Speech Transcription
Convert Sundanese speech content into text
Reported 6.19% WER on the held-out test split
Voice Assistants
Sundanese Voice Interface
Provide voice control functionality for Sundanese users
Wav2Vec2-Large-XLSR-Sundanese
This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on the OpenSLR Sundanese dataset, aiming to provide high-quality automatic speech recognition for the Sundanese language.
General Information
| Property | Details |
|---|---|
| Language | Sundanese |
| Datasets | OpenSLR |
| Metrics | WER (Word Error Rate) |
| Tags | Audio, Automatic Speech Recognition, Speech, XLSR-Fine-Tuning-Week |
| License | Apache 2.0 |
| Model Name | XLSR Wav2Vec2 Sundanese by cahya |
| Task | Speech Recognition (Automatic Speech Recognition) |
| Dataset Name | OpenSLR High quality TTS data for Sundanese |
| Test WER | 6.19% |
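
For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is an independent, standard-library illustration of that definition; the function name `word_error_rate` is hypothetical and this is not the implementation behind the `wer` metric used in the evaluation script.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Single-row Levenshtein distance computed over words, not characters.
        row = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            prev_diag, row[0] = row[0], i
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                prev_diag, row[j] = row[j], min(
                    row[j] + 1,        # deletion of a reference word
                    row[j - 1] + 1,    # insertion of an extra predicted word
                    prev_diag + cost,  # substitution or exact match
                )
        total_edits += row[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# Placeholder tokens, not real Sundanese transcripts.
print(word_error_rate(["a b c d"], ["a x c"]))
```

The function assumes at least one non-empty reference; an empty reference set would divide by zero.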
Quick Start
This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on the OpenSLR High quality TTS data for Sundanese. When using this model, ensure that your speech input is sampled at 16 kHz.
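
As a quick pre-flight check on the 16 kHz requirement, the sketch below reads a WAV header with Python's standard `wave` module and reports whether the file would need resampling. The helper name `needs_resampling` and the constant `TARGET_SAMPLE_RATE` are illustrative, not part of the model's API, and the approach only covers uncompressed PCM WAV files.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the note above


def needs_resampling(wav_path, target_rate=TARGET_SAMPLE_RATE):
    """Return (needs_resampling, source_rate) for a PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wav_file:
        source_rate = wav_file.getframerate()
    return source_rate != target_rate, source_rate
```

A file that reports a different rate, such as the 48 kHz assumed for the OpenSLR recordings in the examples on this page, should be resampled (for instance with `torchaudio.transforms.Resample`) before it is passed to the processor.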
Features
- Fine-tuned on high-quality Sundanese speech data.
- Can be used directly for speech recognition without a language model.
- Supports evaluation and training using publicly available scripts.
Usage Examples
Basic Usage
import torch
import torchaudio
from datasets import Dataset
from datasets.utils.download_manager import DownloadManager
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pathlib import Path
import pandas as pd


def load_dataset_sundanese():
    # Download and extract the female and male OpenSLR Sundanese archives.
    urls = [
        "https://www.openslr.org/resources/44/su_id_female.zip",
        "https://www.openslr.org/resources/44/su_id_male.zip"
    ]
    dm = DownloadManager()
    download_dirs = dm.download_and_extract(urls)
    data_dirs = [
        Path(download_dirs[0])/"su_id_female/wavs",
        Path(download_dirs[1])/"su_id_male/wavs",
    ]
    filenames = [
        Path(download_dirs[0])/"su_id_female/line_index.tsv",
        Path(download_dirs[1])/"su_id_male/line_index.tsv",
    ]

    # The separators are regular expressions, so use the Python parser engine.
    dfs = []
    dfs.append(pd.read_csv(filenames[0], sep='\t4?\t', names=["path", "sentence"], engine="python"))
    dfs.append(pd.read_csv(filenames[1], sep='\t\t', names=["path", "sentence"], engine="python"))

    # Turn each utterance ID into the full path of its WAV file.
    for i, data_dir in enumerate(data_dirs):
        dfs[i]["path"] = dfs[i].apply(lambda row: str(data_dir) + "/" + row["path"] + ".wav", axis=1)
    df = pd.concat(dfs)
    # df = df.sample(frac=1, random_state=1).reset_index(drop=True)
    dataset = Dataset.from_pandas(df)
    dataset = dataset.remove_columns('__index_level_0__')

    # Hold out 10% of the data as the test split (fixed seed for reproducibility).
    return dataset.train_test_split(test_size=0.1, seed=1)


dataset = load_dataset_sundanese()
test_dataset = dataset['test']

processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-sundanese")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-sundanese")

# Resample from 48 kHz (the rate this script assumes for the recordings) to 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)


# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch


test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Greedy decoding: take the most likely token per frame, then let the processor decode.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
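
The example above decodes without a language model: it takes the most likely token for every audio frame and leaves the rest to `processor.batch_decode`. As a rough illustration of what greedy CTC decoding does with those per-frame IDs, the sketch below collapses consecutive repeats and then drops the blank token. The toy vocabulary, the `blank_id`, and the function name `greedy_ctc_decode` are assumptions for illustration only; the real tokenizer uses the model's own vocabulary and also handles special tokens such as its word delimiter.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse runs of identical frame labels, then remove CTC blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; ID 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "a", 2: "b", 3: " "}
frames = [1, 1, 0, 1, 2, 2, 0, 0, 3, 1]
print(greedy_ctc_decode(frames, toy_vocab))
```

The blank between the first two runs of `1` is what lets the decoder emit a doubled letter; without it, the repeated frames would collapse into a single character.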
Documentation
Evaluation
The model can be evaluated with the script below or with the accompanying notebook.
import torch
import torchaudio
from datasets import load_dataset, load_metric, Dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets.utils.download_manager import DownloadManager
import re
from pathlib import Path
import pandas as pd
def load_dataset_sundanese():
urls = [
"https://www.openslr.org/resources/44/su_id_female.zip",
"https://www.openslr.org/resources/44/su_id_male.zip"
]
dm = DownloadManager()
download_dirs = dm.download_and_extract(urls)
data_dirs = [
Path(download_dirs[0])/"su_id_female/wavs",
Path(download_dirs[1])/"su_id_male/wavs",
]
filenames = [
Path(download_dirs[0])/"su_id_female/line_index.tsv",
Path(download_dirs[1])/"su_id_male/line_index.tsv",
]
dfs = []
dfs.append(pd.read_csv(filenames[0], sep='\t4?\t', names=["path", "sentence"]))
dfs.append(pd.read_csv(filenames[1], sep='\t\t', names=["path", "sentence"]))
for i, dir in enumerate(data_dirs):
dfs[i]["path"] = dfs[i].apply(lambda row: str(data_dirs[i]) + "/" + row + ".wav", axis=1)
df = pd.concat(dfs)
# df = df.sample(frac=1, random_state=1).reset_index(drop=True)
dataset = Dataset.from_pandas(df)
dataset = dataset.remove_columns('__index_level_0__')
return dataset.train_test_split(test_size=0.1, seed=1)
dataset = load_dataset_sundanese()
test_dataset = dataset['test']
wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("cahya/wav2vec2-large-xlsr-sundanese")
model = Wav2Vec2ForCTC.from_pretrained("cahya/wav2vec2-large-xlsr-sundanese")
model.to("cuda")
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\â\%\â\'\â_\īŋŊ]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocessing the datasets.
# We need to read the aduio files as arrays
def speech_file_to_array_fn(batch):
batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
speech_array, sampling_rate = torchaudio.load(batch["path"])
batch["speech"] = resampler(speech_array).squeeze().numpy()
return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
# Preprocessing the datasets.
# We need to read the audio files as arrays
def evaluate(batch):
inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
pred_ids = torch.argmax(logits, dim=-1)
batch["pred_strings"] = processor.batch_decode(pred_ids)
return batch
result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
Test Result: 6.19%
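
The evaluation script strips punctuation from each reference transcript and lowercases it before scoring, so that casing and punctuation differences are not counted as word errors. The sketch below shows that normalization step in isolation. The character set is a simplified ASCII-only assumption and the function name `normalize_transcript` is hypothetical; the script defines its own `chars_to_ignore_regex`.

```python
import re

# Assumed ASCII punctuation set for illustration; the evaluation script
# defines its own, slightly larger character class.
PUNCTUATION = re.compile(r"[,?.!\-;:\"%'_]")


def normalize_transcript(sentence):
    """Remove punctuation and lowercase a reference transcript."""
    return PUNCTUATION.sub("", sentence).lower()


# Placeholder text, not a real Sundanese transcript.
print(normalize_transcript("Hello, World!"))
```

Applying the same normalization to every reference keeps WER figures comparable between runs; scoring against raw, punctuated text would generally give a higher error rate for the same predictions.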
Training
The OpenSLR High quality TTS data for Sundanese was used for training. Both the training script and the evaluation script are publicly available.
License
This project is licensed under the Apache 2.0 license.