# Wav2Vec2-Large-XLSR-53-ncj/nah

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Nahuatl, specifically the Northern Puebla variant (ncj), to provide high-quality automatic speech recognition for Nahuatl.
## Information Table

| Property | Details |
| --- | --- |
| Language | nah (specifically ncj) |
| Datasets | A new dataset derived from SLR92, plus some samples from the es and de Common Voice datasets |
| Metrics | wer |
| Tags | audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week |
| License | apache-2.0 |
| Model Name | Nahuatl XLSR Wav2Vec 53 |
| Task | Speech Recognition (automatic-speech-recognition) |
| Test WER | 69.11 |
## Quick Start

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Nahuatl, specifically the Northern Puebla variant (ncj), using a derivative of SLR92 plus some samples from the es and de Common Voice datasets.
## Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "{lang_id}", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("tyoc213/wav2vec2-large-xlsr-nahuatl")
model = Wav2Vec2ForCTC.from_pretrained("tyoc213/wav2vec2-large-xlsr-nahuatl")
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: read the audio files as arrays and resample to 16 kHz
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Documentation

### Evaluation

The model can be evaluated as follows on the Nahuatl (ncj, Northern Puebla) test data of Common Voice.
```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "{lang_id}", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("tyoc213/wav2vec2-large-xlsr-nahuatl")
model = Wav2Vec2ForCTC.from_pretrained("tyoc213/wav2vec2-large-xlsr-nahuatl")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\"\“\%\‘\”\(\)\-]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing: normalize the sentences and read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
**Test Result**: 50.95 %
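Since the card reports word error rate (WER), here is a minimal self-contained sketch of how WER is computed: the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words. The `wer` function below is illustrative only, not the `datasets` metric used in the evaluation script above.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a three-word reference gives WER = 1/3
print(wer("se ome yei", "se nahui yei"))
```

A value of 50.95 % therefore means roughly half the reference words require an insertion, deletion, or substitution to match the prediction.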
### Training

A derivative of SLR92, to be published soon, plus some samples from the es and de Common Voice datasets.

The script used for training can be found in less60wer.ipynb.
## License

This project is licensed under the apache-2.0 license.