Open-Source Cantonese Speech Recognition Model wav2vec2-large-xlsr-cantonese

Wav2vec2 Large Xlsr Cantonese

Developed by ctl

A Cantonese speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53 model. It expects audio input sampled at 16 kHz.

Speech Recognition, Other. Open Source License: Apache-2.0. Tags: #Cantonese Speech Recognition #Low CER #XLSR Fine-tuning

Downloads 1,010

Release Time: 3/2/2022

Model Overview

This is an Automatic Speech Recognition (ASR) model optimized for Cantonese, based on Facebook's wav2vec2-large-xlsr-53 architecture and fine-tuned using the Common Voice Cantonese dataset.

Model Features

Cantonese Optimization

Specifically fine-tuned for Cantonese speech characteristics to improve recognition accuracy

No Language Model Required

Can be used directly without additional language model support

16 kHz Sampling Rate

Expects audio input sampled at 16 kHz; audio recorded at other rates must be resampled before inference

Model Capabilities

Cantonese Speech Recognition

Automatic Speech-to-Text

Use Cases

Speech Transcription

Cantonese Speech-to-Text

Convert Cantonese speech content into text

Test CER is 15.36%

Voice Assistant

Cantonese Voice Interaction

Provide voice interaction capability for Cantonese users

🚀 Wav2Vec2-Large-XLSR-53-Cantonese

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Cantonese using the Common Voice dataset. It is designed for automatic speech recognition.

Model Information

Property	Details
Model Type	wav2vec2-large-xlsr-cantonese
Training Data	Common Voice (train, validation sets)
Metrics	CER (Test CER: 15.36%)
License	apache-2.0

⚠️ Important Note

When using this model, make sure that your speech input is sampled at 16kHz.
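
For uncompressed WAV files, the sampling rate can be checked before inference with Python's standard `wave` module. The sketch below is illustrative only: the helper names `wav_sample_rate` and `is_16khz_wav` are hypothetical and not part of the model's API, and compressed formats such as MP3 need an audio library (for example torchaudio) instead.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz_wav(path):
    """Return True if the WAV file is already at the expected rate."""
    return wav_sample_rate(path) == TARGET_RATE
```

If the check fails, resample the audio to 16 kHz before passing it to the processor.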

🚀 Quick Start

This model is based on facebook/wav2vec2-large-xlsr-53 and was fine-tuned on Cantonese using the Common Voice dataset.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "zh-HK", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("ctl/wav2vec2-large-xlsr-cantonese") 
model = Wav2Vec2ForCTC.from_pretrained("ctl/wav2vec2-large-xlsr-cantonese")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: read each audio file into an array.
# The resampler assumes 48 kHz source audio and converts it to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])

📚 Documentation

Evaluation

The model can be evaluated on the Chinese (Hong Kong) test data of Common Voice as follows:

# Requires the jiwer package for the CER metric: pip install jiwer
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

lang_id = "zh-HK" 
model_id = "ctl/wav2vec2-large-xlsr-cantonese"

chars_to_ignore_regex = '[\,\?\.\!\-\;\:"\“\%\‘\”\�\．\⋯\！\－\：\–\。\》\,\）\,\？\；\～\~\…\︰\，\（\」\‧\《\﹔\、\—\／\,\「\﹖\·\']'

test_dataset = load_dataset("common_voice", f"{lang_id}", split="test") 
cer = load_metric("cer")

processor = Wav2Vec2Processor.from_pretrained(f"{model_id}") 
model = Wav2Vec2ForCTC.from_pretrained(f"{model_id}") 
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: strip punctuation from the reference text
# and read each audio file into a 16 kHz array.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on each batch and decode the predicted ids into text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=16)

print("CER: {:.2f}".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 15.51%
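
For readers unfamiliar with the metric, character error rate (CER) is the number of character-level edits (insertions, deletions, substitutions) needed to turn the prediction into the reference, divided by the number of reference characters. The sketch below is a plain-Python illustration under that definition; the function names are hypothetical, the example sentences are made up, and the metric library used in the evaluation script may differ in details such as whitespace handling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, predictions):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    chars = sum(len(r) for r in references)
    return edits / chars


# Made-up example: one wrong character out of five.
print("CER: {:.2f}%".format(100 * cer(["今日天氣好"], ["今日天汽好"])))  # CER: 20.00%
```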

Training

The Common Voice train and validation sets were used for training. The script used for training will be posted here.

📄 License

This model is released under the Apache-2.0 license.
