wav2vec2-large-xlsr-53-hk: Open-Source Cantonese Speech Recognition Model

Wav2vec2 Large Xlsr 53 Hk

Developed by voidful

A speech recognition model fine-tuned on Cantonese (using the Common Voice dataset) based on facebook/wav2vec2-large-xlsr-53

Speech Recognition

Transformers

Open Source License: Apache-2.0 #Cantonese Speech Recognition #Low CER #16kHz Sampling Rate

Downloads 26

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition model optimized for Cantonese (Hong Kong), based on the Wav2Vec2 architecture, suitable for converting Cantonese speech to text.

Model Features

Cantonese Optimization

Specially fine-tuned for the Cantonese (Hong Kong) dialect to improve recognition accuracy

Based on XLSR Model

Built on wav2vec2-large-xlsr-53, a model pretrained on speech in 53 languages, which supplies the speech feature extraction

16kHz Sampling Rate Support

Expects speech input sampled at 16kHz; audio recorded at other rates should be resampled first

Model Capabilities

Cantonese Speech Recognition

Speech-to-Text

Audio Content Transcription

Use Cases

Speech Transcription

Cantonese Meeting Minutes

Automatically convert Cantonese meeting recordings into text transcripts

CER: 16.41 on the Common Voice Cantonese (Hong Kong) test set

Media Content Subtitle Generation

Automatically generate subtitles for Cantonese video content

Voice Assistants

Cantonese Voice Command Recognition

Supports Cantonese voice control on smart devices

🚀 Wav2Vec2-Large-XLSR-53-hk

This project is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Cantonese, intended to provide high-quality automatic speech recognition for Cantonese.

🚀 Quick Start

The model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Cantonese using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.
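
Because the model expects 16kHz input, it can help to check a recording's sample rate before transcription. The following is a minimal sketch that uses only Python's standard `wave` module, so it handles uncompressed WAV files only. The helper name `needs_resampling` and the constant `TARGET_SAMPLE_RATE` are illustrative assumptions and are not part of the model card.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the note above


def needs_resampling(wav_path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the WAV file's sample rate differs from target_rate.

    Hypothetical helper for illustration; it reads only the WAV header.
    """
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

A file for which this returns True should be resampled to 16kHz before it is passed to the processor.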

✨ Features

Language Adaptation: Specifically fine-tuned for Cantonese, improving recognition accuracy for Cantonese speech.
Model Compatibility: Based on the Wav2Vec2-Large-XLSR-53 architecture, so it loads with the standard Wav2Vec2ForCTC and Wav2Vec2Processor classes from Transformers.

📦 Installation

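The original model card gives no installation steps. The examples below rely on torch, torchaudio, transformers, and datasets; the evaluation script additionally installs jiwer.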

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_name = "voidful/wav2vec2-large-xlsr-53-hk"
# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor_name = "voidful/wav2vec2-large-xlsr-53-hk"

chars_to_ignore_regex = r"[¥•＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·'℃°•·．﹑︰〈〉─《﹖﹣﹂﹁﹔！？｡。＂＃＄％＆＇（）＊＋，﹐－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.．!\\"#$%&()*+,\\-.\\:;<=>?@\\[\\]\\\\\\/^_`{|}~]"

# Load the fine-tuned model and its processor (feature extractor plus tokenizer).
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)

TARGET_SAMPLE_RATE = 16_000  # the model expects 16kHz input

def load_file_to_data(file):
    # Load the audio and resample from the file's actual rate to 16kHz.
    batch = {}
    speech, sample_rate = torchaudio.load(file)
    speech = torchaudio.functional.resample(speech, sample_rate, TARGET_SAMPLE_RATE)
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = TARGET_SAMPLE_RATE
    return batch


def predict(data):
    # Convert the raw audio into padded model inputs.
    features = processor(data["speech"], sampling_rate=data["sampling_rate"], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    # Run inference without tracking gradients.
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    # Greedy decoding: take the most likely token per frame, then decode to text.
    pred_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(pred_ids)

Advanced Usage

# Transcribe a single audio file; replace the placeholder with a real path.
# predict() returns a list containing one transcription string.
predict(load_file_to_data('voice file path'))
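
The `predict` function takes the argmax over the logits at each frame and then lets `processor.batch_decode` turn the resulting ids into text. For a CTC model, this decoding step collapses consecutive repeated ids and drops the blank (padding) token. The following is a minimal pure-Python sketch of that idea. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_decode` are assumptions for illustration; they do not reflect the model's real vocabulary or the library's internals.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, then drop blanks (greedy CTC decoding)."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary (hypothetical; the real model ships its own vocabulary).
toy_vocab = {1: "你", 2: "好"}
# Per-frame argmax ids: blank, 你, 你, blank, 好, 好, 好
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2], toy_vocab))  # prints 你好
```

A blank between two identical ids is what lets the same character appear twice in a row: `[1, 0, 1]` decodes to two characters, while `[1, 1]` decodes to one.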

📚 Documentation

You can try the model through this Colab trial.

🔧 Technical Details

The model is evaluated on the Cantonese (Hong Kong) test data of Common Voice. The CER computation follows the script published at https://huggingface.co/ctl/wav2vec2-large-xlsr-cantonese.

!mkdir cer
!wget -O cer/cer.py https://huggingface.co/ctl/wav2vec2-large-xlsr-cantonese/raw/main/cer.py
!pip install jiwer

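import torchaudio
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
import torch
import re

# Load the CER metric script downloaded above.
# Note: load_metric is deprecated in newer releases of datasets, so an older release may be needed.
cer = load_metric("./cer")
model_name = "voidful/wav2vec2-large-xlsr-53-hk"
# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor_name = "voidful/wav2vec2-large-xlsr-53-hk"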

chars_to_ignore_regex = r"[¥•＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·'℃°•·．﹑︰〈〉─《﹖﹣﹂﹁﹔！？｡。＂＃＄％＆＇（）＊＋，﹐－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏.．!\\"#$%&()*+,\\-.\\:;<=>?@\\[\\]\\\\\\/^_`{|}~]"

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(processor_name)

ds = load_dataset("common_voice", 'zh-HK', data_dir="./cv-corpus-6.1-2020-12-11", split="test")

resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def map_to_array(batch):
    # Load one clip, resample it to 16kHz, and strip punctuation from the reference text.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler.forward(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    return batch

ds = ds.map(map_to_array)

def map_to_pred(batch):
    # Transcribe a batch of clips with greedy (argmax) decoding.
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    batch["target"] = batch["sentence"]
    return batch

# Run inference in batches of 16 and keep only the predictions and references.
result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))

# Report the CER as a percentage with two decimal places.
print("CER: {:.2f}".format(100 * cer.compute(predictions=result["predicted"], references=result["target"])))

The model reaches a CER of 16.41% on this test set.
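
Character error rate is the character-level edit distance between the prediction and the reference, divided by the number of reference characters. The sketch below is a minimal pure-Python illustration of that definition. It is not the `cer.py` script used above, the function names are hypothetical, and details such as how scores are aggregated over the corpus may differ from that script.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_char != hyp_char)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def character_error_rate(references, predictions):
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# One substituted character out of four reference characters.
print(100 * character_error_rate(["今日天氣"], ["今日天汽"]))  # prints 25.0
```

Because Cantonese text is written without spaces between words, a character-level metric like this is a natural fit where a word error rate would need a separate segmentation step.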

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Model Type	Fine-tuned Wav2Vec2-Large-XLSR-53 for Cantonese
Training Data	Common Voice (zh-HK)
