wav2vec2-large-chinese-zh-cn Open-source Speech Recognition Model - Supports Accurate Recognition of 16kHz Chinese Speech

Wav2vec2 Large Chinese Zh Cn

Developed by wbbbbb

A Chinese speech recognition model fine-tuned from the XLSR-53 large model. It expects audio input sampled at 16kHz.

Speech Recognition

Transformers

Language: Chinese | Open Source License: Apache-2.0 | Tags: #Chinese Speech Recognition #XLSR Fine-tuning #Multi-source Data Training

Downloads 585

Release Time: 7/18/2022

Model Overview

This model is the XLSR-53 large model fine-tuned for Chinese speech recognition on Chinese speech datasets such as Common Voice. It can be used directly for speech-to-text tasks.

Model Features

Chinese Speech Recognition Optimization

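Fine-tuned specifically for Chinese speech. In the evaluation reported in this card, it reaches a lower character error rate on the Common Voice zh-CN test set (12.30%) than the two other XLSR-53 Chinese models it is compared with (19.03% and 20.95%).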

Multi-dataset Training

Trained using multiple Chinese speech datasets including Common Voice 6.1, CSS10, and ST-CMDS

No Language Model Required

Can be used directly without additional language model support

Model Capabilities

Chinese Speech Recognition

Speech-to-Text

16kHz Audio Processing

Use Cases

Speech Transcription

Automatic Meeting Minutes Transcription

Automatically convert Chinese meeting recordings into text records

Voice Note Conversion

Convert personal voice memos into searchable text

Accessibility Applications

Real-time Caption Generation

Provide real-time speech-to-text services for hearing-impaired users

🚀 Fine-tuned XLSR-53 large model for speech recognition in Chinese

This is a fine-tuned model based on facebook/wav2vec2-large-xlsr-53 for Chinese speech recognition, which can accurately transcribe Chinese speech.

🚀 Quick Start

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Chinese using the train and validation splits of Common Voice 6.1, CSS10 and ST-CMDS. When using this model, make sure that your speech input is sampled at 16kHz.

The model was fine-tuned on an RTX 3090 for 50 hours.

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
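Because the model expects 16kHz input, it can help to check a file's sample rate before transcribing it. The following is a minimal sketch that uses only Python's standard `wave` module, so it covers uncompressed WAV files only. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and do not come from the original model card.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate the model card says the input must use


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the WAV file at `path` is not already at `target_rate`."""
    return wav_sample_rate(path) != target_rate
```

A file that needs resampling can be loaded with `librosa.load(path, sr=16_000)`, which is how the evaluation script in this card reads its audio.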

✨ Features

High Compatibility: Fine-tuned on multiple Chinese datasets, suitable for various Chinese speech scenarios.
Ease of Use: Can be used directly without a language model.
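
Using a CTC model without a language model typically means greedy decoding: take the most likely token at each frame, merge consecutive repeats, and drop the blank token. The evaluation script in this card does this with `torch.argmax` followed by `processor.batch_decode`. The sketch below illustrates the collapse step in plain Python. The toy vocabulary and the use of index 0 as the blank token are assumptions for illustration, not the model's actual vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then drop blanks (greedy CTC decoding)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Keep an id only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank.
TOY_VOCAB = {0: "<blank>", 1: "你", 2: "好"}

# Per-frame argmax ids for one utterance, as an argmax over the logits would give.
frame_ids = [0, 1, 1, 0, 0, 2, 2, 2, 0]
text = "".join(TOY_VOCAB[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # 你好
```

A blank between two identical ids keeps them separate, so `[1, 0, 1]` collapses to `[1, 1]` rather than `[1]`. This is how CTC represents a doubled character.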

📦 Installation

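The original model card gives no installation steps. The examples below import the huggingsound package for basic usage, and torch, librosa, datasets, and transformers for evaluation.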

💻 Usage Examples

Basic Usage

The model can be used directly (without a language model) with the HuggingSound library:

from huggingsound import SpeechRecognitionModel
model = SpeechRecognitionModel("wbbbbb/wav2vec2-large-chinese-zh-cn")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = model.transcribe(audio_paths)
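
The snippet above stops at the `transcribe` call. As a sketch of how the results might be consumed, the helper below pairs each input path with its text. It assumes that `transcriptions` is a list of dicts with a `"transcription"` key, in the same order as `audio_paths`, which is how the HuggingSound README describes the output; check this against the installed version. The name `pair_paths_with_text` and the sample texts are made up for illustration.

```python
def pair_paths_with_text(audio_paths, transcriptions):
    """Map each audio path to its transcribed text.

    Assumes `transcriptions` is a list of dicts with a "transcription" key,
    in the same order as `audio_paths`.
    """
    if len(audio_paths) != len(transcriptions):
        raise ValueError("expected one transcription per audio path")
    return {
        path: result["transcription"]
        for path, result in zip(audio_paths, transcriptions)
    }


# Stand-in for the output of model.transcribe(audio_paths); the text is invented.
example_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
example_results = [{"transcription": "你好"}, {"transcription": "谢谢"}]

for path, text in pair_paths_with_text(example_paths, example_results).items():
    print(f"{path}: {text}")
```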

📚 Documentation

Evaluation

The model can be evaluated as follows on the Chinese (zh-CN) test data of Common Voice.

import torch
import re
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import warnings
import os

os.environ["KMP_AFFINITY"] = ""


LANG_ID = "zh-CN"
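# The original script pointed at a local checkpoint directory ("zh-CN-output-aishell");
# the published model ID is used here so the script runs without that directory.
MODEL_ID = "wbbbbb/wav2vec2-large-chinese-zh-cn"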
DEVICE = "cuda"

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer")
cer = load_metric("cer")



processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = (
        re.sub("([^\u4e00-\u9fa5\u0030-\u0039])", "", batch["sentence"]).lower() + " "
    )
    return batch


test_dataset = test_dataset.map(
    speech_file_to_array_fn,
    num_proc=15,
    remove_columns=['client_id', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
)

# Running inference.
# Each batch is padded, passed through the model, and decoded greedily (argmax per frame).
def evaluate(batch):
    inputs = processor(
        batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True
    )

    with torch.no_grad():
        logits = model(
            inputs.input_values.to(DEVICE),
            attention_mask=inputs.attention_mask.to(DEVICE),
        ).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch


result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.lower() for x in result["pred_strings"]]
references = [x.lower() for x in result["sentence"]]

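# chunk_size is kept from the original script; if the installed WER metric
# rejects this argument, drop it.
print(
    f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}"
)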
print(f"CER: {cer.compute(predictions=predictions, references=references) * 100}")
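
The reference sentences in the script above are normalized with a single regular expression before scoring. The sketch below isolates that step so its effect is easier to see: it keeps CJK ideographs in the range U+4E00 to U+9FA5 and the ASCII digits, deletes everything else, and appends the trailing space the script adds. The function name `normalize_sentence` and the sample input are illustrative.

```python
import re

# Same character class as the evaluation script: keep CJK ideographs in
# U+4E00..U+9FA5 and the ASCII digits 0-9, delete every other character.
DROP_PATTERN = re.compile("[^\u4e00-\u9fa5\u0030-\u0039]")


def normalize_sentence(sentence):
    """Strip disallowed characters and append the trailing space the script adds."""
    return DROP_PATTERN.sub("", sentence).lower() + " "


print(repr(normalize_sentence("你好, World 123!")))  # '你好123 '
```

Latin letters, punctuation, and spaces are all removed, so the `.lower()` call has no visible effect on the normalized text.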

Test Result:

The table below reports the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I also ran the evaluation script above on other models (on 2022-07-18). These results may differ from those reported elsewhere, possibly because of specifics of the other evaluation scripts used.

Model	WER	CER
wbbbbb/wav2vec2-large-chinese-zh-cn	70.47%	12.30%
jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn	82.37%	19.03%
ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt	84.01%	20.95%
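
One plausible reading of the large gap between WER and CER in this table is that WER counts whitespace-separated words, and the normalized Chinese references contain no internal spaces, so each sentence is scored as a single word. The sketch below illustrates this with a plain Levenshtein distance. It is not the metric implementation used by the evaluation script, and the sentence pair is an invented example rather than test data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate, with words defined by whitespace splitting."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "今天天气很好"   # illustrative sentence, not from the test set
hypothesis = "今天天汽很好"  # one wrong character

print(cer(reference, hypothesis))  # 1 of 6 characters is wrong
print(wer(reference, hypothesis))  # 1.0
```

With one wrong character out of six, the CER is one sixth, while the WER is 1.0 because the only "word" in the sentence does not match exactly. Under this reading, WER behaves like a sentence-level error rate for unsegmented Chinese text, and CER is the more informative figure.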

📄 License

This model is licensed under the Apache-2.0 license.

🔧 Technical Details

Model Type: Fine-tuned XLSR-53 large model
Training Data: Train and validation splits of Common Voice 6.1, CSS10 and ST-CMDS
Training Environment: Fine-tuned on an RTX 3090 for 50 hours

📖 Citation

If you want to cite this model you can use this:

@misc{grosman2021xlsr53-large-chinese,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {C}hinese},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/wbbbbb/wav2vec2-large-chinese-zh-cn}},
  year={2021}
}
