W2v Hf Commonvoice From Xlsr53 Pretrain 0329UTC1500

Developed by qqpann

A Japanese speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53 on the Common Voice Japanese dataset

Speech Recognition

Transformers

#Japanese speech recognition #XLSR large model fine-tuning #No language model dependency

Downloads 15

Release Time : 3/2/2022

Model Overview

This is a Japanese automatic speech recognition (ASR) model fine-tuned from the XLSR-53 checkpoint. It expects audio input sampled at 16 kHz.

Model Features

Japanese speech recognition

Speech recognition capability specifically optimized for Japanese

Based on XLSR architecture

Model architecture pre-trained using large-scale cross-lingual representation learning

No language model required

Can be used directly without additional language model support

Model Capabilities

Japanese speech-to-text

Automatic speech recognition

16kHz audio processing

Use Cases

Speech transcription

Japanese speech transcription

Convert Japanese speech content into text

Word error rate 70.18%

Voice assistant

Japanese voice command recognition

Recognize Japanese voice commands

🚀 Wav2Vec2-Large-XLSR-53-Japanese

This model is fine-tuned from facebook/wav2vec2-large-xlsr-53 on Japanese using datasets like Common Voice. It's designed for automatic speech recognition tasks.

Dataset and Metrics

Property	Details
Datasets	common_voice
Metrics	wer, cer
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning-week
License	apache-2.0

Model Results

Task	Dataset	Metrics
Speech Recognition (automatic-speech-recognition)	Common Voice ja (type: common_voice, args: ja)	Test WER: 70.1869

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16kHz.
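
As a minimal sketch of that requirement, the check below reads the sampling rate from a WAV header before any audio reaches the model. It assumes uncompressed PCM WAV input and uses only Python's standard `wave` module; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of the original README. Other formats, such as the MP3 clips handled in the usage examples, would need a different loader.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

A file for which `needs_resampling` returns True should be resampled first, for example with `torchaudio.transforms.Resample` as in the usage examples.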

✨ Features

Fine-tuned on Japanese language data, suitable for Japanese speech recognition tasks.
Can be used directly without a language model.
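
Decoding without a language model usually means greedy CTC decoding: take the most likely label per frame, collapse consecutive repeats, and drop the blank label. The sketch below illustrates that idea in plain Python. The toy vocabulary and `blank_id=0` are assumptions for illustration only; the real vocabulary and blank ID come from the model's processor, and this is not the library's actual implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame labels, then drop blanks (greedy CTC decoding)."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # Emit a token only when the label changes and is not the blank.
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens)


# Hypothetical toy vocabulary; the real one ships with the processor.
toy_vocab = {0: "<pad>", 1: "こ", 2: "ん", 3: "に", 4: "ち", 5: "は"}

# Per-frame argmax labels, with repeats and blanks as a CTC model would emit.
frames = [1, 1, 0, 2, 2, 2, 3, 0, 0, 4, 5, 5]
print(ctc_greedy_decode(frames, toy_vocab))
```

A blank between two identical labels keeps them separate, which is how CTC can output a doubled character.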

📦 Installation

The original README provides no installation steps. The usage examples import torch, torchaudio, datasets, and transformers, so those packages must be available.

💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ja", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("qqhann/w2v_hf_commonvoice_from_xlsr53_pretrain_0329UTC1500")
model = Wav2Vec2ForCTC.from_pretrained("qqhann/w2v_hf_commonvoice_from_xlsr53_pretrain_0329UTC1500")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: read each audio file into an array
# and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])

Advanced Usage

The model can be evaluated as follows on the Japanese test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "ja", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("qqhann/w2v_hf_commonvoice_from_xlsr53_pretrain_0329UTC1500")
model = Wav2Vec2ForCTC.from_pretrained("qqhann/w2v_hf_commonvoice_from_xlsr53_pretrain_0329UTC1500")
model.to("cuda")

# Punctuation stripped from the reference sentences before scoring.
# Adapt this list to match the special characters removed from the training data.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: strip punctuation from the reference text,
# read each audio file into an array, and resample it to 16 kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predicted IDs to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 70.18 %
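
The card lists both wer and cer as metrics, but the evaluation script above reports only WER. A whitespace-split WER treats a sentence written without spaces as a single word, so a character-level rate can be a useful companion for Japanese text. The following is a minimal pure-Python sketch of both rates built on Levenshtein distance; the function names are hypothetical, and it is not the implementation behind `load_metric("wer")`.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous_row = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current_row.append(min(
                previous_row[j] + 1,         # deletion
                current_row[j - 1] + 1,      # insertion
                previous_row[j - 1] + cost,  # substitution or match
            ))
        previous_row = current_row
    return previous_row[-1]


def error_rate(references, predictions, unit="char"):
    """Total edit distance divided by total reference length over a corpus."""
    # unit="char" gives CER; any other value splits on whitespace for WER.
    split = list if unit == "char" else str.split
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, predictions):
        ref_units, hyp_units = split(ref), split(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return total_edits / total_length
```

With the script above, `error_rate(result["sentence"], result["pred_strings"])` would give a character error rate over the same predictions.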

🔧 Technical Details

The Common Voice train, validation, and ... datasets were used for training as well as ... and ...

The script used for training can be found here
