wav2vec2-large-xlsr-53-japanese Open-source Model - Supports Japanese Speech Recognition, Compatible with 16kHz Audio


Wav2vec2 Large Xlsr 53 Japanese

Developed by Ivydata

Japanese speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, supporting 16kHz sampled audio input

Speech Recognition

Transformers

Japanese

Open Source License: Apache-2.0

#Japanese speech recognition #Low CER performance #Multi-dataset fine-tuning

Downloads 19

Release date: 5/11/2023

Model Overview

A speech recognition model fine-tuned from the XLSR-53 large model on the Japanese datasets Common Voice, JVS, and JSUT, designed specifically for Japanese speech-to-text tasks.

Model Features

Multi-dataset fine-tuning

Fine-tuned using three Japanese datasets (Common Voice, JVS, and JSUT) to enhance the model's Japanese speech recognition capability

No language model required

Can be used directly without additional language model support

High performance

Achieves a CER of 27.87% on the TEDxJP-10K dataset, lower than the two other Japanese wav2vec2 models listed in the test results (34.18% and 37.72%)

Model Capabilities

Japanese speech recognition

16kHz audio processing

Real-time speech-to-text

Use Cases

Speech transcription

Japanese meeting minutes

Automatically convert Japanese meeting recordings into text transcripts

Approximately 72.13% character-level accuracy (100% minus the 27.87% CER measured on TEDxJP-10K, not on meeting recordings)

Japanese subtitle generation

Automatically generate subtitles for Japanese video content

Voice assistant

Japanese voice command recognition

Used for voice command recognition in Japanese voice assistants or smart home devices

🚀 Fine-tuned Japanese Wav2Vec2 model for speech recognition using XLSR-53 large

This project presents a fine-tuned Japanese Wav2Vec2 model for speech recognition. It builds on the XLSR-53 large architecture and is trained on multiple Japanese datasets (Common Voice, JVS, and JSUT).

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 for Japanese, trained on Common Voice, JVS, and JSUT. When using this model, make sure your speech input is sampled at 16kHz.
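
Because the model expects 16kHz input, it can help to check a recording's sampling rate before transcribing it. The following is a minimal sketch that uses only Python's standard `wave` module and therefore covers uncompressed WAV files only. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of the model or of any library.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sampling rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns True for a file, the audio should be resampled before it reaches the model. Loading it with `librosa.load(path, sr=16_000)`, as the usage example on this page does, is one way to do that.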

💻 Usage Examples

Basic Usage

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ja"
MODEL_ID = "Ivydata/wav2vec2-large-xlsr-53-japanese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocess the dataset: read each audio file into an array.
# librosa.load resamples to 16 kHz, the rate the model expects.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference: ", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
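
The `torch.argmax` step followed by `processor.batch_decode` in the example above amounts to greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that idea on a toy input in plain Python. The three-entry vocabulary and the blank id of 0 are assumptions made for illustration, not the model's real vocabulary, and the function is not the library's implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Merge consecutive repeated ids, drop blanks, and join the tokens."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens)


# Toy vocabulary (hypothetical): id 0 stands in for the CTC blank.
toy_vocab = {0: "<pad>", 1: "人", 2: "間"}

# One predicted id per audio frame, as argmax over the logits would give.
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0]
print(ctc_greedy_decode(frames, toy_vocab))  # 人間
```

A blank between two identical ids keeps them separate, so `[1, 0, 1]` decodes to two characters while `[1, 1]` decodes to one. No language model is involved at any point, which is consistent with the card's note that the model can be used without one.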

📚 Documentation

Test Result

The following table shows the Character Error Rate (CER) of the model tested on the TEDxJP-10K dataset.

Model	CER
Ivydata/wav2vec2-large-xlsr-53-japanese	27.87%
jonatasgrosman/wav2vec2-large-xlsr-53-japanese	34.18%
vumichien/wav2vec2-large-xlsr-japanese	37.72%

Test Inference Examples

Reference	Prediction
ただ選択するのではなくどう考えて選択をするのか	ただ洗濯するのではなくどう考えて洗択をするのか
この巨大な構造物を宇宙に作ることができた人間	この巨大な構造物を宇宙に作ることができた人間
何かしら嫌いになっていってしまったわけですよね	何にかしら気段になっっていってしまったおけどすね
そんな僕だからこそ言えることは筋肉を変えれば自分が変わってくるし	んな僕らからこスえることは筋肉を変えれば自分が変わってくし
そうするとその言葉を使って未来のイメージを形作っていくことができると	そうするとその言葉を使って未来のイメーージを形作っていことができると
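
To make the CER figures more concrete, here is a minimal sketch of how a character error rate can be computed for a single reference and prediction pair: the Levenshtein (edit) distance between the two strings divided by the length of the reference. It is applied to the first two example pairs from the table above. The exact text normalization and averaging used for the reported TEDxJP-10K numbers are not stated on this page, so this sketch is not expected to reproduce them.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_char != hyp_char)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# First two reference/prediction pairs from the table above.
pairs = [
    ("ただ選択するのではなくどう考えて選択をするのか",
     "ただ洗濯するのではなくどう考えて洗択をするのか"),
    ("この巨大な構造物を宇宙に作ることができた人間",
     "この巨大な構造物を宇宙に作ることができた人間"),
]
for reference, prediction in pairs:
    print(f"CER: {cer(reference, prediction):.2%}")
```

Corpus-level CER is commonly reported as the total number of edits divided by the total number of reference characters, rather than as the mean of per-sentence rates. The two can differ when sentence lengths vary.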

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you want to cite this model, you can use the following BibTeX entry:

@misc{Ivydata2023-wav2vec2-xlsr53-large-japanese,
  title={Fine-tuned Japanese Wav2Vec2 model for speech recognition using XLSR-53 large},
  author={Kosuke Suzuki},
  howpublished={\url{https://huggingface.co/Ivydata/wav2vec2-large-xlsr-53-japanese/}},
  year={2023}
}
