Whisper-base-japanese Open-source Model - Free to Deploy and Use for Japanese Speech Recognition Tasks

Whisper Base Japanese

Developed by Ivydata

This model is openai/whisper-base fine-tuned on the Common Voice, JVS, and JSUT datasets for Japanese speech recognition.

Speech Recognition

Transformers

Japanese

Open Source License: Apache-2.0

#Japanese Speech Recognition #Low Error Rate #Multi-dataset Training

Downloads 137

Release Time: 5/17/2023

Model Overview

This is a Japanese speech recognition model based on the Whisper architecture, fine-tuned to convert Japanese speech into text.

Model Features

Japanese Optimization

Fine-tuned specifically for Japanese speech characteristics to improve recognition accuracy.

Multi-dataset Training

Trained on three Japanese datasets (Common Voice, JVS, and JSUT) that cover a range of speech scenarios.

16kHz Sampling Rate Support

Expects audio input sampled at 16 kHz, a rate widely used in speech applications.

Model Capabilities

Japanese Speech-to-Text

Continuous Speech Recognition

General Speech Transcription

Use Cases

Speech Transcription

Japanese Meeting Minutes

Automatically transcribe Japanese meeting recordings into text records.

Japanese Voice Assistant

Provide speech recognition capabilities for Japanese voice assistants.

Education

Japanese Learning Aid

Assist Japanese learners by transcribing spoken practice into text.

🚀 Fine-tuned Japanese Whisper model for speech recognition using whisper-base

This is a Japanese speech recognition model fine-tuned from openai/whisper-base on Japanese datasets.

🚀 Quick Start

This model is openai/whisper-base fine-tuned on Japanese using Common Voice, JVS, and JSUT. When using this model, make sure that your speech input is sampled at 16 kHz.
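One way to check the sampling rate before transcription is to read it from the file header. The following is a minimal sketch using only the Python standard library, and it assumes the input is an uncompressed WAV file. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of the model or of any library.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model card asks for


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the WAV file is not already at the target rate."""
    return wav_sample_rate(path) != target_rate
```

If a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as the usage example does, resamples the audio during loading.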

💻 Usage Examples

Basic Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
import librosa
import torch

LANG_ID = "ja"
MODEL_ID = "Ivydata/whisper-base-japanese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
model.config.suppress_tokens = []

# Preprocess the dataset: load each audio file as an array resampled to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    # Upper-casing only affects Latin characters; Japanese text is unchanged.
    batch["sentence"] = batch["sentence"].upper()
    batch["sampling_rate"] = sampling_rate
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
sample = test_dataset[0]
input_features = processor(sample["speech"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# ['<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>木村さんに電話を貸してもらいました。<|endoftext|>']

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# ['木村さんに電話を貸してもらいました。']
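
The example above transcribes a single short clip. Whisper models work on a 30-second input window, so a longer recording (a meeting, for instance) has to be split before it is passed to the processor. The sketch below shows one naive way to do that with fixed, non-overlapping chunks. The function name `split_into_chunks` is hypothetical, and the model card does not prescribe any particular chunking strategy.

```python
SAMPLE_RATE = 16_000   # samples per second expected by the model
CHUNK_SECONDS = 30     # length of Whisper's input window


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Yield consecutive, non-overlapping slices of at most chunk_seconds each."""
    chunk_size = sample_rate * chunk_seconds
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]
```

Each chunk can then be run through `processor(...)` and `model.generate(...)` as in the example, and the decoded strings joined. Fixed boundaries can cut a word in half, so overlapping windows or splitting on silence may give cleaner results.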

📚 Documentation

The table below reports the Character Error Rate (CER) of each model on the TEDxJP-10K dataset.

Model	CER
Ivydata/whisper-small-japanese	27.25%
Ivydata/wav2vec2-large-xlsr-53-japanese	27.87%
jonatasgrosman/wav2vec2-large-xlsr-53-japanese	34.18%
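
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference and the predicted transcript, divided by the number of characters in the reference. The sketch below implements that standard definition in plain Python. It is an illustration only: the model card does not say which tool or text normalization (punctuation handling, for example) produced the figures above, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the count is per character rather than per word, the metric suits Japanese, which is written without spaces between words.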

📄 License

This project is licensed under the Apache-2.0 license.
