whisper-small-japanese Open-source Japanese Speech Recognition Model

Whisper Small Japanese

Developed by Ivydata

This is a Japanese speech recognition model fine-tuned from openai/whisper-small for Japanese speech-to-text tasks.

Speech Recognition

Transformers

Japanese

Open Source License: Apache-2.0

#Japanese speech recognition #Low CER #Multi-dataset training

Downloads 356

Release Time: 5/19/2023

Model Overview

This model is openai/whisper-small fine-tuned for Japanese on the Common Voice, JVS, and JSUT datasets. It is intended for Japanese speech recognition tasks.

Model Features

Japanese optimization

Fine-tuned specifically on Japanese speech. On the TEDxJP-10K dataset it reaches a CER of 23.10%, lower than the two wav2vec2-based Japanese models listed in the Test Result section

Multi-dataset training

Trained on three Japanese datasets: Common Voice, JVS, and JSUT

16kHz sampling rate support

Expects speech input sampled at 16kHz

Model Capabilities

Japanese speech recognition

Speech-to-text

Use Cases

Speech transcription

Japanese meeting minutes

Convert Japanese meeting recordings into text transcripts

Japanese subtitle generation

Automatically generate subtitles for Japanese video content
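
For the subtitle use case, transcribed text has to be paired with timing information and written in a subtitle format. The sketch below formats segments as SRT using only the standard library. It assumes the segments, given as (start_seconds, end_seconds, text) tuples, are already available; the function names `srt_timestamp` and `to_srt` are hypothetical helpers for illustration and are not part of the model or of Transformers. Note that the usage example on this page decodes without timestamps, so the segment timings would have to come from elsewhere, for instance from the boundaries of the audio chunks you transcribe.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Build SRT text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each SRT block: running index, time range, then the subtitle text.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Blocks are separated by a blank line.
    return "\n".join(blocks)
```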

🚀 Fine-tuned Japanese Whisper model for speech recognition using whisper-small

This project presents a Whisper model fine-tuned for Japanese speech recognition. It fine-tunes the openai/whisper-small model on Japanese datasets for speech-to-text conversion in Japanese.

🚀 Quick Start

This model is fine-tuned on Japanese using Common Voice, JVS, and JSUT. When using this model, ensure that your speech input is sampled at 16kHz.
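
Because the model expects 16kHz input, it can help to check a file's sampling rate before transcribing it. The sketch below uses only Python's standard `wave` module, so it assumes uncompressed WAV input; `TARGET_SR`, `wav_sample_rate`, and `needs_resampling` are hypothetical names introduced for illustration and are not part of the model's API.

```python
import wave

TARGET_SR = 16_000  # sampling rate the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_sr=TARGET_SR):
    """Return True if the file's sampling rate differs from the target."""
    return wav_sample_rate(path) != target_sr
```

A file that fails this check can be resampled while loading, for example with `librosa.load(path, sr=16_000)`, which is the call the usage example on this page relies on.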

✨ Features

Fine-tuned on multiple Japanese datasets for better performance in Japanese speech recognition.
Loads through the standard Transformers classes WhisperProcessor and WhisperForConditionalGeneration, so it can be integrated into existing speech recognition pipelines.

📦 Installation

The original model card lists no installation steps. The usage example requires the transformers, datasets, librosa, and torch Python packages.

💻 Usage Examples

Basic Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
import librosa
import torch

LANG_ID = "ja"
MODEL_ID = "Ivydata/whisper-small-japanese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
model.config.suppress_tokens = []

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    batch["sampling_rate"] = sampling_rate
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
sample = test_dataset[0]
input_features = processor(sample["speech"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# ['<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>木村さんに電話を貸してもらいました。<|endoftext|>']

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# ['木村さんに電話を貸してもらいました。']
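
The two `batch_decode` calls differ only in whether Whisper's control tokens, written as `<|...|>`, are kept in the output. The sketch below imitates the effect of `skip_special_tokens=True` on the raw string with a regular expression. This is an illustration only: the tokenizer itself filters by token ID rather than by pattern, and `strip_special_tokens` is a hypothetical helper, not a Transformers function. It may still be handy when you only have raw decoded strings, such as saved logs.

```python
import re

# Whisper control tokens look like <|ja|>, <|transcribe|>, <|endoftext|>.
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def strip_special_tokens(text):
    """Remove <|...|> markers from a decoded string (illustration only)."""
    return SPECIAL_TOKEN.sub("", text).strip()


raw = (
    "<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>"
    "木村さんに電話を貸してもらいました。<|endoftext|>"
)
print(strip_special_tokens(raw))  # 木村さんに電話を貸してもらいました。
```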

📚 Documentation

Test Result

The table below reports the Character Error Rate (CER) of each model on the TEDxJP-10K dataset.

Model	CER
Ivydata/whisper-small-japanese	23.10%
Ivydata/wav2vec2-large-xlsr-53-japanese	27.87%
jonatasgrosman/wav2vec2-large-xlsr-53-japanese	34.18%
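
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference and the model output, divided by the number of characters in the reference. The sketch below is a minimal standard-library version of that definition. It is not the evaluation script behind the numbers above: the source does not say which text normalization (punctuation handling, for example) was applied, and `edit_distance` and `cer` are hypothetical helper names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `cer("こんにちは", "こんばんは")` counts two substituted characters against a five-character reference. Because Japanese text has no spaces between words, a character-level metric like this is a common choice over word error rate.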

📄 License

This project is licensed under the Apache 2.0 license.
