Fine-tuned Japanese Whisper model for speech recognition using whisper-small
This project presents a fine-tuned Japanese Whisper model for speech recognition. It fine-tunes the openai/whisper-small model on Japanese datasets to improve speech-to-text conversion in Japanese.
Quick Start
This model is fine-tuned on Japanese speech from the Common Voice, JVS, and JSUT datasets. When using this model, make sure that your speech input is sampled at 16 kHz.
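The 16 kHz requirement can be checked before any audio reaches the model. The following is a minimal sketch using only the Python standard library, and it assumes the input is an uncompressed WAV file. The helper names `wav_sample_rate` and `needs_resampling` are illustrative and are not part of this project.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sampling rate the model card asks for, in Hz


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, the audio has to be resampled before transcription. Loading the file with `librosa.load(path, sr=16_000)`, as the usage example in this document does, resamples on load.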
Features
- Fine-tuned on multiple Japanese datasets for better performance in Japanese speech recognition.
- Can be easily integrated into existing speech-recognition pipelines.
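Installation
The original model card does not list installation steps. The usage example below imports transformers, datasets, librosa, and torch, so these packages must be installed first.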
Usage Examples
Basic Usage
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
import librosa
import torch

LANG_ID = "ja"
MODEL_ID = "Ivydata/whisper-small-japanese"
SAMPLES = 10

# Load the first few Japanese test samples from Common Voice.
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

# Force Japanese transcription instead of relying on language detection.
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
model.config.suppress_tokens = []


def speech_file_to_array_fn(batch):
    # Load the audio file and resample it to 16 kHz, as the model expects.
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    batch["sampling_rate"] = sampling_rate
    return batch


test_dataset = test_dataset.map(speech_file_to_array_fn)

# Convert one sample into the log-mel input features Whisper consumes.
sample = test_dataset[0]
input_features = processor(
    sample["speech"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
predicted_ids = model.generate(input_features)

# Decode twice: once keeping Whisper's special tokens, once as plain text.
transcription_with_tokens = processor.batch_decode(predicted_ids, skip_special_tokens=False)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
Documentation
Test Result
The table below reports the Character Error Rate (CER) on the TEDxJP-10K dataset for this model and two wav2vec2 models for comparison. Lower is better.
| Model | CER |
| --- | --- |
| Ivydata/whisper-small-japanese | 23.10% |
| Ivydata/wav2vec2-large-xlsr-53-japanese | 27.87% |
| jonatasgrosman/wav2vec2-large-xlsr-53-japanese | 34.18% |
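For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference and the predicted transcript, divided by the number of reference characters. The sketch below implements that standard definition in plain Python. It is an illustration only: the source does not say which tool, text normalization, or aggregation was used to produce the figures above, and the function names are hypothetical.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("こんにちは", "こんばんは")` is 0.4, since two of the five reference characters are substituted. Because CER is normalized by the reference length, it can exceed 1.0 when the prediction contains many extra characters.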
License
This project is licensed under the Apache 2.0 license.