Japanese Hubert Base Phoneme CTC
This model is a fine-tuned version of rinna/japanese-hubert-base for Japanese phoneme recognition using CTC.
✨ Features
Model Overview
- Fine-tuned rinna/japanese-hubert-base on the ReazonSpeech v2 dataset, treating the phoneme labels generated by pyopenjtalk-plus as ground truth.
- After training for about 0.3 epochs, the checkpoint with the best accuracy on the JSUT corpus (labels: https://github.com/sarulab-speech/jsut-label) was selected.
Hyperparameters
- Learning Rate
- CTC Head: 2e-5
- Others: 2e-6
- Batch Size: 32
- Maximum Audio Samples: 250000
- Optimization: AdamW
- betas: (0.9, 0.98)
- weight_decay: 0.01
- Learning Rate Scheduling: Cosine
- Warmup Steps: 10000
- Maximum Steps: 800000
  - However, training was stopped early at 200000 steps because accuracy on JSUT had stopped improving.
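As a concrete illustration, the schedule above (linear warmup over 10000 steps, then cosine decay toward the 800000-step horizon) can be written as a pure function. This is a sketch, not the training code: the names `cosine_lr`, `WARMUP_STEPS`, and `MAX_STEPS` are made up here, and the exact decay shape (cosine to zero after linear warmup) is an assumption about the scheduler used.

```python
import math

WARMUP_STEPS = 10_000  # warmup steps from the list above
MAX_STEPS = 800_000    # maximum steps from the list above

def cosine_lr(step: int, base_lr: float) -> float:
    """Linear warmup to base_lr, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return base_lr * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The CTC head uses base_lr = 2e-5; the rest of the model uses 2e-6.
print(cosine_lr(0, 2e-5))        # 0.0 at the start of warmup
print(cosine_lr(10_000, 2e-5))   # 2e-5 at the end of warmup
print(cosine_lr(200_000, 2e-5))  # partway down the cosine (where training stopped)
```

Because training stopped at 200000 of 800000 scheduled steps, the learning rate never reached the bottom of the cosine curve.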
💻 Usage Examples
Basic Usage
```python
import librosa
import numpy as np
import torch
from transformers import HubertForCTC, Wav2Vec2Processor

MODEL_NAME = "prj-beatrice/japanese-hubert-base-phoneme-ctc"

model = HubertForCTC.from_pretrained(MODEL_NAME)
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)

# Load audio at 16 kHz and pad with 1 s of silence before and 0.5 s after.
audio, sr = librosa.load("audio.wav", sr=16000)
audio = np.concatenate([np.zeros(sr), audio, np.zeros(sr // 2)])

inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Greedy CTC decoding: take the most likely token per frame, then let the
# processor collapse repeats and remove blanks.
predicted_ids = outputs.logits.argmax(-1)
phonemes = processor.decode(predicted_ids[0], spaces_between_special_tokens=True)
print(phonemes)
```
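For reference, `processor.decode` performs the standard CTC greedy collapse on the per-frame argmax ids. A minimal sketch of that step, assuming a blank token id of 0 (`ctc_greedy_collapse` is a hypothetical helper for illustration, not part of the model's API):

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse consecutive repeats, then drop blanks (CTC greedy decoding)."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Frames: blank, 3, 3, blank, 5, 5, 5, blank, 3
print(ctc_greedy_collapse([0, 3, 3, 0, 5, 5, 5, 0, 3]))  # [3, 5, 3]
```

Note that a repeated phoneme is only recognized as two tokens when a blank frame separates the two runs, which is why the blank symbol matters.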
📚 Documentation
Training Environment
absl-py==2.3.0
accelerate==1.7.0
aiohappyeyeballs==2.6.1
aiohttp==3.12.13
aiosignal==1.3.2
annotated-types==0.7.0
async-timeout==5.0.1
attrs==25.3.0
audioread==3.0.1
certifi==2025.6.15
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
coloredlogs==15.0.1
coverage==7.9.1
datasets==3.6.0
decorator==5.2.1
dill==0.3.8
evaluate==0.4.3
exceptiongroup==1.3.0
filelock==3.18.0
flatbuffers==25.2.10
frozenlist==1.7.0
fsspec==2025.3.0
gitdb==4.0.12
gitpython==3.1.44
grpcio==1.73.0
hf-xet==1.1.3
huggingface-hub==0.33.0
humanfriendly==10.0
idna==3.10
iniconfig==2.1.0
jinja2==3.1.6
jiwer==3.1.0
joblib==1.5.1
lazy-loader==0.4
librosa==0.11.0
llvmlite==0.44.0
markdown==3.8
markupsafe==3.0.2
mpmath==1.3.0
msgpack==1.1.1
multidict==6.4.4
multiprocess==0.70.16
networkx==3.4.2
numba==0.61.2
numpy==2.2.6
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvrtc-cu12==12.6.77
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.5.1.17
nvidia-cufft-cu12==11.3.0.4
nvidia-cufile-cu12==1.11.1.6
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.6.85
nvidia-nvtx-cu12==12.6.77
onnxruntime==1.22.0
packaging==25.0
pandas==2.3.0
platformdirs==4.3.8
pluggy==1.6.0
pooch==1.8.2
propcache==0.3.2
protobuf==6.31.1
psutil==7.0.0
pyarrow==20.0.0
pycparser==2.22
pydantic==2.11.7
pydantic-core==2.33.2
pygments==2.19.1
pyopenjtalk-plus==0.4.1.post3
pytest==8.4.0
pytest-cov==6.2.1
python-dateutil==2.9.0.post0
pytz==2025.2
pyyaml==6.0.2
rapidfuzz==3.13.0
regex==2024.11.6
requests==2.32.4
ruff==0.11.13
safetensors==0.5.3
scikit-learn==1.7.0
scipy==1.15.3
sentry-sdk==2.30.0
setproctitle==1.3.6
setuptools==80.9.0
six==1.17.0
smmap==5.0.2
soundfile==0.13.1
soxr==0.5.0.post1
sudachidict-core==20250515
sudachipy==0.6.10
sympy==1.14.0
tensorboard==2.19.0
tensorboard-data-server==0.7.2
threadpoolctl==3.6.0
tokenizers==0.21.1
tomli==2.2.1
torch==2.7.1
torchaudio==2.7.1
tqdm==4.67.1
transformers==4.52.4
triton==3.3.1
typing-extensions==4.14.0
typing-inspection==0.4.1
tzdata==2025.2
urllib3==2.4.0
wandb==0.20.1
werkzeug==3.1.3
xxhash==3.5.0
yarl==1.20.1
📄 License
This project is licensed under the Apache-2.0 license.