hubert-base-korean
Hubert-base-korean is a speech representation learning model for Korean automatic speech recognition. It uses self-supervised learning to train directly on raw audio waveforms, rather than on hand-crafted acoustic features or labeled transcripts.
Quick Start
Usage Examples
Basic Usage
import torch
from transformers import HubertModel

# Load the pretrained Korean HuBERT base model
model = HubertModel.from_pretrained("team-lucid/hubert-base-korean")

# Dummy input: one second of 16 kHz audio, batch size 1
wav = torch.ones(1, 16000)
outputs = model(wav)

print(f"Input: {wav.shape}")
print(f"Output: {outputs.last_hidden_state.shape}")
Advanced Usage
import jax.numpy as jnp
from transformers import FlaxAutoModel

# Load the Flax version of the model (custom code, hence trust_remote_code=True)
model = FlaxAutoModel.from_pretrained("team-lucid/hubert-base-korean", trust_remote_code=True)

# Dummy input: one second of 16 kHz audio, batch size 1
wav = jnp.ones((1, 16000))
outputs = model(wav)

print(f"Input: {wav.shape}")
print(f"Output: {outputs.last_hidden_state.shape}")
Features
Hubert (Hidden-Unit BERT) is a speech representation learning model proposed by Facebook. Unlike traditional speech recognition models, Hubert uses a self-supervised learning approach that learns directly from the raw waveform of the speech signal. This model was trained on Cloud TPUs provided by Google's TPU Research Cloud (TRC).
Model Description
| Property | Base | Large |
|---|---|---|
| CNN Encoder - Strides | 5, 2, 2, 2, 2, 2, 2 | 5, 2, 2, 2, 2, 2, 2 |
| CNN Encoder - Kernel Width | 10, 3, 3, 3, 3, 2, 2 | 10, 3, 3, 3, 3, 2, 2 |
| CNN Encoder - Channel | 512 | 512 |
| Transformer Encoder - Layer | 12 | 24 |
| Transformer Encoder - Embedding Dim | 768 | 1024 |
| Transformer Encoder - Inner FFN Dim | 3072 | 4096 |
| Transformer Encoder - Attention Heads | 8 | 16 |
| Projection - Dim | 256 | 768 |
| Params | 95M | 317M |
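The Base column can be cross-checked against the checkpoint's configuration. The attribute names below are the standard HubertConfig fields in transformers; that the released config follows this schema is an assumption.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("team-lucid/hubert-base-korean")

# CNN feature encoder
print(config.conv_stride)   # per-layer strides, e.g. (5, 2, 2, 2, 2, 2, 2)
print(config.conv_kernel)   # per-layer kernel widths, e.g. (10, 3, 3, 3, 3, 2, 2)
print(config.conv_dim)      # per-layer channel counts

# Transformer encoder (Base: 12 layers, 768-dim embeddings, 3072-dim FFN)
print(config.num_hidden_layers, config.hidden_size, config.intermediate_size, config.num_attention_heads)

# The conv strides determine how many frames one second of 16 kHz audio becomes
frames = 16000
for kernel, stride in zip(config.conv_kernel, config.conv_stride):
    frames = (frames - kernel) // stride + 1
print(frames)  # 49 for the strides and kernels listed above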
Technical Details
Training Data
This model was trained on approximately 4,000 hours of data extracted from Free Conversation Speech (General Male and Female), Multi-Speaker Speech Synthesis Data, and Broadcast Content Dialogue-Style Speech Recognition Data, datasets constructed with the support of the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Ministry of Science and ICT.
Training Procedure
As in the original paper, the Base model was first trained using MFCC-based targets. Then k-means clustering with 500 clusters was performed, and both the Base and Large models were retrained on the resulting cluster assignments, as sketched below.
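As a rough illustration of that second step, the sketch below clusters frame-level hidden states into 500 groups with scikit-learn and uses the cluster ids as pseudo-labels. The choice of layer, the random stand-in audio, and the use of the released checkpoint in place of the intermediate MFCC-trained model are all illustrative assumptions, not the actual training code.

import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel

# Stand-in for the first-iteration (MFCC-trained) Base model
model = HubertModel.from_pretrained("team-lucid/hubert-base-korean").eval()

def hidden_frames(wav):
    # Frame-level representations from an intermediate transformer layer (layer 6 here, an assumption)
    with torch.no_grad():
        out = model(wav, output_hidden_states=True)
    return out.hidden_states[6].squeeze(0).numpy()  # (frames, 768)

# Random audio standing in for (a subset of) the unlabeled corpus
utterances = [torch.randn(1, 16000 * 10) for _ in range(4)]
features = np.concatenate([hidden_frames(w) for w in utterances], axis=0)

# 500 clusters, as described above; the cluster ids become the "hidden units"
# that Base and Large are trained to predict for masked frames
kmeans = KMeans(n_clusters=500).fit(features)
targets = kmeans.predict(hidden_frames(utterances[0]))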
Training Hyperparameters
| Hyperparameter | Base | Large |
|---|---|---|
| Warmup Steps | 32,000 | 32,000 |
| Learning Rate | 5e-4 | 1.5e-3 |
| Batch Size | 128 | 128 |
| Weight Decay | 0.01 | 0.01 |
| Max Steps | 400,000 | 400,000 |
| Learning Rate Decay | 0.1 | 0.1 |
| Adam \(\beta_1\) | 0.9 | 0.9 |
| Adam \(\beta_2\) | 0.99 | 0.99 |
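As a sketch of how the Base column translates into an optimizer setup in PyTorch: AdamW with these betas and weight decay, warmup over 32,000 steps, and 400,000 total steps. The exact decay curve is not specified here, so the polynomial decay to 0.1x the peak rate below is an assumed interpretation of "Learning Rate Decay 0.1".

import torch
from transformers import HubertModel, get_polynomial_decay_schedule_with_warmup

model = HubertModel.from_pretrained("team-lucid/hubert-base-korean")

# Base-column values from the table above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,            # peak learning rate
    betas=(0.9, 0.99),  # Adam beta_1, beta_2
    weight_decay=0.01,
)

# 32,000 warmup steps, 400,000 total steps, ending at 0.1x the peak rate (assumption)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=32_000,
    num_training_steps=400_000,
    lr_end=5e-5,
)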
License
This model is licensed under the Apache 2.0 license.