🚀 Whisper Fine-tuned Pronunciation Scorer
This model assesses the pronunciation quality of Korean speech. It is built on the openai/whisper-small model and fine-tuned on the Korea AI-Hub dataset.
✨ Features
- Pronunciation Assessment: This model assesses the pronunciation quality of Korean speech, providing a score from 1 to 5.
- Encoder-Decoder Architecture: It uses the encoder-decoder architecture of the Whisper model to extract speech features and an additional linear layer to predict the pronunciation score.
📦 Installation
To use this model, install the required libraries: torch, torchaudio, and transformers. They can be installed with pip or conda. For example:
pip install torch torchaudio transformers
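Before loading the model, it can help to confirm that the libraries are importable. The following is a minimal sketch that uses only the Python standard library; the helper name `missing_packages` is hypothetical and not part of this model.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The three libraries imported by the usage example.
    required = ["torch", "torchaudio", "transformers"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

The check only looks up top-level package names; it does not verify versions or GPU support.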
💻 Usage Examples
Basic Usage
import torch
import torch.nn as nn
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration


class WhisperPronunciationScorer(nn.Module):
    """Whisper encoder-decoder with a linear head that predicts a pronunciation score."""

    def __init__(self, pretrained_model):
        super().__init__()
        self.whisper = pretrained_model
        self.score_head = nn.Linear(self.whisper.config.d_model, 1)

    def forward(self, input_features, labels=None):
        outputs = self.whisper(input_features, labels=labels, output_hidden_states=True)
        # Final decoder layer, shape (batch, tokens, d_model).
        last_hidden_state = outputs.decoder_hidden_states[-1]
        # Mean-pool over tokens, then drop only the trailing feature dimension
        # so the batch dimension is preserved.
        scores = self.score_head(last_hidden_state.mean(dim=1)).squeeze(-1)
        return scores


def load_model(model_path, device):
    model_name = "openai/whisper-small"
    processor = WhisperProcessor.from_pretrained(model_name)
    pretrained_model = WhisperForConditionalGeneration.from_pretrained(model_name)
    model = WhisperPronunciationScorer(pretrained_model).to(device)
    # Load the fine-tuned weights (Whisper backbone plus score head).
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    return model, processor


def predict_pronunciation_score(model, processor, audio_path, transcript, device):
    audio, sr = torchaudio.load(audio_path)
    # Downmix to mono so multi-channel files are not treated as a batch.
    audio = audio.mean(dim=0)
    # Whisper expects 16 kHz input.
    if sr != 16000:
        audio = torchaudio.functional.resample(audio, sr, 16000)
    input_features = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt").input_features.to(device)
    # The transcript is tokenized and passed as decoder labels.
    labels = processor(text=transcript, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        score = model(input_features, labels)
    return score.item()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "path/to/your/model.pth"
model, processor = load_model(model_path, device)

audio_path = "path/to/your/audio.wav"
transcript = "안녕하세요"
score = predict_pronunciation_score(model, processor, audio_path, transcript, device)
print(f"Predicted pronunciation score: {score:.2f}")
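The score head is a plain linear layer, so nothing in the code above constrains its output to the 1 to 5 range. If a bounded value is needed for display, one option is to clamp the prediction afterwards. This is a sketch under that assumption; `clamp_score` is a hypothetical helper, not part of the released model.

```python
def clamp_score(raw_score, low=1.0, high=5.0):
    """Clip a raw regression output to the documented 1-5 scoring range."""
    return max(low, min(high, float(raw_score)))


# Example with made-up raw outputs (not real model predictions).
for raw in (0.4, 3.27, 5.8):
    print(f"{raw:.2f} -> {clamp_score(raw):.2f}")
```

Clamping changes only out-of-range values; in-range predictions pass through unchanged.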
📚 Documentation
Model Description
The Pronunciation Scorer takes an audio recording and its text transcript and returns a Korean pronunciation score on a scale of 1 to 5. It uses the encoder-decoder architecture of the Whisper model to extract speech features, then applies an additional linear layer to predict the pronunciation score.
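Reading the forward pass in the usage code, the score computation can be summarized as mean pooling followed by a linear projection. The symbols here are notation introduced for illustration: $h_t$ is the final-layer decoder hidden state at transcript position $t$, $T$ is the number of transcript tokens, and $w$ and $b$ are the weight vector and bias of the linear layer.

```latex
\bar{h} = \frac{1}{T} \sum_{t=1}^{T} h_t, \qquad s = w^{\top} \bar{h} + b
```

Because $s$ is an unbounded linear output, the 1 to 5 scale comes from the training labels rather than from the architecture itself.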
How to Use
To use this model, follow these steps:
- Install required libraries
- Load the model and processor
- Prepare your audio file and text transcript
- Predict the pronunciation score
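When several recordings need scoring, the last two steps can be wrapped in a loop. The sketch below assumes a scoring callable with the signature `score_fn(audio_path, transcript)`; both `score_many` and that signature are hypothetical conveniences, and a file that fails is recorded as `None` rather than stopping the batch.

```python
def score_many(pairs, score_fn):
    """Score each (audio_path, transcript) pair, continuing if one file fails."""
    results = {}
    for audio_path, transcript in pairs:
        try:
            results[audio_path] = score_fn(audio_path, transcript)
        except Exception as exc:  # e.g. an unreadable or missing audio file
            print(f"Skipping {audio_path}: {exc}")
            results[audio_path] = None
    return results


# Possible usage with the function from Basic Usage (paths are placeholders):
# scores = score_many(
#     [("path/to/your/audio.wav", "안녕하세요")],
#     lambda path, text: predict_pronunciation_score(model, processor, path, text, device),
# )
```

Passing the scorer in as a callable keeps the loop independent of how the model and processor were loaded.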
📄 License
This model is released under the Apache 2.0 license.
Additional Information
| Property | Details |
|---|---|
| Model Type | Whisper Fine-tuned Pronunciation Scorer |
| Training Data | Korea AI-Hub (https://www.aihub.or.kr/) foreigner Korean pronunciation evaluation dataset |
| Metrics | 1~5 |
| Pipeline Tag | audio-classification |