Phoneme Scorer v2 - Wav2Vec2 Open-source Speech Model: Accurately Achieve Phoneme Recognition!

Phoneme Scorer V2 Wav2vec2

Developed by ct-vikramanantha

An automatic speech recognition model based on Wav2Vec2-Base architecture, specifically fine-tuned for phoneme recognition on the LJSpeech Phonemes dataset

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Phoneme recognition #High-precision PER #English speech processing

Downloads 167

Release Time : 7/13/2024

Model Overview

This model is an automatic speech recognition (ASR) system focused on converting speech into phoneme sequences rather than word sequences. It uses International Phonetic Alphabet (IPA) phonemes as output units, suitable for speech processing tasks requiring phoneme-level analysis.

Model Features

Phoneme-level recognition

The model directly predicts International Phonetic Alphabet (IPA) phoneme sequences rather than traditional word sequences, making it suitable for speech processing tasks requiring phoneme analysis.

High accuracy

Achieves a phoneme error rate (PER) of 0.99% and a character error rate (CER) of 0.58% on the LJSpeech Phonemes test set, both measured with stress marks ignored.

Based on Gruut phoneme set

Uses the International Phonetic Alphabet (IPA) phoneme set from the gruut project, supporting rich phoneme representation.

Model Capabilities

Speech to phoneme

Automatic speech recognition

Phoneme-level analysis

Use Cases

Speech processing

Phoneme analysis research

Used in linguistic research to analyze the phonemic composition of speech

Provides precise phoneme-level transcriptions

Speech synthesis preprocessing

Provides phoneme-level input for speech synthesis systems

Improves the accuracy and naturalness of synthesized speech

🚀 Wav2Vec2 LJSpeech Gruut

Wav2Vec2 LJSpeech Gruut is an automatic speech recognition model based on the wav2vec 2.0 architecture. It is fine-tuned from Wav2Vec2-Base on the LJSpeech Phonemes dataset and predicts phoneme sequences instead of word sequences.

🚀 Quick Start

Prerequisites

Make sure you have installed the necessary libraries such as transformers, librosa, torch, and datasets.

Example Usage

from transformers import AutoProcessor, AutoModelForCTC, Wav2Vec2Processor
import librosa
import torch
from itertools import groupby
from datasets import load_dataset

def decode_phonemes(
    ids: torch.Tensor, processor: Wav2Vec2Processor, ignore_stress: bool = False
) -> str:
    """CTC-like decoding. First removes consecutive duplicates, then removes special tokens."""
    # removes consecutive duplicates
    ids = [id_ for id_, _ in groupby(ids)]

    special_token_ids = processor.tokenizer.all_special_ids + [
        processor.tokenizer.word_delimiter_token_id
    ]
    # converts id to token, skipping special tokens
    phonemes = [processor.decode(id_) for id_ in ids if id_ not in special_token_ids]

    # joins phonemes
    prediction = " ".join(phonemes)

    # whether to ignore IPA stress marks
    if ignore_stress:
        prediction = prediction.replace("ˈ", "").replace("ˌ", "")

    return prediction

checkpoint = "bookbot/wav2vec2-ljspeech-gruut"

model = AutoModelForCTC.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)
sr = processor.feature_extractor.sampling_rate

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_array = ds[0]["audio"]["array"]

# or, read a single audio file
# audio_array, _ = librosa.load("myaudio.wav", sr=sr)

inputs = processor(audio_array, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs["input_values"]).logits

predicted_ids = torch.argmax(logits, dim=-1)
prediction = decode_phonemes(predicted_ids[0], processor, ignore_stress=True)
# => should give 'b ɪ k ʌ z j u ɚ z s l i p ɪ ŋ ɪ n s t ɛ d ə v k ɔ ŋ k ɚ ɪ ŋ ð ə l ʌ v l i ɹ z p ɹ ɪ n s ə s h æ z b ɪ k ʌ m ə v f ɪ t ə l w ɪ θ n b oʊ p ɹ ə ʃ æ ɡ i s ɪ t s ð ɛ ɹ ə k u ɪ ŋ d ʌ v'
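
The two steps named in the `decode_phonemes` docstring (collapse consecutive duplicates, then drop special tokens) can be illustrated without loading the model. The sketch below is a minimal illustration only: the vocabulary, the frame predictions, and the helper name `ctc_greedy_collapse` are hypothetical, and the real id-to-phoneme mapping comes from the checkpoint's vocab.json.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_token, skip_ids):
    """Collapse repeated frame predictions, then drop special tokens."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [id_ for id_, _ in groupby(frame_ids)]
    # Step 2: remove special ids (such as the padding/blank id) and map to tokens.
    return [id_to_token[id_] for id_ in collapsed if id_ not in skip_ids]


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {0: "<pad>", 1: "h", 2: "ə", 3: "l", 4: "oʊ"}

# Hypothetical per-frame argmax output. The 0 between the two 3s keeps
# the repeated "l" from being merged into one.
frames = [0, 1, 1, 0, 2, 2, 2, 3, 0, 3, 4, 4, 0]

phonemes = ctc_greedy_collapse(frames, toy_vocab, skip_ids={0})
print(" ".join(phonemes))  # h ə l l oʊ
```

The order of the two steps matters: dropping special tokens first would place the two `l` frames next to each other, and the collapse step would then merge them.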

✨ Features

Phoneme Prediction: Trained to predict sequences of phonemes instead of words.
Fine-Tuned Model: Based on the Wav2Vec2-Base model, fine-tuned on the LJSpeech Phonemes dataset.

📦 Installation

The original README does not list installation steps. Based on the usage example, the required libraries can be installed with pip (or the equivalent conda packages):

pip install transformers librosa torch datasets
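
After installing, a quick way to confirm that the libraries are importable is to look them up with the standard library. This is a minimal sketch; the helper name `missing_packages` is hypothetical, and the check only confirms that each package can be found, not that its version matches the framework versions listed for this model.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names of top-level packages that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Libraries imported by the usage example.
required = ["transformers", "librosa", "torch", "datasets"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```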

💻 Usage Examples

Basic Usage

The basic usage code is identical to the Quick Start example above.

Advanced Usage

The original README does not include an advanced usage example. To explore the model further, you can run inference on other datasets or fine-tune it with different training hyperparameters.

📚 Documentation

Model Information

Property	Details
Model Type	Automatic Speech Recognition (Phoneme Prediction)
Base Model	Wav2Vec2-Base
Training Data	LJSpeech Phonemes
Vocabulary	vocab.json

Evaluation Results

Dataset	Test PER (w/o stress)	Test CER (w/o stress)
LJSpeech Phonemes Test Data	0.0099	0.0058
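
For readers who want to score their own predictions, PER is commonly computed as the token-level edit distance between the reference and predicted phoneme sequences, divided by the reference length. The sketch below assumes space-separated phoneme strings, as produced by `decode_phonemes`. The source does not state which tool produced the numbers above, so this is an illustration of the metric rather than the exact evaluation script, and the function names and example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(reference, hypothesis):
    """PER = edit distance over phoneme tokens / number of reference phonemes."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: the prediction drops one of eight reference phonemes.
reference = "h ə l oʊ w ɜ l d"
hypothesis = "h ə l oʊ w ɜ d"
per = phoneme_error_rate(reference, hypothesis)
print(f"{per:.3f}")  # 0.125
```

To match the "w/o stress" columns, stress marks would be stripped from both strings before scoring, for example by decoding with `ignore_stress=True`.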

🔧 Technical Details

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 30.0
mixed_precision_training: Native AMP
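
Two of these values follow from the others and can be checked with a few lines of arithmetic. The sketch below assumes a single training device, a total of 10440 optimizer steps (the last step in the training results table), and a linear schedule that warms up from zero and then decays linearly to zero. The decay-to-zero shape is an assumption about how `lr_scheduler_type: linear` was applied, and the function name is hypothetical.

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=1000, total_steps=10440):
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# total_train_batch_size = train_batch_size * gradient_accumulation_steps
# (assuming one device).
train_batch_size = 16
gradient_accumulation_steps = 2
effective_batch_size = train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 32

for step in (0, 500, 1000, 5720, 10440):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 0.0001 at step 1000 and reaches zero at the final step.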

Training results

Training Loss	Epoch	Step	Validation Loss	WER	CER
No log	1.0	348	2.2818	1.0	1.0
2.6692	2.0	696	0.2045	0.0527	0.0299
0.2225	3.0	1044	0.1162	0.0319	0.0189
0.2225	4.0	1392	0.0927	0.0235	0.0147
0.0868	5.0	1740	0.0797	0.0218	0.0143
0.0598	6.0	2088	0.0715	0.0197	0.0128
0.0598	7.0	2436	0.0652	0.0160	0.0103
0.0447	8.0	2784	0.0571	0.0152	0.0095
0.0368	9.0	3132	0.0608	0.0163	0.0112
0.0368	10.0	3480	0.0586	0.0137	0.0083
0.0303	11.0	3828	0.0641	0.0141	0.0085
0.0273	12.0	4176	0.0656	0.0131	0.0079
0.0232	13.0	4524	0.0690	0.0133	0.0082
0.0232	14.0	4872	0.0598	0.0128	0.0079
0.0189	15.0	5220	0.0671	0.0121	0.0074
0.017	16.0	5568	0.0654	0.0114	0.0069
0.017	17.0	5916	0.0751	0.0118	0.0073
0.0146	18.0	6264	0.0653	0.0112	0.0068
0.0127	19.0	6612	0.0682	0.0112	0.0069
0.0127	20.0	6960	0.0678	0.0114	0.0068
0.0114	21.0	7308	0.0656	0.0111	0.0066
0.0101	22.0	7656	0.0669	0.0109	0.0066
0.0092	23.0	8004	0.0677	0.0108	0.0065
0.0092	24.0	8352	0.0653	0.0104	0.0063
0.0088	25.0	8700	0.0673	0.0102	0.0063
0.0074	26.0	9048	0.0669	0.0105	0.0064
0.0074	27.0	9396	0.0707	0.0101	0.0061
0.0066	28.0	9744	0.0673	0.0100	0.0060
0.0058	29.0	10092	0.0689	0.0100	0.0059
0.0058	30.0	10440	0.0683	0.0099	0.0058

📄 License

This model is licensed under the Apache 2.0 license.

⚠️ Important Note

Consider the biases from the pre-training datasets, which may carry over into the results of this model.

👥 Authors

Wav2Vec2 LJSpeech Gruut was trained and evaluated by Wilson Wongso. All computation and development were done on Google Cloud.

🛠️ Framework versions

Transformers 4.26.0.dev0
Pytorch 1.10.0
Datasets 2.7.1
Tokenizers 0.13.2
Gruut 2.3.4
