Open-source Model wav2vec2-xls-r-300m-timit-phoneme: Accurate Phoneme-Level Recognition of English Speech

Wav2vec2 Xls R 300m Timit Phoneme

Developed by vitouphy

This automatic phoneme recognition model is facebook/wav2vec2-xls-r-300m fine-tuned on the TIMIT dataset. It is primarily used for phoneme-level recognition of English speech.

Speech Recognition

Transformers

English
Open Source License: Apache-2.0
#English Phoneme Recognition #Low CER Accuracy #TIMIT Dataset

Downloads 8,457

Release Time: 5/8/2022

Model Overview

This model is specifically designed for English phoneme recognition tasks, trained on the TIMIT dataset, and capable of converting speech signals into corresponding phoneme sequences.

Model Features

High-Accuracy Phoneme Recognition

Achieves a character error rate (CER) of 7.996% on the TIMIT test set.

Based on Large-Scale Pretrained Model

Fine-tuned from the facebook/wav2vec2-xls-r-300m model, inheriting its powerful speech feature extraction capabilities.

End-to-End Processing Capability

Processes raw audio waveforms directly (sampled at 16 kHz, as in the usage examples), with no hand-crafted feature extraction step.

Model Capabilities

English Phoneme Recognition

Speech Signal Processing

End-to-End Speech Recognition

Use Cases

Phonetics Research

Phoneme Analysis

Used in phonetics research to analyze pronunciation features and phoneme distribution.

Speech Recognition System Development

Speech Recognition Frontend

Serves as the phoneme recognition component in speech recognition systems.

🚀 wav2vec2-xls-r-300m-phoneme

This is a fine-tuned speech recognition model based on the wav2vec2-xls-r-300m architecture, trained on the TIMIT dataset to achieve high-quality phoneme recognition.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the TIMIT dataset. See this notebook for training details.

💻 Usage Examples

Basic Usage

With Hugging Face's pipeline, a single call covers everything end-to-end, from raw audio input to text output.

from transformers import pipeline

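# Load the model (the task is stated explicitly; it can also be inferred from the model)
pipe = pipeline("automatic-speech-recognition", model="vitouphy/wav2vec2-xls-r-300m-timit-phoneme")
# Process raw audio in 10 s chunks, with 4 s of left and 2 s of right overlap
output = pipe("audio_file.wav", chunk_length_s=10, stride_length_s=(4, 2))
# The result is a dict; the recognized phoneme string is under the "text" key
print(output["text"])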
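
To make the effect of chunk_length_s=10 and stride_length_s=(4, 2) more concrete, here is a minimal sketch of the window arithmetic. It assumes each chunk advances by the chunk length minus both strides, and that predictions inside the overlapping strides are discarded except at the very start and end of the audio. The function name chunk_windows is hypothetical, and the code illustrates the idea rather than reproducing the library's implementation.

```python
def chunk_windows(total_s, chunk_length_s=10.0, stride_length_s=(4.0, 2.0)):
    """Return (start, end, keep_start, keep_end) in seconds for each chunk.

    start/end delimit the audio fed to the model; keep_start/keep_end delimit
    the part of that chunk whose predictions are kept (illustrative assumption).
    """
    left, right = stride_length_s
    step = chunk_length_s - left - right  # how far each new chunk advances
    if step <= 0:
        raise ValueError("chunk_length_s must exceed the sum of the strides")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        is_first = start == 0.0
        is_last = start + chunk_length_s >= total_s
        # The first chunk has no left context to drop, the last no right context.
        keep_start = start if is_first else start + left
        keep_end = end if is_last else end - right
        windows.append((start, end, keep_start, keep_end))
        if is_last:
            break
        start += step
    return windows


for window in chunk_windows(20.0):
    print(window)
```

For a 20-second clip this yields four chunks whose kept regions (0 to 8 s, 8 to 12 s, 12 to 16 s, 16 to 20 s) tile the audio without gaps, while every kept region except those at the clip edges has surrounding context on both sides.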

Advanced Usage

A more customizable way to predict phonemes, calling the processor and model directly.

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import soundfile as sf

# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("vitouphy/wav2vec2-xls-r-300m-timit-phoneme")
model = Wav2Vec2ForCTC.from_pretrained("vitouphy/wav2vec2-xls-r-300m-timit-phoneme")

# Read the input; the model expects 16 kHz mono audio
audio_input, sample_rate = sf.read("audio_file.wav")
# Passing the file's actual rate makes the processor raise an error on a mismatch
inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token per frame, then decode ids into strings
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
print(predicted_sentences)
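
The last two steps (argmax over the logits, then batch_decode) amount to greedy CTC decoding: consecutive repeated ids are merged and blank tokens are dropped. The sketch below shows that collapse rule in plain Python. The vocabulary toy_vocab, the blank id of 0, and the function name ctc_greedy_decode are hypothetical stand-ins for illustration; the real mapping lives in the model's tokenizer.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse per-frame argmax ids into a token sequence (greedy CTC).

    Consecutive duplicates are merged first, then blanks are removed, so a
    blank between two identical ids keeps both occurrences.
    """
    tokens = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            tokens.append(id_to_token[idx])
        previous = idx
    return tokens


# Hypothetical vocabulary; id 0 plays the role of the CTC blank token.
toy_vocab = {0: "<pad>", 1: "h", 2: "ɛ", 3: "l", 4: "oʊ"}
frames = [0, 1, 1, 0, 2, 2, 2, 3, 0, 3, 4, 4, 0]
print(ctc_greedy_decode(frames, toy_vocab))
```

Note that the two separate occurrences of id 3 survive because a blank frame sits between them, whereas the run of three 2s collapses to a single token.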

📚 Documentation

Training and evaluation data

We use the DARPA TIMIT dataset for this model.

The data is split 80/10/10 into training, validation, and test sets, respectively.
That corresponds to roughly 137/17/17 minutes of audio.
The model achieved a character error rate (CER) of 7.996% on this test set.
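
For readers unfamiliar with the metric, an error rate of this kind is commonly defined as the edit distance between the predicted and reference sequences divided by the reference length. The following is a minimal sketch of that definition. The function names are hypothetical and the example strings are toy data; it is not taken from the author's evaluation code.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current_row.append(min(
                previous_row[j] + 1,                      # deletion
                current_row[j - 1] + 1,                   # insertion
                previous_row[j - 1] + substitution_cost,  # match or substitution
            ))
        previous_row = current_row
    return previous_row[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example: 3 edits against a 6-character reference.
print(error_rate("kitten", "sitting"))
```

Because both functions accept any sequence, the same code works on strings (character level) or on lists of phoneme symbols (phoneme level).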

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 2000
training_steps: 10000
mixed_precision_training: Native AMP
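
Two of these values follow from the others, and a short sketch makes the relationships explicit. It assumes training on a single device (so the total batch size is the per-step batch size times the accumulation steps) and a linear schedule that ramps up from zero during warmup and then decays linearly to zero at the final step. The constant and function names are illustrative only.

```python
LEARNING_RATE = 3e-5
WARMUP_STEPS = 2000
TRAINING_STEPS = 10000
TRAIN_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 4


def effective_batch_size(per_step_batch, accumulation_steps, num_devices=1):
    """Samples contributing to one optimizer update (single device assumed)."""
    return per_step_batch * accumulation_steps * num_devices


def linear_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


print(effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))
for step in (0, 1000, 2000, 6000, 10000):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at 3e-05 at step 2000, is back to half that value at step 6000, and reaches zero at step 10000.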

Framework versions

Transformers 4.17.0.dev0
Pytorch 1.10.2+cu102
Datasets 1.18.2.dev0
Tokenizers 0.11.0

Citation

@misc{phy22-phoneme,
  author       = {Phy, Vitou},
  title        = {{Automatic Phoneme Recognition on TIMIT Dataset with Wav2Vec 2.0}},
  year         = 2022,
  note         = {{If you use this model, please cite it using these metadata.}},
  publisher    = {Hugging Face},
  version      = {1.0},
  doi          = {10.57967/hf/0125},
  url          = {https://huggingface.co/vitouphy/wav2vec2-xls-r-300m-timit-phoneme}
}

📄 License

This project is licensed under the Apache-2.0 license.
