xlsr-timit-b0: Open-Source Phoneme Transcription Model for Converting English Audio to Phonemes

Xlsr Timit B0

Developed by KoelLabs

A phoneme transcription model fine-tuned on the TIMIT dataset, capable of transcribing English audio into phoneme representations

Speech Recognition

Safetensors

English #English phoneme transcription #High-precision phonetic recognition #TIMIT dataset optimization

Downloads 40

Release Time : 11/30/2024

Model Overview

This model starts from the pre-trained checkpoint ginic/data_seed_4_wav2vec2-large-xlsr-buckeye-ipa and is fine-tuned on the DARPA TIMIT English corpus. It transcribes English audio into phoneme representations, and its developers report that it outperforms all other current XLSR-based IPA transcription models for English.

Model Features

High-precision phoneme transcription

Achieves an average character error rate (CER) of 0.113 on the TIMIT test set

English optimization

Specifically optimized for English speech with high phoneme transcription accuracy

Based on XLSR architecture

Built on the powerful wav2vec2-large-xlsr architecture with excellent speech feature extraction capabilities

Model Capabilities

English speech recognition

Phoneme transcription

Automatic speech transcription

Use Cases

Phonetics research

Phoneme analysis

Used for phoneme feature analysis in phonetics research

Provides accurate phoneme transcription results

Speech technology development

Speech recognition system development

Serves as a phoneme transcription component for speech recognition systems

Improves system accuracy in recognizing English phonemes

🚀 XLSR-TIMIT-B0: Fine-tuned on TIMIT for Phonemic Transcription

This model starts from a pre-trained checkpoint and is fine-tuned on TIMIT, a well-known English corpus, to transcribe English audio into phonemic representations.

🚀 Quick Start

To transcribe audio files, this model can be used as follows:

Basic Usage

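from transformers import AutoModelForCTC, AutoProcessor
import librosa  # assumption: any loader that yields a 16 kHz mono float array also works
import torch

# Load model and processor
model = AutoModelForCTC.from_pretrained("KoelLabs/xlsr-timit-b0")
processor = AutoProcessor.from_pretrained("KoelLabs/xlsr-timit-b0")

# Prepare input: the processor expects raw audio samples, not a file path,
# so load the file as a mono waveform resampled to 16 kHz first
audio_path = "path_to_your_audio_file.wav"  # Replace with your file
waveform, _ = librosa.load(audio_path, sr=16000)
input_values = processor(waveform, return_tensors="pt", sampling_rate=16000).input_values

# Retrieve logits (one score per vocabulary token for every audio frame)
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions: pick the best token per frame, then collapse to a phoneme string
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)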
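
The last two steps above (argmax over the logits, then batch_decode) amount to greedy CTC decoding: consecutive repeated frame predictions are collapsed and blank tokens are dropped. The following is a minimal pure-Python sketch of that idea, not the library's actual implementation. The toy vocabulary, the frame sequence, and the choice of id 0 as the blank are all hypothetical; in wav2vec2-style tokenizers the padding token typically plays the role of the blank.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse runs of identical frame predictions, then drop CTC blanks."""
    # groupby yields one key per run of consecutive equal ids
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary; the real model has its own IPA vocabulary.
toy_vocab = {0: "<pad>", 1: "l", 2: "i", 3: "z"}

# Hypothetical per-frame argmax ids for a short stretch of audio.
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

decoded = ctc_greedy_decode(frames, toy_vocab)
print(decoded)  # liz
```

Because repeats are collapsed before blanks are removed, a blank between two identical ids is what lets the same phoneme appear twice in a row in the output.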

✨ Features

This model leverages the pre-trained checkpoint ginic/data_seed_4_wav2vec2-large-xlsr-buckeye-ipa and is fine-tuned on the TIMIT DARPA English corpus for English phonemic transcription.
All code is available on GitHub.
It outperforms all current XLSR IPA transcription models for English.

📚 Documentation

Performance

Training Loss: 1.254
Validation Loss: 0.267
Test Results (TIMIT test set):
- Average Weighted Distance: 13.309375
- Standard Deviation (Weighted Distance): 9.87
- Average Character Error Rate (CER): 0.113
- Standard Deviation (CER): 0.06


Model Information

Property	Details
Number of Epochs	40
Learning Rate	8e-5
Optimizer	Adam
Datasets Used	TIMIT (DARPA English corpus)

Example Outputs

Prediction: lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi
Ground Truth: lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi
Weighted Feature Edit Distance: 7.875
CER: 0.0556

Prediction: ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts
Ground Truth: ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts
Weighted Feature Edit Distance: 2.375
CER: 0.0588
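
The CER values in these examples are consistent with the usual definition: character-level Levenshtein distance divided by the length of the ground truth. The sketch below assumes that definition (the model's own evaluation code may differ in details such as normalization) and uses illustrative helper names, edit_distance and cer, that are not taken from the source. It does not attempt to reproduce the Weighted Feature Edit Distance.

```python
def edit_distance(ref, hyp):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)


# The two prediction / ground-truth pairs listed above.
pred_1 = "lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi"
truth_1 = "lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi"
pred_2 = "ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts"
truth_2 = "ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts"

print(round(cer(truth_1, pred_1), 4))  # 0.0556
print(round(cer(truth_2, pred_2), 4))  # 0.0588
```

In the first pair the prediction drops one ə and substitutes θ for t, giving 2 edits over 36 reference characters; in the second, two substitutions over 34 characters.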

🔧 Technical Details

The model is based on the pre-trained checkpoint ginic/data_seed_4_wav2vec2-large-xlsr-buckeye-ipa and fine-tuned on the TIMIT DARPA English corpus. It was trained for 40 epochs using the Adam optimizer with a learning rate of 8e-5.

📄 License

This model is licensed under the MPL-2.0 license.
