Open-source xlsr-timit-a0 Model - Efficiently Convert English Audio into Phoneme Representations!

Xlsr Timit A0

Developed by KoelLabs

A phoneme transcription model fine-tuned on the TIMIT English corpus based on the XLSR pre-trained model, used to convert English audio into phoneme representations.

Speech Recognition

Safetensors

English #English phoneme transcription #Low CER recognition #TIMIT fine-tuning

Downloads 17

Release Time: 12/1/2024

Model Overview

This model is specifically designed for phoneme-level automatic speech recognition (ASR) of English audio, capable of converting speech signals into sequences of International Phonetic Alphabet (IPA) symbols.

Model Features

High-Accuracy Phoneme Transcription

Achieves an average Character Error Rate (CER) of 0.14 on the TIMIT test set

Professional Phonetic Annotation

Outputs International Phonetic Alphabet (IPA) symbols, suitable for phonetic research

Lightweight Fine-tuning

Efficient fine-tuning based on the pre-trained XLSR model, requiring only 40 training epochs

Model Capabilities

English speech recognition

Phoneme-level transcription

International Phonetic Alphabet conversion

Use Cases

Phonetics Research

Phoneme Analysis

Automatically generate phoneme annotations for speech samples

Provides speech analysis results accurate to the phoneme level

Speech Technology Development

ASR System Pre-training

Used as a phoneme feature extractor for speech recognition systems

Improves performance in downstream ASR tasks

🚀 XLSR-TIMIT-B0: Fine-tuned on TIMIT for Phonemic Transcription

This model fine-tunes a pre-trained checkpoint on the TIMIT corpus to transcribe English audio into phonemic representations.

🚀 Quick Start

To transcribe audio files, this model can be used as follows:

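from transformers import AutoModelForCTC, AutoProcessor
import librosa  # extra dependency, used here only to read and resample the audio file
import torch

# Load model and processor
model = AutoModelForCTC.from_pretrained("KoelLabs/xlsr-timit-b0")
processor = AutoProcessor.from_pretrained("KoelLabs/xlsr-timit-b0")

# Prepare input: the processor expects waveform samples, not a file path,
# so load the file as a 16 kHz mono waveform first
audio_path = "path_to_your_audio_file.wav"  # Replace with your file
waveform, _ = librosa.load(audio_path, sr=16000)
input_values = processor(waveform, return_tensors="pt", sampling_rate=16000).input_values

# Retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions: take the most likely token per frame, then CTC-decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)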
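
The decoding step above relies on greedy CTC decoding: the per-frame argmax ids are collapsed by merging consecutive repeats and then dropping the blank token. The following is a minimal sketch of that collapse on toy data, not the library's implementation. The function name `ctc_greedy_collapse` and the blank id of 0 are assumptions made for illustration; the real blank id depends on the model's vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a label sequence (greedy CTC).

    blank_id=0 is an illustrative assumption; the actual blank id
    depends on the tokenizer's vocabulary.
    """
    # Step 1: merge runs of identical consecutive ids into a single id.
    merged = [token for token, _ in groupby(frame_ids)]
    # Step 2: drop the blank token, which only separates repeated labels.
    return [token for token in merged if token != blank_id]


# Toy frame sequence: the blank between the two runs of 5 keeps them distinct.
frames = [0, 5, 5, 0, 5, 3, 3, 0]
print(ctc_greedy_collapse(frames))  # [5, 5, 3]
```

Because repeats are merged before blanks are removed, a label that genuinely occurs twice in a row survives only when a blank frame separates the two runs.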

✨ Features

This model starts from the pre-trained checkpoint ginic/hyperparam_tuning_1_wav2vec2-large-xlsr-buckeye-ipa and is fine-tuned on the DARPA TIMIT English corpus to transcribe English audio into phonemic representations.

📚 Documentation

Performance

Training Loss: 4.73
Validation Loss: 1.048
Test Results (TIMIT test set):
- Average Weighted Distance: 18.06
- Standard Deviation (Weighted Distance): 12.9
- Average Character Error Rate (CER): 0.14
- Standard Deviation (CER): 0.07

Model Information

Property	Details
Base Model	ginic/hyperparam_tuning_1_wav2vec2-large-xlsr-buckeye-ipa
Language	English
License	mpl-2.0
Metrics	CER
Pipeline Tag	Automatic Speech Recognition
Number of Epochs	40
Learning Rate	5e-6
Optimizer	Adam
Datasets Used	TIMIT (DARPA English corpus)

Example Outputs

Prediction: lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi
Ground Truth: lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi
Weighted Feature Edit Distance: 7.875
CER: 0.0556

Prediction: ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts
Ground Truth: ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts
Weighted Feature Edit Distance: 2.375
CER: 0.0588
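
The CER values in these examples are consistent with the usual definition: the character-level Levenshtein distance between prediction and ground truth, divided by the length of the ground truth. The sketch below assumes that definition (the model card does not state the exact formula) and uses hypothetical helper names `edit_distance` and `cer`. It does not cover the weighted feature edit distance, which depends on phonological feature weights not given here.

```python
def edit_distance(hyp, ref):
    """Character-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(ref) + 1))  # distances from the empty prefix of hyp
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a character from hyp
                            curr[j - 1] + 1,      # add a character to hyp
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(prediction, ground_truth):
    """Assumed CER definition: edit distance divided by reference length."""
    return edit_distance(prediction, ground_truth) / len(ground_truth)


examples = [
    ("lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi",
     "lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi"),
    ("ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts",
     "ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts"),
]
for prediction, ground_truth in examples:
    print(f"{cer(prediction, ground_truth):.4f}")
# 0.0556
# 0.0588
```

In the first pair the prediction drops one ə and substitutes θ for t (2 edits over 36 reference characters); in the second it substitutes ɾ for ŋ and m for n (2 edits over 34).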

🔧 Technical Details

This phonemic transcription model is fine-tuned on an English speech corpus that does not cover all dialects and languages. We acknowledge that it may significantly underperform on dialects and languages it has not seen. We aim to release models and datasets that better serve all populations and languages in the future.

📄 License

This model is released under the MPL-2.0 license.
