# AST Fine-Tuned Model for Emotion Classification
This is a fine-tuned Audio Spectrogram Transformer (AST) model for classifying emotions in speech audio. It was fine-tuned on the CREMA-D dataset and covers six emotional categories, starting from MIT's pretrained AST checkpoint.
## Quick Start
To use this model for emotion classification, you can follow these steps:
```python
import librosa
import torch
from transformers import AutoModelForAudioClassification, AutoProcessor

model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio file as a mono waveform resampled to 16 kHz
waveform, sampling_rate = librosa.load("path_to_audio.wav", sr=16000)
inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```
## Features
- Emotion Classification: Specifically designed to classify six different emotions in speech audio.
- Fine-Tuned on CREMA-D: Trained on a well-known emotion-labeled speech dataset.
- Based on AST Architecture: Utilizes the Audio Spectrogram Transformer architecture.
## Installation
The model is used through the transformers library, which can be installed with:

```bash
pip install transformers
```
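The Quick Start example above additionally uses torch for inference and librosa for audio loading (librosa is just one convenient option; any loader that yields a 16 kHz mono waveform works):

```bash
pip install torch librosa
```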
## Documentation

### Model Details
| Property | Details |
|---|---|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Fine-Tuned Dataset | CREMA-D |
| Architecture | Audio Spectrogram Transformer (AST) |
| Model Type | Single-label classification |
| Input Features | Log-Mel Spectrograms (128 mel bins) |
| Output Classes | ANG: Anger, DIS: Disgust, FEA: Fear, HAP: Happiness, NEU: Neutral, SAD: Sadness |
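If you need the integer class indices programmatically, read them from the config rather than hard-coding them; the ordering below is only an assumed illustration:

```python
# Assumed index order for illustration only -- the authoritative mapping
# is model.config.id2label on the hosted checkpoint.
id2label = {
    0: "ANG",  # Anger
    1: "DIS",  # Disgust
    2: "FEA",  # Fear
    3: "HAP",  # Happiness
    4: "NEU",  # Neutral
    5: "SAD",  # Sadness
}
```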
### Model Configuration
| Property | Details |
|---|---|
| Hidden Size | 768 |
| Number of Attention Heads | 12 |
| Number of Hidden Layers | 12 |
| Patch Size | 16 |
| Maximum Length | 1024 |
| Dropout Probability | 0.0 |
| Activation Function | GELU (Gaussian Error Linear Unit) |
| Optimizer | Adam |
| Learning Rate | 1e-4 |
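The architectural values can be verified directly from the hosted config, for example:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("forwarder1121/ast-finetuned-model")
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
print(config.num_hidden_layers)    # 12
print(config.patch_size)           # 16
print(config.max_length)           # 1024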
### Training Details
| Property | Details |
|---|---|
| Dataset | CREMA-D (emotion-labeled speech data) |
| Data Augmentation | Noise injection, time shifting, speed perturbation |
| Fine-Tuning Epochs | 5 |
| Batch Size | 16 |
| Learning Rate Scheduler | Linear decay |
| Best Validation Accuracy | 60.71% |
| Best Checkpoint | ./results/checkpoint-1119 |
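The table above maps onto transformers TrainingArguments roughly as follows (a sketch; values not listed in the card, and the surrounding Trainer setup, are assumptions):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # matches the reported checkpoint path
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    lr_scheduler_type="linear",      # linear decay
)
```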
### Metrics

#### Validation Results
- Best Validation Accuracy: 60.71%
- Validation Loss: 1.1126
#### Evaluation Details
- Eval Dataset: CREMA-D test split
- Batch Size: 16
- Number of Steps: 94
## Technical Details
The model is fine-tuned on the CREMA-D dataset using data augmentation techniques such as noise injection, time shifting, and speed perturbation. It uses the Adam optimizer with a learning rate of 1e-4 and a linear decay learning rate scheduler. The architecture is an Audio Spectrogram Transformer (AST) with a hidden size of 768, 12 attention heads, and 12 hidden layers.
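The exact augmentation pipeline is not published with the card; the following is an illustrative sketch of the three techniques named above, with all magnitudes (noise level, shift range, speed range) chosen as assumptions:

```python
import numpy as np

def augment(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Noise injection, time shifting, and speed perturbation (illustrative values)."""
    # Noise injection: add low-amplitude Gaussian noise
    waveform = waveform + 0.005 * np.random.randn(len(waveform))
    # Time shifting: circularly shift the signal by up to +/-100 ms
    shift = np.random.randint(-sr // 10, sr // 10 + 1)
    waveform = np.roll(waveform, shift)
    # Speed perturbation: linearly resample to 0.9x-1.1x speed
    rate = np.random.uniform(0.9, 1.1)
    new_idx = np.arange(0, len(waveform) - 1, rate)
    waveform = np.interp(new_idx, np.arange(len(waveform)), waveform)
    return waveform.astype(np.float32)
```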
## License
The model is shared under the MIT License. Refer to the licensing details in the repository.
## Citation
If you use this model in your work, please cite:
```bibtex
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title  = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year   = {2024},
  url    = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
```
## Contact

For questions, reach out to forwarder1121@naver.com.