# AST Fine-Tuned Model for Emotion Classification
This is a fine-tuned Audio Spectrogram Transformer (AST) model for classifying emotions in speech audio. It was fine-tuned on the CREMA-D dataset and covers six emotional categories, starting from MIT's pretrained AST checkpoint.
## Quick Start
To use this model for emotion classification, you can follow these steps:
```python
import librosa
import torch
from transformers import AutoModelForAudioClassification, AutoProcessor

model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio file as a mono waveform resampled to 16 kHz
waveform, sampling_rate = librosa.load("path_to_audio.wav", sr=16000)
inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```
## Features
- Emotion Classification: Specifically designed to classify six different emotions in speech audio.
- Fine-Tuned on CREMA-D: Trained on a well-known emotion-labeled speech dataset.
- Based on AST Architecture: Utilizes the Audio Spectrogram Transformer architecture.
## Installation
The model is used through the transformers library, which can be installed with:

```bash
pip install transformers
```
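The Quick Start example above additionally uses torch for inference and librosa for audio loading (librosa is just one convenient option; any loader that yields a 16 kHz mono waveform works):

```bash
pip install torch librosa
```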
## Documentation

### Model Details
| Property | Details |
|---|---|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Fine-Tuned Dataset | CREMA-D |
| Architecture | Audio Spectrogram Transformer (AST) |
| Model Type | Single-label classification |
| Input Features | Log-Mel Spectrograms (128 mel bins) |
| Output Classes | ANG: Anger, DIS: Disgust, FEA: Fear, HAP: Happiness, NEU: Neutral, SAD: Sadness |
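If you need the integer class indices programmatically, read them from the config rather than hard-coding them; the ordering below is only an assumed illustration:

```python
# Assumed index order for illustration only -- the authoritative mapping
# is model.config.id2label on the hosted checkpoint.
id2label = {
    0: "ANG",  # Anger
    1: "DIS",  # Disgust
    2: "FEA",  # Fear
    3: "HAP",  # Happiness
    4: "NEU",  # Neutral
    5: "SAD",  # Sadness
}
```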
### Model Configuration
| Property | Details |
|---|---|
| Hidden Size | 768 |
| Number of Attention Heads | 12 |
| Number of Hidden Layers | 12 |
| Patch Size | 16 |
| Maximum Length | 1024 |
| Dropout Probability | 0.0 |
| Activation Function | GELU (Gaussian Error Linear Unit) |
| Optimizer | Adam |
| Learning Rate | 1e-4 |
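The architectural values can be verified directly from the hosted config, for example:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("forwarder1121/ast-finetuned-model")
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
print(config.num_hidden_layers)    # 12
print(config.patch_size)           # 16
print(config.max_length)           # 1024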
### Training Details
| Property | Details |
|---|---|
| Dataset | CREMA-D (emotion-labeled speech data) |
| Data Augmentation | Noise injection, time shifting, speed perturbation |
| Fine-Tuning Epochs | 5 |
| Batch Size | 16 |
| Learning Rate Scheduler | Linear decay |
| Best Validation Accuracy | 60.71% |
| Best Checkpoint | ./results/checkpoint-1119 |
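The table above maps onto transformers TrainingArguments roughly as follows (a sketch; values not listed in the card, and the surrounding Trainer setup, are assumptions):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # matches the reported checkpoint path
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=1e-4,
    lr_scheduler_type="linear",      # linear decay
)
```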
### Metrics

#### Validation Results
- Best Validation Accuracy: 60.71%
- Validation Loss: 1.1126
#### Evaluation Details
- Eval Dataset: CREMA-D test split
- Batch Size: 16
- Number of Steps: 94
## Technical Details
The model is fine-tuned on the CREMA-D dataset using data augmentation techniques such as noise injection, time shifting, and speed perturbation. It uses the Adam optimizer with a learning rate of 1e-4 and a linear decay learning rate scheduler. The architecture is an Audio Spectrogram Transformer (AST) with a hidden size of 768, 12 attention heads, and 12 hidden layers.
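The exact augmentation pipeline is not published with the card; the following is an illustrative sketch of the three techniques named above, with all magnitudes (noise level, shift range, speed range) chosen as assumptions:

```python
import numpy as np

def augment(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Noise injection, time shifting, and speed perturbation (illustrative values)."""
    # Noise injection: add low-amplitude Gaussian noise
    waveform = waveform + 0.005 * np.random.randn(len(waveform))
    # Time shifting: circularly shift the signal by up to +/-100 ms
    shift = np.random.randint(-sr // 10, sr // 10 + 1)
    waveform = np.roll(waveform, shift)
    # Speed perturbation: linearly resample to 0.9x-1.1x speed
    rate = np.random.uniform(0.9, 1.1)
    new_idx = np.arange(0, len(waveform) - 1, rate)
    waveform = np.interp(new_idx, np.arange(len(waveform)), waveform)
    return waveform.astype(np.float32)
```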
## License
The model is shared under the MIT License. Refer to the licensing details in the repository.
## Citation
If you use this model in your work, please cite:
```bibtex
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title  = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year   = {2024},
  url    = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
```
## Contact

For questions, reach out to forwarder1121@naver.com.