Speech_Emotion_Recognition_wav2vec2: Open-Source Speech Emotion Recognition Model Supporting 7 Emotion Classes

Speech Emotion Recognition Wav2vec2 Large Xlsr 53 240304 SER Fine Tuned2.0

Developed by hughlan1214

A speech emotion recognition model based on wav2vec2-large-xlsr-53, supporting 7 emotion classifications

Audio Classification

Transformers

Open Source License: Apache-2.0

Tags: Speech Emotion Recognition, Multilingual Support, Real-time Emotion Inference

Downloads 145

Release Date: March 4, 2024

Model Overview

This model, fine-tuned from facebook/wav2vec2-large-xlsr-53, can identify 7 types of emotions in speech (anger, disgust, fear, happiness, neutral, sadness, surprise), providing a foundation for multimodal emotion analysis.

Model Features

Cross-lingual Capability

Although trained only on English data, the model performed well on Chinese and French speech in post-release testing. No cross-lingual metrics are reported.

Multi-emotion Classification

Capable of identifying 7 different basic human emotional states

Multi-dataset Fusion Training

Trained on fused data from four mainstream speech emotion datasets: Crema, Ravdess, Savee, and Tess

Model Capabilities

Speech Emotion Recognition

Cross-lingual Emotion Analysis

Real-time Emotion Inference

Use Cases

Human-Computer Interaction

Intelligent Customer Service Emotion Analysis

Real-time analysis of emotional states in customer speech

Improves customer service response quality and user experience

Mental Health

Emotional State Monitoring

Analyzing user emotional changes through speech

Assists in mental health assessments

🚀 SER_wav2vec2-large-xlsr-53_240304_fine-tuned_2

This model is fine-tuned for speech emotion recognition. It builds on a pre-trained model, was fine-tuned on English datasets, and shows cross-lingual capability.

🚀 Quick Start

This model is a fine-tuned version of hughlan1214/SER_wav2vec2-large-xlsr-53_240304_fine-tuned1.1 on a Speech Emotion Recognition (en) dataset.

The dataset combines four popular English speech emotion datasets (Crema, Ravdess, Savee, and Tess), with over 12,000 .wav audio files in total. Each of the four datasets uses 6 to 8 emotion labels.

On the evaluation set, it achieves the following results:

Loss: 1.0601
Accuracy: 0.6731
Precision: 0.6761
Recall: 0.6794
F1: 0.6738
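The card does not say how precision, recall, and F1 were averaged across the 7 classes. One hint: the reported F1 (0.6738) is lower than both precision (0.6761) and recall (0.6794). A harmonic mean of those two aggregate values would fall between them, so the F1 is more consistent with an average of per-class F1 scores. The sketch below computes macro-averaged metrics with the standard library only. Macro averaging is an assumption, and the toy labels are invented for illustration.

```python
from collections import Counter

def macro_metrics(y_true, y_pred, labels):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example (invented labels, not model output)
toy_labels = ["angry", "happy", "sad"]
toy_true = ["angry", "angry", "happy", "happy", "sad", "sad"]
toy_pred = ["angry", "happy", "happy", "happy", "sad", "angry"]
acc, prec, rec, f1 = macro_metrics(toy_true, toy_pred, toy_labels)
```

In this toy case the macro F1 also ends up below both the macro precision and the macro recall, the same pattern as in the reported results.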

✨ Features

Model description

The model was obtained by using facebook/wav2vec2-large-xlsr-53 for feature extraction, followed by several rounds of fine-tuning. It predicts 7 types of emotions in speech. The aim is to lay a foundation for real-time inference of user emotions that combines speech with visual micro-expressions and with contextual semantics from LLMs.

Although the model was trained purely on English datasets, post-release testing showed that it also predicts emotions well in Chinese and French, which demonstrates the strong cross-lingual capability of the facebook/wav2vec2-large-xlsr-53 pre-trained model.

emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
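A classifier of this kind typically emits one logit per class, and the predicted emotion is the class with the highest softmax probability. The sketch below shows that decoding step in plain Python. It assumes the class indices follow the order of the list above, which the card does not confirm; the checkpoint's own `id2label` mapping should be treated as authoritative. The logits are invented for illustration.

```python
import math

EMOTIONS = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits, labels=EMOTIONS):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Invented logits, one per emotion, in the assumed index order
example_logits = [0.2, -1.0, 0.1, 2.5, 0.7, -0.3, 0.0]
label, prob = decode(example_logits)
```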

Intended uses & limitations

More information needed

📦 Installation

No installation steps provided in the original document.

💻 Usage Examples

No specific code examples for usage are provided in the original document.
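As a starting point, here is a minimal sketch of inference with the Transformers `audio-classification` pipeline. It assumes `transformers` and `torch` are installed, that ffmpeg is available for decoding audio files, and that the checkpoint loads with the standard pipeline. The Hub repository ID is not spelled out in this document, so `model_id` is left as an argument. The function names `classify_file` and `top_emotion` are illustrative, not part of any library.

```python
def classify_file(audio_path, model_id, top_k=7):
    """Run the audio-classification pipeline on one file.

    model_id must be the Hub repository ID of this model (not given here).
    """
    # Imported lazily so top_emotion() works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_id)
    # Returns a list of {"label": ..., "score": ...} dicts, best first.
    return classifier(audio_path, top_k=top_k)

def top_emotion(predictions):
    """Pick the (label, score) pair with the highest score."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]

# Hypothetical usage (requires network access and the real repository ID):
# predictions = classify_file("sample.wav", model_id="<Hub repository ID>")
# print(top_emotion(predictions))
```

Building the pipeline once and reusing it across files avoids reloading the model on every call, which matters for the real-time use cases described above.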

📚 Documentation

Training and evaluation data

The full dataset was split 70/30 into training and evaluation sets.
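The card gives only the split ratio. A minimal sketch of one way to produce such a split follows, using the standard library. Stratifying by emotion label and reusing seed 42 for the split are both assumptions (the card lists the seed only as a training hyperparameter), and the file names are made up.

```python
import random
from collections import defaultdict

def stratified_split(items, train_fraction=0.7, seed=42):
    """Split (path, label) pairs so each label keeps roughly the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[item[1]].append(item)

    train, evaluation = [], []
    for label in sorted(by_label):  # sorted for a reproducible order
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        evaluation.extend(group[cut:])
    return train, evaluation

# Made-up file list: 10 clips for each of three labels
files = [(f"clip_{label}_{i}.wav", label)
         for label in ("angry", "happy", "sad") for i in range(10)]
train_set, eval_set = stratified_split(files)
```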

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 10
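With a cosine scheduler and a warmup ratio of 0.1, the learning rate rises linearly from zero to 5e-05 and then decays along a half cosine. The sketch below reimplements that shape in plain Python, following the formula used by Transformers' `get_cosine_schedule_with_warmup` with its default half cycle. The total of 10480 steps is taken from the training results in this card; rounding the warmup length up with `ceil` is an assumption about how the ratio was converted to steps.

```python
import math

BASE_LR = 5e-05
TOTAL_STEPS = 10480   # 10 epochs x 1048 steps, per the training results
WARMUP_RATIO = 0.1
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)

def lr_at(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS,
          warmup_steps=WARMUP_STEPS):
    """Learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Learning rate at the end of each epoch
per_epoch = [lr_at(epoch * 1048) for epoch in range(1, 11)]
```

Under these assumptions the warmup spans exactly the first epoch, so the peak learning rate is reached at step 1048.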

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy	Precision	Recall	F1
0.8904	1.0	1048	1.1923	0.5773	0.6162	0.5563	0.5494
1.1394	2.0	2096	1.0143	0.6071	0.6481	0.6189	0.6057
0.9373	3.0	3144	1.0585	0.6126	0.6296	0.6254	0.6119
0.7405	4.0	4192	0.9580	0.6514	0.6732	0.6562	0.6576
1.1638	5.0	5240	0.9940	0.6486	0.6485	0.6627	0.6435
0.6741	6.0	6288	1.0307	0.6628	0.6710	0.6711	0.6646
0.604	7.0	7336	1.0248	0.6667	0.6678	0.6751	0.6682
0.6835	8.0	8384	1.0396	0.6722	0.6803	0.6790	0.6743
0.5421	9.0	9432	1.0493	0.6714	0.6765	0.6785	0.6736
0.5728	10.0	10480	1.0601	0.6731	0.6761	0.6794	0.6738
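The table shows that the metrics do not all peak at the same epoch, which matters when choosing a checkpoint. The sketch below, with values copied from the table, finds the best epoch for each criterion. The helper name `best_epoch` is illustrative.

```python
# (epoch, validation_loss, accuracy, f1), copied from the table above
RESULTS = [
    (1, 1.1923, 0.5773, 0.5494),
    (2, 1.0143, 0.6071, 0.6057),
    (3, 1.0585, 0.6126, 0.6119),
    (4, 0.9580, 0.6514, 0.6576),
    (5, 0.9940, 0.6486, 0.6435),
    (6, 1.0307, 0.6628, 0.6646),
    (7, 1.0248, 0.6667, 0.6682),
    (8, 1.0396, 0.6722, 0.6743),
    (9, 1.0493, 0.6714, 0.6736),
    (10, 1.0601, 0.6731, 0.6738),
]

def best_epoch(rows, column, higher_is_better):
    """Return the epoch whose value in `column` is best."""
    pick = max if higher_is_better else min
    return pick(rows, key=lambda row: row[column])[0]

best_by_loss = best_epoch(RESULTS, 1, higher_is_better=False)
best_by_accuracy = best_epoch(RESULTS, 2, higher_is_better=True)
best_by_f1 = best_epoch(RESULTS, 3, higher_is_better=True)
```

Validation loss is lowest at epoch 4, F1 is highest at epoch 8, and accuracy is highest at epoch 10, whose row matches the evaluation results reported at the top of the card.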

Framework versions

Transformers 4.38.1
Pytorch 2.2.1
Datasets 2.17.1
Tokenizers 0.15.2

🔧 Technical Details

The model uses the pre-trained facebook/wav2vec2-large-xlsr-53 model for feature extraction and is then fine-tuned on the speech emotion recognition dataset. Its cross-lingual performance reflects the generalization ability of the pre-trained model.

📄 License

The model is released under the Apache 2.0 license.
