XLSR-Wav2Vec Speech Emotion Recognition Model: Open-Source and Free, Recognizes Five Emotions Including Anger and Disgust

Xlsr Wav2vec Speech Emotion Recognition

Developed by harshit345

A speech emotion recognition model based on the XLSR-Wav2Vec architecture, capable of identifying five basic emotions: anger, disgust, fear, happiness, and sadness.

Audio Classification

Transformers

Language: English

Open Source License: Apache-2.0

Tags: #Speech Emotion Recognition #Multi-emotion Classification #High-precision Audio Analysis

Downloads 498

Release Time: 3/2/2022

Model Overview

This model uses the Wav2Vec2 architecture for speech emotion classification, suitable for identifying the emotional state of speakers from speech signals.

Model Features

Multi-emotion Recognition

Capable of identifying five basic emotions: anger, disgust, fear, happiness, and sadness.

Wav2Vec2-based Architecture

Utilizes the self-supervised learning capabilities of Wav2Vec2, performing well on speech emotion recognition tasks.

High Accuracy

Achieves an overall accuracy of 80.6% on test data, with per-class F1 scores ranging from 0.78 to 0.85.

Model Capabilities

Speech Emotion Classification

Speech Signal Processing

Emotion Probability Scoring

Use Cases

Human-Computer Interaction

Customer Service System Emotion Analysis

Analyzes the emotional state in customer speech to help the customer service system make smarter responses.

Identifies negative emotions such as anger and disgust in customer speech.

Mental Health

Emotional State Monitoring

Analyzes users' emotional changes through daily speech.

Can be used for auxiliary diagnosis of psychological disorders such as depression.

🚀 XLSR-Wav2Vec Speech Emotion Recognition

This project focuses on speech emotion recognition using the XLSR-Wav2Vec model, offering installation guidance, prediction code examples, and evaluation results.

🚀 Quick Start

Prerequisites

Ensure the necessary packages are installed. In a notebook environment such as Colab, run the following commands (omit the leading ! when running them in a shell):

!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
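
Before running the examples, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this project.

```python
import importlib.util

# Packages the installation commands above are expected to provide.
REQUIRED = ["datasets", "transformers", "torchaudio", "librosa"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported missing, rerun the corresponding install command before continuing.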

✨ Features

Audio Classification: Capable of classifying speech emotions such as anger, disgust, fear, happiness, and sadness.
Model Evaluation: Provides detailed evaluation metrics for each emotion class and overall performance.

💻 Usage Examples

Basic Usage

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor
import librosa
import IPython.display as ipd
import numpy as np
import pandas as pd

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "harshit345/xlsr-wav2vec-speech-emotion-recognition"
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate
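# Wav2Vec2ForSpeechClassification is not part of the transformers library; it is a custom class
# that must be defined or imported (for example, from the project's Colab notebook) before this line runs.
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)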

def speech_file_to_array_fn(path, sampling_rate):
    # Load the waveform, then resample from the file's rate to the rate the model expects.
    speech_array, _sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech

def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}
    with torch.no_grad():
        logits = model(**inputs).logits
    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{round(score * 100, 3):.1f}%"} for i, score in enumerate(scores)]
    return outputs
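
The scoring step inside predict (softmax over the logits, then one labeled percentage per class) can be illustrated without loading the model. This is a sketch in plain Python: the `ID2LABEL` mapping and the logits are made-up stand-ins for `config.id2label` and real model output, and `format_scores` is a hypothetical helper.

```python
import math

# Assumed label mapping for illustration; the real one comes from config.id2label.
ID2LABEL = {0: "anger", 1: "disgust", 2: "fear", 3: "happiness", 4: "sadness"}


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def format_scores(logits, id2label=ID2LABEL):
    """Pair each class label with its probability, formatted as a percentage."""
    return [
        {"Emotion": id2label[i], "Score": f"{p * 100:.1f}%"}
        for i, p in enumerate(softmax(logits))
    ]


# Made-up logits, not real model output.
example = format_scores([2.0, 0.1, -0.7, -1.0, -3.0])
print(example)
```

Because softmax normalizes the logits, the five scores always sum to 100% up to rounding, so a score is best read as relative confidence among the five classes.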

Prediction Example

# path for a sample
path = '/data/jtes_v1.1/wav/f01/ang/f01_ang_01.wav'   
outputs = predict(path, sampling_rate)
print(outputs)

The output will be similar to:

[{'Emotion': 'anger', 'Score': '78.3%'},
 {'Emotion': 'disgust', 'Score': '11.7%'},
 {'Emotion': 'fear', 'Score': '5.4%'},
 {'Emotion': 'happiness', 'Score': '4.1%'},
 {'Emotion': 'sadness', 'Score': '0.5%'}]
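
Since predict returns every class with a formatted score string, downstream code usually needs only the top label. A minimal sketch, assuming the output format shown above; `top_emotion` is a hypothetical helper, not part of the project.

```python
def top_emotion(outputs):
    """Return (label, score) for the highest-scoring entry in predict()-style output."""

    def as_float(entry):
        # Scores are strings such as '78.3%'; strip the percent sign before parsing.
        return float(entry["Score"].rstrip("%"))

    best = max(outputs, key=as_float)
    return best["Emotion"], as_float(best)


# The sample output shown above.
outputs = [
    {"Emotion": "anger", "Score": "78.3%"},
    {"Emotion": "disgust", "Score": "11.7%"},
    {"Emotion": "fear", "Score": "5.4%"},
    {"Emotion": "happiness", "Score": "4.1%"},
    {"Emotion": "sadness", "Score": "0.5%"},
]

label, score = top_emotion(outputs)
print(label, score)  # prints: anger 78.3
```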

📚 Documentation

Evaluation

The following table summarizes the scores obtained by the model overall and for each class.

Emotion     precision  recall  f1-score  accuracy
anger       0.82       1.00    0.81
disgust     0.85       0.96    0.85
fear        0.78       0.88    0.80
happiness   0.84       0.71    0.78
sadness     0.86       1.00    0.79
Overall                                  0.806
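
The source does not state how these scores were computed. As a reference for reading the table, the sketch below shows the standard definitions of per-class precision, recall, and F1, plus overall accuracy, in plain Python. The labels are toy data for illustration only, not the model's test set, and the function names are hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels for illustration only.
y_true = ["anger", "anger", "fear", "sadness", "fear", "anger"]
y_pred = ["anger", "fear", "fear", "sadness", "fear", "anger"]

for emotion in ("anger", "fear", "sadness"):
    p, r, f = per_class_metrics(y_true, y_pred, emotion)
    print(f"{emotion}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy(y_true, y_pred):.3f}")
```

Precision measures how often a predicted emotion is correct, while recall measures how many true instances of that emotion are found; F1 is their harmonic mean.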

Colab Notebook

You can access the Colab Notebook for this project here.

📄 License

This project is licensed under the Apache-2.0 License.
