Wav2Vec2-XLSR Greek Speech Emotion Recognition Model - Open-source Recognition of Five Emotions including Anger

Developed by m3hrdadfi

A Greek speech emotion recognition model based on the Wav2Vec 2.0 architecture, capable of identifying five emotions: anger, disgust, fear, happiness, and sadness.

Audio Classification | Other | Open Source | License: Apache-2.0 | #Greek Emotion Recognition #Multi-emotion Classification #High-precision Speech Analysis

Downloads 213

Release Time: 3/2/2022

Model Overview

This model is built on the Wav2Vec 2.0 architecture and trained specifically for Greek speech emotion recognition. It classifies speech into five basic emotions, with a reported overall accuracy of 91%.

Model Features

High Accuracy Emotion Recognition

Achieves an overall accuracy of 91% in Greek speech emotion recognition tasks.

Multi-emotion Classification

Capable of recognizing five basic emotions: anger, disgust, fear, happiness, and sadness.

Based on Wav2Vec 2.0

Uses the Wav2Vec 2.0 architecture for speech feature extraction and classification.

Model Capabilities

Greek Speech Emotion Recognition

Speech Emotion Classification

Audio Emotion Analysis

Use Cases

Emotion Analysis

Customer Service Call Emotion Analysis

Analyze customer emotional states in customer service calls.

Can identify customer emotions such as anger or happiness, helping to improve service quality.

Psychological State Assessment

Assess the speaker's psychological state through speech analysis.

Can assist in identifying negative emotions among the model's labels, such as fear and sadness; depression and anxiety are not among its output classes.

🚀 Emotion Recognition in Greek (el) Speech using Wav2Vec 2.0

This project utilizes the Wav2Vec 2.0 model to perform emotion recognition on Greek (el) speech, offering a solution for automatic speech emotion analysis.

🚀 Quick Start

📦 Installation

To use this project, install the following packages:

# required packages (the leading "!" runs each command from a notebook cell)
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa

💻 Usage Examples

🔍 Basic Usage

First, import the necessary libraries:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor

import librosa
import IPython.display as ipd
import numpy as np
import pandas as pd

Then, set up the device, load the model and feature extractor:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition"
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate
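# Wav2Vec2ForSpeechClassification is a custom class, not part of transformers;
# it must be defined or imported before this line (its definition is not shown here).
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)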

Next, define the functions for converting speech files to arrays and making predictions:

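def speech_file_to_array_fn(path, sampling_rate):
    # Load the waveform along with the file's native sampling rate.
    speech_array, _sampling_rate = torchaudio.load(path)
    # Resample from the native rate to the rate the feature extractor expects.
    resampler = torchaudio.transforms.Resample(orig_freq=_sampling_rate, new_freq=sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech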


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    # Move every input tensor to the same device as the model.
    inputs = {key: inputs[key].to(device) for key in inputs}

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        logits = model(**inputs).logits

    # Convert logits to class probabilities and format each as a percentage.
    scores = F.softmax(logits, dim=1).cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{score * 100:.1f}%"} for i, score in enumerate(scores)]
    return outputs
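
The scoring step inside predict can be illustrated without loading the model. The following is a minimal sketch in plain Python, assuming a made-up logit vector and the label order shown in this document's example output; softmax, format_scores, LABELS, and example_logits are illustrative names, not part of the project.

```python
import math

# Label order assumed from the example output in this document; the real
# mapping comes from config.id2label.
LABELS = ["anger", "disgust", "fear", "happiness", "sadness"]


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def format_scores(logits, labels=LABELS):
    """Pair each label with its probability, formatted as a percentage."""
    probs = softmax(logits)
    return [{"Emotion": label, "Score": f"{p * 100:.1f}%"} for label, p in zip(labels, probs)]


# Hypothetical logits, made up for illustration (not real model output).
example_logits = [-2.0, 5.0, -1.5, -0.5, 0.0]
formatted = format_scores(example_logits)
for item in formatted:
    print(item)
```

The largest logit always receives the largest probability, so the predicted emotion is the label at the index of the maximum logit.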

Finally, make a prediction:

path = "/path/to/disgust.wav"
outputs = predict(path, sampling_rate)

The output will look like this:

[
{'Emotion': 'anger', 'Score': '0.0%'},
{'Emotion': 'disgust', 'Score': '99.2%'},
{'Emotion': 'fear', 'Score': '0.1%'},
{'Emotion': 'happiness', 'Score': '0.3%'},
{'Emotion': 'sadness', 'Score': '0.5%'}
]
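
In practice, a caller often needs only the single most likely emotion rather than the full list. A minimal sketch, assuming the output format shown above; top_emotion is a hypothetical helper, not part of the project:

```python
def top_emotion(outputs):
    """Return the (emotion, score) pair with the highest score."""
    # Scores are strings such as "99.2%", so strip the sign and parse as float.
    best = max(outputs, key=lambda item: float(item["Score"].rstrip("%")))
    return best["Emotion"], float(best["Score"].rstrip("%"))


# The example output from this document.
outputs = [
    {"Emotion": "anger", "Score": "0.0%"},
    {"Emotion": "disgust", "Score": "99.2%"},
    {"Emotion": "fear", "Score": "0.1%"},
    {"Emotion": "happiness", "Score": "0.3%"},
    {"Emotion": "sadness", "Score": "0.5%"},
]

label, score = top_emotion(outputs)
print(label, score)  # prints: disgust 99.2
```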

📚 Documentation

🔧 Evaluation

The following table summarizes the model's per-class precision, recall, and F1-score, along with its overall accuracy:

Emotion	Precision	Recall	F1-Score	Accuracy
Anger	0.92	1.00	0.96	-
Disgust	0.85	0.96	0.90	-
Fear	0.88	0.88	0.88	-
Happiness	0.94	0.71	0.81	-
Sadness	0.96	1.00	0.98	-
Overall	-	-	-	0.91
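
As a sanity check on the evaluation table, the F1 column can be recomputed from the precision and recall columns. The sketch below assumes the table uses the standard definition of F1 (the harmonic mean of precision and recall); REPORTED and f1_score are illustrative names.

```python
# Per-class (precision, recall, reported F1), copied from the evaluation table.
REPORTED = {
    "Anger": (0.92, 1.00, 0.96),
    "Disgust": (0.85, 0.96, 0.90),
    "Fear": (0.88, 0.88, 0.88),
    "Happiness": (0.94, 0.71, 0.81),
    "Sadness": (0.96, 1.00, 0.98),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for each class, rounded to two decimals like the table.
recomputed = {name: round(f1_score(p, r), 2) for name, (p, r, _) in REPORTED.items()}

for name, (_, _, reported_f1) in REPORTED.items():
    print(f"{name}: reported {reported_f1:.2f}, recomputed {recomputed[name]:.2f}")
```

Because the precision and recall values are themselves rounded to two decimals, agreement with the reported F1 can only be expected up to rounding.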

❓ Questions?

Post an issue on the project's GitHub repository.

📄 License

This project is licensed under the Apache 2.0 license.

🔍 Additional Information

Language: Greek (el)
Datasets: aesdd
Tags: audio, automatic-speech-recognition, speech, speech-emotion-recognition
