Model for Dimensional Speech Emotion Recognition based on Wav2vec 2.0
A model for dimensional speech emotion recognition leveraging Wav2vec 2.0, offering predictions for arousal, dominance, and valence.
Quick Start
This model is designed for research purposes only. A commercial license for a model trained on more data can be obtained from audEERING. It takes a raw audio signal as input and outputs predictions for arousal, dominance, and valence in the range of approximately 0...1. Additionally, it provides the pooled states of the last transformer layer.
The model was created by fine-tuning Wav2Vec2-Large-Robust on MSP-Podcast (v1.7). Before fine-tuning, the model was pruned from 24 to 12 transformer layers. An ONNX export of the model is available from doi:10.5281/zenodo.6221127. Further details can be found in the associated paper and tutorial.
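The ONNX export can be run with onnxruntime. The following is a minimal sketch under the assumption that the exported graph takes a single float32 signal input of shape (batch, samples); check the export from the Zenodo record for the exact input and output names and any required preprocessing (for example the normalization applied by the processor in the usage example below).

import numpy as np
import onnxruntime as ort

# hypothetical local path to the ONNX export from doi:10.5281/zenodo.6221127
session = ort.InferenceSession('model.onnx')

# dummy signal: one second of silence at 16 kHz, shape (batch, samples)
signal = np.zeros((1, 16000), dtype=np.float32)

# feed the signal to the first graph input and fetch all outputs
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: signal})
print([o.shape for o in outputs])  # expected: pooled states and arousal/dominance/valence predictions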
Features
- Research Focus: Intended primarily for research applications.
- Commercial Licensing: A commercial license for a model trained on more data can be obtained from audEERING.
- Multi-Output: Predicts arousal, dominance, and valence, and provides the pooled states of the last transformer layer.
- Fine-Tuned: Created by fine-tuning Wav2Vec2-Large-Robust on the MSP-Podcast (v1.7) dataset.
- Pruned Architecture: Pruned from 24 to 12 transformer layers before fine-tuning.
- ONNX Export: An ONNX export is available for broader deployment.
Installation
No dedicated installation steps are required. The usage example below only needs numpy, torch, and transformers, which can be installed with pip; the model itself is loaded directly from the Hugging Face Hub.
Usage Examples
Basic Usage
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)
class RegressionHead(nn.Module):
    r"""Regression head for arousal, dominance, and valence."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
class EmotionModel(Wav2Vec2PreTrainedModel):
    r"""Speech emotion classifier."""

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(
        self,
        input_values,
    ):
        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        hidden_states = torch.mean(hidden_states, dim=1)
        logits = self.classifier(hidden_states)
        return hidden_states, logits
# load model from hub
device = 'cpu'
model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = EmotionModel.from_pretrained(model_name).to(device)

# dummy signal: one second of silence at 16 kHz
sampling_rate = 16000
signal = np.zeros((1, sampling_rate), dtype=np.float32)
def process_func(
    x: np.ndarray,
    sampling_rate: int,
    embeddings: bool = False,
) -> np.ndarray:
    r"""Predict emotions or extract embeddings from raw audio signal."""
    y = processor(x, sampling_rate=sampling_rate)
    y = y['input_values'][0]
    y = y.reshape(1, -1)
    y = torch.from_numpy(y).to(device)

    # run through model
    with torch.no_grad():
        y = model(y)[0 if embeddings else 1]

    # convert to numpy
    y = y.detach().cpu().numpy()

    return y

print(process_func(signal, sampling_rate))                   # arousal, dominance, valence
print(process_func(signal, sampling_rate, embeddings=True))  # pooled states of the last transformer layer
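The three predicted values can also be mapped to named dimensions; this assumes the output order follows the listing in the Quick Start above (arousal, dominance, valence):

preds = process_func(signal, sampling_rate)[0]
# assumed order of the output dimensions: arousal, dominance, valence
print({name: float(value) for name, value in zip(['arousal', 'dominance', 'valence'], preds)})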
Advanced Usage
No advanced usage example is provided. For more complex scenarios, you can feed your own audio signals into process_func or adjust the model parameters, starting from the basic example above; a minimal sketch for loading a real recording follows.
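The sketch below loads a recording with librosa (not a dependency stated in this README) and resamples it to the model's 16 kHz sampling rate before calling process_func; the file path is hypothetical.

import librosa

# hypothetical path; librosa resamples to 16 kHz and returns a mono float signal
wav, sr = librosa.load('speech.wav', sr=16000, mono=True)
signal = wav.reshape(1, -1).astype(np.float32)  # shape (1, samples), as in the basic example

print(process_func(signal, sr))                   # arousal, dominance, valence
print(process_func(signal, sr, embeddings=True))  # pooled states of the last transformer layer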
Documentation
The associated paper and tutorial provide more detailed information about the model.
License
This model is released under the CC BY-NC-SA 4.0 license.