Model for Dimensional Speech Emotion Recognition based on Wav2vec 2.0
A model for dimensional speech emotion recognition leveraging Wav2vec 2.0, offering predictions for arousal, dominance, and valence.
Quick Start
This model is designed for research purposes only. A commercial license for a model trained on more data can be obtained from audEERING. It takes a raw audio signal as input and outputs predictions for arousal, dominance, and valence in the range of approximately 0...1. Additionally, it provides the pooled states of the last transformer layer.
The model was created by fine-tuning Wav2Vec2-Large-Robust on MSP-Podcast (v1.7). Before fine-tuning, the model was pruned from 24 to 12 transformer layers. An ONNX export of the model is available from doi:10.5281/zenodo.6221127. Further details can be found in the associated paper and tutorial.
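The ONNX export can be run with onnxruntime. The following is a minimal sketch under the assumption that the exported graph takes a single float32 signal input of shape (batch, samples); check the export from the Zenodo record for the exact input and output names and any required preprocessing (for example the normalization applied by the processor in the usage example below).

import numpy as np
import onnxruntime as ort

# hypothetical local path to the ONNX export from doi:10.5281/zenodo.6221127
session = ort.InferenceSession('model.onnx')

# dummy signal: one second of silence at 16 kHz, shape (batch, samples)
signal = np.zeros((1, 16000), dtype=np.float32)

# feed the signal to the first graph input and fetch all outputs
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: signal})
print([o.shape for o in outputs])  # expected: pooled states and arousal/dominance/valence predictions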
Features
- Research Focus: Intended primarily for research applications.
- Commercial Licensing: A commercial license for a model trained on more data can be obtained from audEERING.
- Multi-Output: Predicts arousal, dominance, and valence, and provides the pooled states of the last transformer layer.
- Fine-Tuned: Created by fine-tuning Wav2Vec2-Large-Robust on the MSP-Podcast (v1.7) dataset.
- Pruned Architecture: Pruned from 24 to 12 transformer layers before fine-tuning.
- ONNX Export: An ONNX export is available for broader deployment.
Installation
No dedicated installation steps are required. The usage example below only needs numpy, torch, and transformers, which can be installed with pip; the model itself is loaded directly from the Hugging Face Hub.
Usage Examples
Basic Usage
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)
class RegressionHead(nn.Module):
    r"""Regression head for arousal, dominance, and valence."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
class EmotionModel(Wav2Vec2PreTrainedModel):
    r"""Speech emotion classifier."""

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(
        self,
        input_values,
    ):
        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        hidden_states = torch.mean(hidden_states, dim=1)
        logits = self.classifier(hidden_states)
        return hidden_states, logits
# load model from hub
device = 'cpu'
model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = EmotionModel.from_pretrained(model_name).to(device)

# dummy signal: one second of silence at 16 kHz
sampling_rate = 16000
signal = np.zeros((1, sampling_rate), dtype=np.float32)
def process_func(
    x: np.ndarray,
    sampling_rate: int,
    embeddings: bool = False,
) -> np.ndarray:
    r"""Predict emotions or extract embeddings from raw audio signal."""
    y = processor(x, sampling_rate=sampling_rate)
    y = y['input_values'][0]
    y = y.reshape(1, -1)
    y = torch.from_numpy(y).to(device)

    # run through model
    with torch.no_grad():
        y = model(y)[0 if embeddings else 1]

    # convert to numpy
    y = y.detach().cpu().numpy()

    return y

print(process_func(signal, sampling_rate))                   # arousal, dominance, valence
print(process_func(signal, sampling_rate, embeddings=True))  # pooled states of the last transformer layer
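The three predicted values can also be mapped to named dimensions; this assumes the output order follows the listing in the Quick Start above (arousal, dominance, valence):

preds = process_func(signal, sampling_rate)[0]
# assumed order of the output dimensions: arousal, dominance, valence
print({name: float(value) for name, value in zip(['arousal', 'dominance', 'valence'], preds)})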
Advanced Usage
No advanced usage example is provided. For more complex scenarios, you can feed your own audio signals into process_func or adjust the model parameters, starting from the basic example above; a minimal sketch for loading a real recording follows.
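The sketch below loads a recording with librosa (not a dependency stated in this README) and resamples it to the model's 16 kHz sampling rate before calling process_func; the file path is hypothetical.

import librosa

# hypothetical path; librosa resamples to 16 kHz and returns a mono float signal
wav, sr = librosa.load('speech.wav', sr=16000, mono=True)
signal = wav.reshape(1, -1).astype(np.float32)  # shape (1, samples), as in the basic example

print(process_func(signal, sr))                   # arousal, dominance, valence
print(process_func(signal, sr, embeddings=True))  # pooled states of the last transformer layer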
Documentation
The associated paper and tutorial provide more detailed information about the model.
License
This model is released under the CC BY-NC-SA 4.0 license.