Wav2vec2-large-superb-er: Open-Source Emotion Recognition Model for Identifying Emotion Categories from Speech

Wav2vec2 Large Superb Er

Developed by superb

This is an emotion recognition model based on the Wav2Vec2-Large model, specifically designed to identify emotion categories from speech.

Audio Classification

Transformers

English

Open Source License: Apache-2.0

#Speech Emotion Recognition #High-precision Classification #IEMOCAP Dataset

Downloads 1,442

Release Time: 3/2/2022

Model Overview

This model is fine-tuned from Facebook's wav2vec2-large-lv60 model on the SUPERB emotion recognition task, primarily used to identify four basic emotion categories from speech.

Model Features

Based on Wav2Vec2 Pre-trained Model

Utilizes the large-scale pre-trained Wav2Vec2 model with powerful speech feature extraction capabilities

Emotion Recognition Capability

Fine-tuned specifically for speech emotion recognition tasks, capable of identifying four basic emotion categories

16kHz Sampling Rate Support

Expects speech input sampled at 16 kHz, matching the audio the base model was pretrained on.

Model Capabilities

Speech Emotion Recognition

Audio Classification

Use Cases

Human-Computer Interaction

Customer Service System Emotion Analysis

Analyzes emotional states in customer speech to help customer service systems make more intelligent responses

Mental Health

Emotional State Monitoring

Analyzes users' emotional changes through speech for mental health applications

🚀 Wav2Vec2-Large for Emotion Recognition

This is a model for emotion recognition based on the Wav2Vec2 architecture, which can predict emotion classes for speech utterances.

🚀 Quick Start

You can use the model via the Audio Classification pipeline or use it directly. See the "💻 Usage Examples" section below for detailed code examples.

✨ Features

Ported from S3PRL's Wav2Vec2 for the SUPERB Emotion Recognition task.
Based on the wav2vec2-large-lv60 model, pretrained on 16kHz sampled speech audio.
Adopts the widely used ER dataset IEMOCAP and follows the conventional evaluation protocol.

💻 Usage Examples

Basic Usage

You can use the model via the Audio Classification pipeline:

from datasets import load_dataset
from transformers import pipeline

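# load a demo dataset of emotion recognition samples
dataset = load_dataset("anton-l/superb_demo", "er", split="session1")

# build the pipeline and classify the first audio file
classifier = pipeline("audio-classification", model="superb/wav2vec2-large-superb-er")
# the model has four classes, so at most four labels come back even with top_k=5
labels = classifier(dataset[0]["file"], top_k=5)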
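
The pipeline call returns a list of dictionaries, each holding a `label` and a `score`. As a minimal sketch of how that result might be consumed, the helper below picks the best-scoring label and discards it when the score is too low. The `top_emotion` helper, the `min_score` threshold, the label names, and the numbers in `sample_output` are all assumptions made up for illustration, not real model output.

```python
def top_emotion(results, min_score=0.5):
    """Return the highest-scoring label, or None if its score is below min_score."""
    if not results:
        return None
    best = max(results, key=lambda r: r["score"])
    return best["label"] if best["score"] >= min_score else None


# Hypothetical pipeline output (illustrative values, not real predictions).
sample_output = [
    {"score": 0.71, "label": "hap"},
    {"score": 0.18, "label": "neu"},
    {"score": 0.07, "label": "ang"},
    {"score": 0.04, "label": "sad"},
]

print(top_emotion(sample_output))
print(top_emotion(sample_output, min_score=0.9))
```

A threshold like this is one way to avoid acting on low-confidence predictions; a suitable value would have to be chosen on real data.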

Advanced Usage

Or use the model directly:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

def map_to_array(example):
    speech, _ = librosa.load(example["file"], sr=16000, mono=True)
    example["speech"] = speech
    return example

# load a demo dataset and read audio files
dataset = load_dataset("anton-l/superb_demo", "er", split="session1")
dataset = dataset.map(map_to_array)

model = Wav2Vec2ForSequenceClassification.from_pretrained("superb/wav2vec2-large-superb-er")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("superb/wav2vec2-large-superb-er")

# compute attention masks and normalize the waveform if needed
inputs = feature_extractor(dataset[:4]["speech"], sampling_rate=16000, padding=True, return_tensors="pt")

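# run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# pick the highest-scoring class for each utterance and map ids to label names
predicted_ids = torch.argmax(logits, dim=-1)
labels = [model.config.id2label[_id] for _id in predicted_ids.tolist()]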
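
The example above keeps only the argmax class. If per-class probabilities are wanted, the logits can be passed through a softmax first. The following is a dependency-free sketch of that step; the `id2label` mapping and the logit values are assumed for illustration and may not match the model's actual configuration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical mapping and logits for a single utterance (assumed values).
id2label = {0: "neu", 1: "hap", 2: "ang", 3: "sad"}
logits = [1.2, 3.4, -0.5, 0.1]

probs = softmax(logits)
# Pair each probability with its label and sort from most to least likely.
ranked = sorted(
    ((p, id2label[i]) for i, p in enumerate(probs)),
    reverse=True,
)
for p, label in ranked:
    print(f"{label}: {p:.3f}")
```

With PyTorch tensors, the same step is `torch.softmax(logits, dim=-1)`.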

📚 Documentation

Model description

This is a ported version of S3PRL's Wav2Vec2 for the SUPERB Emotion Recognition task.

The base model is wav2vec2-large-lv60, which is pretrained on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.
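
One way to guard against a sampling-rate mismatch is to read the rate from the file header before running inference. The sketch below does this for PCM WAV files using only the Python standard library. The helper names (`wav_sample_rate`, `needs_resampling`, `write_silence`) are hypothetical, and the silent files exist only to make the example self-contained.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000  # the rate the model card asks for


def wav_sample_rate(path):
    """Read the sampling rate from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def needs_resampling(path, expected=EXPECTED_RATE):
    """Return True when the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != expected


def write_silence(path, rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (fixture for this example)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


with tempfile.TemporaryDirectory() as tmp:
    ok_path = os.path.join(tmp, "ok.wav")
    bad_path = os.path.join(tmp, "bad.wav")
    write_silence(ok_path, 16000)
    write_silence(bad_path, 8000)
    ok_needs = needs_resampling(ok_path)
    bad_needs = needs_resampling(bad_path)
    print(ok_needs, bad_needs)
```

Files that fail the check can be resampled on load, as the direct-usage example does with `librosa.load(..., sr=16000)`.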

For more information, refer to SUPERB: Speech processing Universal PERformance Benchmark.

Task and dataset description

Emotion Recognition (ER) predicts an emotion class for each utterance. The most widely used ER dataset IEMOCAP is adopted, and we follow the conventional evaluation protocol: we drop the unbalanced emotion classes to leave the final four classes with a similar amount of data points and cross-validate on five folds of the standard splits.
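
The protocol can be sketched roughly as follows, assuming the five folds correspond to IEMOCAP's five recording sessions with one session held out per fold. The record layout, the label abbreviations, and the toy data are hypothetical; the actual S3PRL recipe may differ in details such as how a development set is carved out.

```python
from collections import Counter

# Assumed abbreviations for the four retained classes.
KEPT_CLASSES = {"neu", "hap", "ang", "sad"}


def filter_classes(utterances, kept=KEPT_CLASSES):
    """Drop utterances whose label is not one of the retained classes."""
    return [u for u in utterances if u["label"] in kept]


def session_folds(utterances, num_sessions=5):
    """Yield (held_out_session, train, test), holding out one session per fold."""
    for held_out in range(1, num_sessions + 1):
        train = [u for u in utterances if u["session"] != held_out]
        test = [u for u in utterances if u["session"] == held_out]
        yield held_out, train, test


# Hypothetical toy records: five sessions, five labels each ("fru" gets dropped).
utterances = [
    {"id": f"s{s}_{i}", "session": s, "label": label}
    for s in range(1, 6)
    for i, label in enumerate(["neu", "hap", "ang", "sad", "fru"])
]

kept = filter_classes(utterances)
folds = list(session_folds(kept))
for held_out, train, test in folds:
    print(held_out, len(train), len(test), Counter(u["label"] for u in test))
```

Holding out whole sessions keeps the speakers in the test fold unseen during training, which is the usual motivation for splitting this way.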

For the original model's training and evaluation instructions refer to the S3PRL downstream task README.

Eval results

The evaluation metric is accuracy.

	s3prl	transformers
session1	0.6564	N/A
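
Accuracy, the metric reported above, is the fraction of utterances whose predicted class matches the reference label. A minimal sketch of the computation follows; the `accuracy` helper and the sample labels are illustrative assumptions and are unrelated to the reported score.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical labels: three of the four predictions are correct.
predictions = ["hap", "neu", "ang", "sad"]
references = ["hap", "neu", "sad", "sad"]

print(accuracy(predictions, references))  # 0.75
```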

BibTeX entry and citation info

@article{yang2021superb,
  title={SUPERB: Speech processing Universal PERformance Benchmark},
  author={Yang, Shu-wen and Chi, Po-Han and Chuang, Yung-Sung and Lai, Cheng-I Jeff and Lakhotia, Kushal and Lin, Yist Y and Liu, Andy T and Shi, Jiatong and Chang, Xuankai and Lin, Guan-Ting and others},
  journal={arXiv preprint arXiv:2105.01051},
  year={2021}
}

📄 License

This model is licensed under the Apache-2.0 license.
