Speech Emotion Recognition
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m for Speech Emotion Recognition (SER). It addresses the challenge of accurately classifying emotions in speech, providing valuable insights for applications like virtual assistants and sentiment analysis.
Quick Start
This section provides a quick guide on how to use the model for speech emotion recognition.
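The quickest way to try the model is the transformers pipeline, also shown in the Usage Examples below. A minimal sketch, assuming transformers and torch are installed and that speech.wav is a local audio file:

from transformers.pipelines import pipeline

# Load the emotion recognition pipeline for this checkpoint
pipe = pipeline(model="KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru", trust_remote_code=True)

# Classify a local audio file; the result is a list of {label, score} dicts
print(pipe("speech.wav"))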
Features
- Fine-tuned Model: Based on facebook/wav2vec2-xls-r-300m, fine-tuned for speech emotion recognition.
- Russian Speech Support: Built on a multilingual backbone and trained on Russian audio data, making it well suited to Russian speech emotion analysis.
- Multiple Emotion Classification: Classifies five emotions: neutral, positive, angry, sad, and other.
Installation
No dedicated installation steps are provided for this model. The usage examples below rely on the transformers, torch, and librosa packages.
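As a minimal sketch, these packages can typically be installed with pip (the package list is inferred from the imports in the usage code, not from an official requirements file):

pip install transformers torch librosa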
Usage Examples
Basic Usage
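# The five emotion labels the model can predict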
emotions = ['neutral', 'positive', 'angry', 'sad', 'other']
Advanced Usage
Pipeline
from transformers.pipelines import pipeline
pipe = pipeline(model="KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru", trust_remote_code=True)
result = pipe("speech.wav")
print(result)
[{'label': 'neutral', 'score': 0.00318}, {'label': 'positive', 'score': 0.00376}, {'label': 'sad', 'score': 0.00145}, {'label': 'angry', 'score': 0.98984}, {'label': 'other', 'score': 0.00176}]
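Since result is a list of label/score dictionaries, the most likely emotion can be picked with a one-liner; a small sketch based on the output format shown above:

# Select the label with the highest score
top = max(result, key=lambda item: item["score"])
print(top["label"])  # 'angry' for the example output above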
AutoModel
import librosa
import torch
import torch.nn.functional as F
from transformers import AutoConfig, Wav2Vec2Processor, AutoModelForAudioClassification
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru"
config = AutoConfig.from_pretrained(model_name_or_path)
processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
sampling_rate = processor.feature_extractor.sampling_rate
model = AutoModelForAudioClassification.from_pretrained(model_name_or_path, trust_remote_code=True).to(device)
def predict(path):
    # Load the audio at the sampling rate expected by the feature extractor
    speech, sr = librosa.load(path, sr=sampling_rate)
    features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    # Run the classifier and convert logits to per-class probabilities
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"label": config.id2label[i], "score": round(score, 5)} for i, score in enumerate(scores)]
    return outputs
print(predict("speech.wav"))
[{'label': 'neutral', 'score': 0.00318}, {'label': 'positive', 'score': 0.00376}, {'label': 'sad', 'score': 0.00145}, {'label': 'angry', 'score': 0.98984}, {'label': 'other', 'score': 0.00176}]
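Because predict returns plain Python dictionaries, it is easy to score many recordings in a loop. A small sketch, assuming a hypothetical audio/ directory of .wav files:

import glob

# Score every .wav file in the (hypothetical) audio/ directory and print the top label
for path in glob.glob("audio/*.wav"):
    scores = predict(path)
    best = max(scores, key=lambda s: s["score"])
    print(path, best["label"], best["score"])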
Documentation
Dataset
The dataset used to fine-tune the pre-trained model is the DUSHA dataset. It consists of about 125,000 audio recordings in Russian labelled with four basic emotions that usually appear in a dialog with a virtual assistant: happiness (positive), sadness, anger, and neutral. The model additionally predicts an other class for recordings that do not fit these four emotions.
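The mapping between class indices and these emotion labels is stored in the model configuration and can be inspected directly; a small sketch (the exact index order is whatever the checkpoint defines):

from transformers import AutoConfig

# Print the index-to-label mapping used by the classification head
config = AutoConfig.from_pretrained("KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru")
print(config.id2label)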
Evaluation
It achieves the following results:
- Training Loss: 0.528700
- Validation Loss: 0.349617
- Accuracy: 0.901369
| emotion | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| neutral | 0.92 | 0.94 | 0.93 | 15886 |
| positive | 0.85 | 0.79 | 0.82 | 2481 |
| sad | 0.77 | 0.82 | 0.79 | 2506 |
| angry | 0.89 | 0.83 | 0.86 | 3072 |
| other | 0.99 | 0.74 | 0.85 | 226 |
| accuracy | | | 0.90 | 24171 |
| macro avg | 0.89 | 0.82 | 0.85 | 24171 |
| weighted avg | 0.90 | 0.90 | 0.90 | 24171 |
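The per-class table above follows the same layout as scikit-learn's classification_report. Given gold labels and model predictions for a held-out set, a comparable report could be produced as in the sketch below (scikit-learn is not a dependency of the model itself, and the label lists here are hypothetical placeholders):

from sklearn.metrics import classification_report

# Hypothetical gold labels and model predictions for a few held-out clips
y_true = ["neutral", "angry", "sad", "positive", "neutral"]
y_pred = ["neutral", "angry", "sad", "neutral", "neutral"]

print(classification_report(
    y_true,
    y_pred,
    labels=["neutral", "positive", "sad", "angry", "other"],
    zero_division=0,
))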
License
This project is licensed under the Apache-2.0 license.