Speech Emotion Recognition with the Open-Source Whisper Model: Classify Audio into Multiple Emotion Categories for Free


Speech Emotion Recognition with OpenAI Whisper Large V3

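Developed by firdhokk

This project uses the Whisper model for speech emotion recognition, classifying audio into emotion categories such as happy, sad, and surprised.

Audio Classification

Transformers

License: Apache-2.0 #speech-emotion-recognition #high-accuracy-classification #multilingual-support

Downloads: 7,750

Released: 9/21/2024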

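Model Overview

This model is a speech emotion recognition model fine-tuned from OpenAI Whisper Large V3. It classifies the emotion expressed in speech audio.

Model Features

High-accuracy emotion recognition

The model reaches 91.99% accuracy on the test set and can effectively recognize multiple speech emotions.

Built on the Whisper architecture

Fine-tuned from Whisper Large V3, inheriting its audio feature extraction capabilities.

Multi-dataset training

Trained on a combination of speech emotion datasets, including RAVDESS, SAVEE, TESS, and URDU, to improve generalization.

Model Capabilities

Speech emotion recognition

Audio classification

Multi-class emotion recognition

Use Cases

Mental health analysis

Counseling support

Analyzes changes in a client's vocal emotion to help counselors assess the client's emotional state.

Identifies 7 primary emotional states

Customer service

Call quality monitoring

Automatically analyzes emotional changes in customer service calls to assess service quality.

Can monitor agents' emotional state in real time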

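🚀 Speech Emotion Recognition with Whisper

This project uses the Whisper model to recognize emotions in speech. The goal is to classify audio recordings into emotion categories such as happy, sad, and surprised.

🚀 Quick Start

The following example shows how to use the model to predict the emotion of an audio file: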

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import librosa
import torch
import numpy as np

# Load the fine-tuned classifier and its matching feature extractor.
model_id = "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
model = AutoModelForAudioClassification.from_pretrained(model_id)

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, do_normalize=True)
id2label = model.config.id2label  # maps class index -> emotion label

def preprocess_audio(audio_path, feature_extractor, max_duration=30.0):
    # Load the file as mono audio, resampled to the extractor's sampling rate.
    audio_array, _ = librosa.load(audio_path, sr=feature_extractor.sampling_rate)

    # Truncate or zero-pad the waveform to exactly max_duration seconds.
    max_length = int(feature_extractor.sampling_rate * max_duration)
    if len(audio_array) > max_length:
        audio_array = audio_array[:max_length]
    else:
        audio_array = np.pad(audio_array, (0, max_length - len(audio_array)))

    # Convert the waveform into model input features (PyTorch tensors).
    inputs = feature_extractor(
        audio_array,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
    return inputs

def predict_emotion(audio_path, model, feature_extractor, id2label, max_duration=30.0):
    inputs = preprocess_audio(audio_path, feature_extractor, max_duration)

    # Run on the GPU when one is available, otherwise on the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        outputs = model(**inputs)

    # The class with the highest logit is the predicted emotion.
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=-1).item()
    predicted_label = id2label[predicted_id]

    return predicted_label

# Example path from the original project; replace it with your own audio file.
audio_path = "/content/drive/MyDrive/Audio/Speech_URDU/Happy/SM5_F4_H058.wav"

predicted_emotion = predict_emotion(audio_path, model, feature_extractor, id2label)
print(f"Predicted Emotion: {predicted_emotion}")
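The quick-start code returns only the single most likely label. When a confidence score is also useful, the logits can be turned into probabilities with a softmax and ranked. The following is a minimal sketch in plain Python; the function names, the label order in `example_id2label`, and the logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_emotions(logits, id2label, top_k=3):
    # Return the top_k (label, probability) pairs, most likely first.
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:top_k]]

# Hypothetical label order and logits, for illustration only.
example_id2label = {
    0: "angry", 1: "disgust", 2: "fearful", 3: "happy",
    4: "neutral", 5: "sad", 6: "surprised",
}
example_logits = [0.2, -1.3, -0.5, 3.1, 0.9, -0.8, 0.4]

top = rank_emotions(example_logits, example_id2label)
for label, prob in top:
    print(f"{label}: {prob:.3f}")
```

With the real model, `outputs.logits[0].tolist()` and `model.config.id2label` from the quick-start code could be passed in place of the example values.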

✨ Key Features

Uses the Whisper model for speech emotion recognition.
Classifies audio into multiple emotion categories.



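📚 Documentation

🗂 Dataset

The data used for training and evaluation comes from multiple sources, including RAVDESS, SAVEE, TESS, and URDU.

The combined dataset contains recordings labeled with various emotions. The distribution of emotions is as follows:

Emotion	Count
Sad	752
Happy	752
Angry	752
Neutral	716
Disgust	652
Fearful	652
Surprised	652
Calm	192

The classes are roughly balanced, although some emotions have more samples than others. Because the Calm class has too few samples, it was excluded from training.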
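To make the balance concrete, the share of each class can be computed from the counts in the table, with and without the excluded Calm class. This is a small sketch; the lowercase label names and the `class_shares` helper are illustrative choices, not part of the original project.

```python
# Sample counts per emotion, copied from the distribution table above.
counts = {
    "sad": 752, "happy": 752, "angry": 752, "neutral": 716,
    "disgust": 652, "fearful": 652, "surprised": 652, "calm": 192,
}

def class_shares(counts, excluded=()):
    # Drop excluded classes, then return the total and each class's fraction.
    kept = {label: n for label, n in counts.items() if label not in excluded}
    total = sum(kept.values())
    return total, {label: n / total for label, n in kept.items()}

total_all, _ = class_shares(counts)
total_kept, shares = class_shares(counts, excluded=("calm",))

print(total_all, total_kept)  # 5120 4928
for label, share in shares.items():
    print(f"{label:10s} {share:.1%}")
```

After Calm is removed, 4,928 recordings remain, and every class holds between roughly 13% and 15% of them.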

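🎤 Preprocessing

Audio loading: Librosa loads each audio file and converts it to a NumPy array.
Feature extraction: the Whisper feature extractor processes the audio, standardizing and normalizing the features for input to the model.

🔧 Model

The model is Whisper Large V3, fine-tuned for the audio classification task:

Model: openai/whisper-large-v3
Output: emotion labels ('Angry', 'Disgust', 'Fearful', 'Happy', 'Neutral', 'Sad', 'Surprised')

The emotion labels are mapped to numeric IDs, which are used for training and evaluation.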
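The label-to-ID mapping described above can be sketched as two dictionaries. The index order below simply follows the order in which the seven labels are listed and is an assumption; the mapping actually used by the model is stored in `model.config.id2label`.

```python
# The seven emotion labels, in the order they are listed above.
# Assumption: this order is illustrative and may differ from the model's own.
labels = ["Angry", "Disgust", "Fearful", "Happy", "Neutral", "Sad", "Surprised"]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

def encode(label_names):
    # Convert label names to integer class IDs (used as training targets).
    return [label2id[name] for name in label_names]

def decode(ids):
    # Convert predicted class IDs back to readable label names.
    return [id2label[i] for i in ids]

print(encode(["Happy", "Sad"]))
print(decode(encode(["Happy", "Sad"])))
```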

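⚙️ Training

The model was trained with the following hyperparameters:

Learning rate: 5e-05
Training batch size: 2
Evaluation batch size: 2
Random seed: 42
Gradient accumulation steps: 5
Total training batch size: 10 (effective batch size after gradient accumulation)
Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
Learning rate scheduler: linear
Learning rate scheduler warmup ratio: 0.1
Number of epochs: 25
Mixed-precision training: native AMP (automatic mixed precision)

These settings help keep training efficient and stable, especially when working with large datasets and deep models such as Whisper. Training was tracked and monitored with Wandb.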
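Two of the settings above involve a little arithmetic: the effective batch size and the linear schedule with warmup. The sketch below reproduces both in plain Python. It assumes a single device and a hypothetical `TOTAL_STEPS` of 1000, since the source does not state the total number of optimizer steps, and it follows the usual linear warmup followed by linear decay to zero rather than any code from the original project.

```python
import math

PER_DEVICE_BATCH = 2   # training batch size
GRAD_ACCUM_STEPS = 5   # gradient accumulation steps
PEAK_LR = 5e-5         # learning rate
WARMUP_RATIO = 0.1     # fraction of steps spent warming up

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    # Samples contributing to each optimizer update (num_devices is assumed).
    return per_device_batch * grad_accum_steps * num_devices

def linear_lr(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    # Linear warmup from 0 to peak_lr, then linear decay back to 0.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

TOTAL_STEPS = 1000  # hypothetical value, for illustration only

print(effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS))  # 10
for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

The effective batch size of 2 × 5 = 10 matches the total training batch size listed above.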

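📊 Metrics

The evaluation metrics obtained after training are:

Loss: 0.5008
Accuracy: 0.9199
Precision: 0.9230
Recall: 0.9199
F1 score: 0.9198

These metrics show the model's performance on the speech emotion recognition task. The high accuracy, precision, recall, and F1 values indicate that the model identifies emotional states from speech effectively.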
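The reported recall equals the reported accuracy (0.9199), which is what support-weighted averaging over classes produces. The source does not state the averaging method, so this is an inference. The sketch below computes accuracy and weighted precision, recall, and F1 in plain Python on made-up predictions; `weighted_metrics` and the toy data are hypothetical.

```python
from collections import Counter
import math

def weighted_metrics(y_true, y_pred):
    # Accuracy plus precision/recall/F1 averaged with class-support weights.
    n = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / count
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        weight = count / n
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels and predictions, for illustration only.
y_true = ["happy", "happy", "sad", "sad", "angry", "angry"]
y_pred = ["happy", "sad", "sad", "sad", "angry", "happy"]

metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

On any input, the weighted recall from this function equals the accuracy, because each class's recall is weighted by exactly its share of the samples.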

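🧪 Results

After training, the model was evaluated on the test dataset, and the results were monitored with Wandb.

Training Loss	Epoch	Step	Validation Loss	Accuracy	Precision	Recall	F1 Score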
0.4948	0.9995	394	0.4911	0.8286	0.8449	0.8286	0.8302
0.6271	1.9990	788	0.5307	0.8225	0.8559	0.8225	0.8277
0.2364	2.9985	1182	0.5076	0.8692	0.8727	0.8692	0.8684
0.0156	3.9980	1576	0.5669	0.8732	0.8868	0.8732	0.8745
0.2305	5.0	1971	0.4578	0.9108	0.9142	0.9108	0.9114
0.0112	5.9995	2365	0.4701	0.9108	0.9159	0.9108	0.9114
0.0013	6.9990	2759	0.5232	0.9138	0.9204	0.9138	0.9137
0.1894	7.9985	3153	0.5008	0.9199	0.9230	0.9199	0.9198
0.0877	8.9980	3547	0.5517	0.9138	0.9152	0.9138	0.9138
0.1471	10.0	3942	0.5856	0.8895	0.9002	0.8895	0.8915
0.0026	10.9995	4336	0.8334	0.8773	0.8949	0.8773	0.8770
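The table can be scanned programmatically to pick a checkpoint. The sketch below copies the epoch, step, validation loss, and accuracy columns and selects the best row by each criterion. The selection rule is an illustration; the source does not say how the final checkpoint was chosen.

```python
# (epoch, step, validation_loss, accuracy), copied from the results table.
rows = [
    (0.9995, 394, 0.4911, 0.8286),
    (1.9990, 788, 0.5307, 0.8225),
    (2.9985, 1182, 0.5076, 0.8692),
    (3.9980, 1576, 0.5669, 0.8732),
    (5.0, 1971, 0.4578, 0.9108),
    (5.9995, 2365, 0.4701, 0.9108),
    (6.9990, 2759, 0.5232, 0.9138),
    (7.9985, 3153, 0.5008, 0.9199),
    (8.9980, 3547, 0.5517, 0.9138),
    (10.0, 3942, 0.5856, 0.8895),
    (10.9995, 4336, 0.8334, 0.8773),
]

best_by_accuracy = max(rows, key=lambda row: row[3])
best_by_loss = min(rows, key=lambda row: row[2])

print("highest accuracy:", best_by_accuracy)
print("lowest validation loss:", best_by_loss)
```

The row with the highest accuracy (step 3153, validation loss 0.5008, accuracy 0.9199) matches the final metrics reported for the model, while the lowest validation loss occurs earlier, at step 1971. Validation loss climbs to 0.8334 by step 4336 even though training loss stays low, which may point to overfitting in the later epochs.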