Speech Emotion Recognition With OpenAI Whisper Large V3

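Developed by firdhokk

This project uses the Whisper model for speech emotion recognition, classifying audio into emotion categories such as happy, sad, and surprised.

Audio classification

Transformers

License: Apache-2.0 #SpeechEmotionRecognition #HighAccuracyClassification #MultilingualSupport

Downloads: 7,750

Released: 9/21/2024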

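Model Overview

This is a speech emotion recognition model fine-tuned from OpenAI Whisper Large V3. It classifies the emotion expressed in a speech recording.

Model Features

High-accuracy emotion recognition

The model reaches 91.99% accuracy on the test set across multiple speech emotions.

Built on the Whisper architecture

Fine-tuning starts from Whisper Large V3 and reuses its audio feature extraction.

Trained on multiple datasets

Training combines several speech emotion datasets, including RAVDESS, SAVEE, TESS, and URDU, to improve generalization.

Model Capabilities

Speech emotion recognition

Audio classification

Multi-class emotion recognition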

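Use Cases

Mental health analysis

Counseling support

Analyzes changes in a client's vocal emotion to help counselors assess the client's emotional state.

Recognizes 7 main emotional states

Customer service

Call quality monitoring

Automatically analyzes emotional changes in customer service calls to assess service quality.

Can monitor agents' emotional state in real time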

Library name: transformers
License: apache-2.0
Base model: openai/whisper-large-v3
Tags:
generated_from_trainer
Metrics:
accuracy
precision
recall
f1
Model index:
Name: speech-emotion-recognition-with-openai-whisper-large-v3
Results: []

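🎧 Speech Emotion Recognition with Whisper

This project uses the Whisper model for speech emotion recognition. The goal is to classify audio into emotion categories such as happy, sad, and surprised.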

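🗂 Datasets

Training and evaluation data come from multiple datasets: RAVDESS, SAVEE, TESS, and URDU.

The datasets contain recordings labeled by emotion, distributed as follows:

Emotion	Count
Sad	752
Happy	752
Angry	752
Neutral	716
Disgust	652
Fearful	652
Surprised	652
Calm	192

The "calm" class was excluded from training because it has too few samples.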
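
The counts in the table can be tallied to see how much data remains once the "calm" class is dropped. This is a minimal sketch in plain Python. The English label strings are illustrative and may not match the identifiers used in the training code.

```python
# Per-class clip counts copied from the table above.
counts = {
    "sad": 752, "happy": 752, "angry": 752, "neutral": 716,
    "disgust": 652, "fearful": 652, "surprised": 652, "calm": 192,
}

total = sum(counts.values())

# Drop "calm", which was excluded from training.
kept = {label: n for label, n in counts.items() if label != "calm"}
kept_total = sum(kept.values())

print(f"all clips: {total}, after excluding calm: {kept_total}")
for label, n in kept.items():
    # Share of each remaining class, to check how balanced the data is.
    print(f"{label:<10} {n:>4}  {n / kept_total:.1%}")
```

The seven remaining classes are close to balanced, each holding roughly 13% to 15% of the clips.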

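🎤 Preprocessing

Audio loading: audio is loaded with Librosa and converted to a numpy array.
Feature extraction: the Whisper feature extractor standardizes the audio features.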

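🔧 Model

A fine-tuned Whisper Large V3 model is used for audio classification:

Model: openai/whisper-large-v3
Output: emotion label (angry / disgust / fearful / happy / neutral / sad / surprised)

Emotion labels are mapped to numeric IDs for training and evaluation.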
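
The label-to-ID mapping described above can be sketched as a pair of dictionaries. The label strings and their order here are assumptions for illustration. The released checkpoint may use different names or a different ordering.

```python
# Emotion classes in the order listed above. The exact label strings and
# their ID order in the released checkpoint are ASSUMPTIONS.
labels = ["angry", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]

# Forward mapping for encoding targets, reverse mapping for decoding predictions.
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(label2id["happy"])  # 3
print(id2label[6])        # surprised
```

At inference time it is safer to read the mapping from model.config.id2label than to hard-code it.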

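⚙️ Training Parameters

Learning rate: 5e-05
Training batch size: 2
Evaluation batch size: 2
Random seed: 42
Gradient accumulation steps: 5
Effective batch size: 10
Optimizer: Adam (betas=(0.9, 0.999), epsilon=1e-08)
Learning rate scheduler: linear
Warmup ratio: 0.1
Training epochs: 25
Mixed-precision training: Native AMP

Experiments were tracked with Wandb.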
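
Two of the values above follow from the others: the effective batch size is the training batch size times the gradient accumulation steps, and the warmup ratio fixes how many steps the linear scheduler spends ramping up. The sketch below works through both in plain Python. The function `linear_schedule_lr` and the value of `total_steps` are hypothetical. The step count is an estimate of about 394 optimizer steps per epoch over 25 epochs, not a figure stated by the author, and the actual scheduler may round the warmup length differently.

```python
# Hyperparameters from the list above.
learning_rate = 5e-5
train_batch_size = 2
grad_accum_steps = 5
warmup_ratio = 0.1

# Number of samples contributing to each optimizer update.
effective_batch_size = train_batch_size * grad_accum_steps


def linear_schedule_lr(step, total_steps, peak_lr=learning_rate, warmup=warmup_ratio):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# ASSUMPTION: illustrative total, roughly 394 steps per epoch x 25 epochs.
total_steps = 9850

print("effective batch size:", effective_batch_size)
for step in (0, 985, 5000, 9850):
    print(step, linear_schedule_lr(step, total_steps))
```

Under these assumptions the rate climbs from 0 to 5e-05 over the first 985 steps and then falls linearly back to 0 at the final step.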

📊 Evaluation Metrics

Loss: 0.5008
Accuracy: 91.99%
Precision: 92.30%
Recall: 91.99%
F1: 91.98%

These scores indicate that the model recognizes speech emotions effectively.
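
Recall matches accuracy exactly (both 91.99%) while precision differs, which is what support-weighted averaging over classes produces. The source does not state the averaging mode, so treating the figures as weighted metrics is an assumption. The sketch below computes the four metrics that way in plain Python, on made-up labels that are not taken from the model's test set.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = true_pos[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # class share of the evaluation set
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(true_pos.values()) / n
    return accuracy, precision, recall, f1


# Toy labels for illustration only.
y_true = ["happy", "happy", "sad", "sad", "angry", "angry"]
y_pred = ["happy", "sad", "sad", "sad", "angry", "happy"]

accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With weighted averaging, recall always equals accuracy, because each class's recall is scaled by its share of the data and the correct predictions sum to the overall hit count.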

🧪 Training Results

The complete results are in the Wandb logs. An excerpt:

Training Loss	Epoch	Step	Validation Loss	Accuracy	Precision	Recall	F1
0.4948	0.9995	394	0.4911	82.86%	84.49%	82.86%	83.02%
...	...	...	...	...	...	...	...
0.0026	10.9995	4336	0.8334	87.73%	89.49%	87.73%	87.70%

🚀 Usage Guide

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import librosa
import torch
import numpy as np

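# Load the fine-tuned classifier from the Hugging Face Hub.
model_id = "firdhokk/speech-emotion-recognition-with-openai-whisper-large-v3"
model = AutoModelForAudioClassification.from_pretrained(model_id)

# The feature extractor converts raw audio into the inputs Whisper expects.
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, do_normalize=True)
# Maps predicted class IDs back to emotion names.
id2label = model.config.id2label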

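def preprocess_audio(audio_path, feature_extractor, max_duration=30.0):
    """Load an audio file and convert it to fixed-length model inputs."""
    # Load the file and resample it to the rate the feature extractor expects.
    audio_array, sampling_rate = librosa.load(audio_path, sr=feature_extractor.sampling_rate)

    # Truncate or zero-pad so every clip is exactly max_duration seconds long.
    max_length = int(feature_extractor.sampling_rate * max_duration)
    if len(audio_array) > max_length:
        audio_array = audio_array[:max_length]
    else:
        audio_array = np.pad(audio_array, (0, max_length - len(audio_array)))

    # Convert the waveform to model input features as PyTorch tensors.
    inputs = feature_extractor(
        audio_array,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=max_length,
        truncation=True,
        return_tensors="pt",
    )
    return inputs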

def predict_emotion(audio_path, model, feature_extractor, id2label, max_duration=30.0):
    """Return the predicted emotion label for one audio file."""
    inputs = preprocess_audio(audio_path, feature_extractor, max_duration)

    # Run on the GPU when one is available, otherwise on the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model(**inputs)

    # Pick the class with the highest logit and map it to its label.
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=-1).item()
    predicted_label = id2label[predicted_id]

    return predicted_label

# Example path from the author's environment; replace it with your own audio file.
audio_path = "/content/drive/MyDrive/Audio/Speech_URDU/Happy/SM5_F4_H058.wav"
predicted_emotion = predict_emotion(audio_path, model, feature_extractor, id2label)
print(f"Predicted emotion: {predicted_emotion}")
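
predict_emotion returns only the top label. To see how the score is spread across all classes, a softmax can be applied to the logits. The sketch below does this in plain Python with made-up logits and a hypothetical label order. With the model loaded, torch.softmax(logits, dim=-1) should give the same kind of result.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label order and logits, for illustration only.
labels = ["angry", "disgust", "fearful", "happy", "neutral", "sad", "surprised"]
logits = [0.2, -1.3, -0.5, 4.1, 1.0, -0.8, 0.6]

probs = softmax(logits)

# Show the three most likely emotions with their probabilities.
ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
for label, prob in ranked[:3]:
    print(f"{label:<10} {prob:.3f}")
```

Reporting the top few probabilities rather than a single label makes low-confidence predictions easier to spot.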