Open-source speech emotion recognition system, fine-tuned from Wav2Vec2, that recognizes 7 common emotions

Speech Emotion Recognition With Facebook Wav2vec2 Large Xlsr 53

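Developed by firdhokk

A speech emotion recognition system fine-tuned from the Wav2Vec2 Large XLSR-53 model that recognizes 7 common emotions

Audio Classification

Transformers

License: Apache-2.0 #speech-emotion-analysis #multilingual-support #high-accuracy-recognition

Downloads: 66

Released: 9/20/2024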

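Model Overview

The model fine-tunes Wav2Vec2 Large XLSR-53 for speech emotion classification and recognizes 7 emotions: angry, disgust, fear, happy, neutral, sad, and surprise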

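Model Features

High-accuracy emotion recognition

Reaches 91.68% accuracy and a 91.66% F1 score on the test set

Multi-dataset training

Trained on a combination of the RAVDESS, SAVEE, TESS, and URDU datasets

Efficient feature extraction

Uses the Wav2Vec2 feature extractor to process audio and produce normalized input features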

Model Capabilities

Speech emotion recognition

Audio classification

Multi-class emotion classification

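Use Cases

Human-computer interaction

Emotion analysis for automated customer service

Analyzes the emotional state in a customer's voice

Improves customer-service response quality and user experience

Mental health

Emotional state monitoring

Tracks changes in a user's emotions through voice analysis

Supports mental health assessment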

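🎧 Speech Emotion Recognition with Wav2Vec2

This project uses the Wav2Vec2 model for speech emotion recognition. The goal is to classify audio recordings into emotion categories such as happy, sad, and surprise.

🚀 Quick Start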

from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import librosa
import torch
import numpy as np

# Load the fine-tuned classifier and its matching feature extractor from the Hub.
model_id = "firdhokk/speech-emotion-recognition-with-facebook-wav2vec2-large-xlsr-53"
model = AutoModelForAudioClassification.from_pretrained(model_id)

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id, do_normalize=True, return_attention_mask=True)
id2label = model.config.id2label

def preprocess_audio(audio_path, feature_extractor, max_duration=30.0):
    # Load the file and resample it to the rate the feature extractor expects.
    audio_array, _ = librosa.load(audio_path, sr=feature_extractor.sampling_rate)

    # Truncate or zero-pad so every clip is exactly max_duration seconds long.
    max_length = int(feature_extractor.sampling_rate * max_duration)
    if len(audio_array) > max_length:
        audio_array = audio_array[:max_length]
    else:
        audio_array = np.pad(audio_array, (0, max_length - len(audio_array)))

    # Convert the raw waveform into model-ready PyTorch tensors.
    inputs = feature_extractor(
        audio_array,
        sampling_rate=feature_extractor.sampling_rate,
        max_length=max_length,
        truncation=True,
        return_attention_mask=True,
        return_tensors="pt",
    )
    return inputs

def predict_emotion(audio_path, model, feature_extractor, id2label, max_duration=30.0):
    inputs = preprocess_audio(audio_path, feature_extractor, max_duration)

    # Run on the GPU when one is available, otherwise on the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Inference only, so gradient tracking is disabled.
    with torch.no_grad():
        outputs = model(**inputs)

    # The predicted emotion is the class with the highest logit.
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=-1).item()
    predicted_label = id2label[predicted_id]

    return predicted_label

# Replace with the path to your own audio file.
audio_path = "/content/drive/MyDrive/Audio/Speech_URDU/Happy/SM5_F4_H058.wav"

predicted_emotion = predict_emotion(audio_path, model, feature_extractor, id2label)
print(f"Predicted Emotion: {predicted_emotion}")
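The quick-start code returns only the top label. When a ranked list with probabilities is more useful, the logits can be passed through a softmax. The helper below is a minimal sketch of that idea; `rank_emotions`, the three-entry mapping, and the example logits are hypothetical and exist only for illustration.

```python
import math

def rank_emotions(logits, id2label, top_k=3):
    """Turn one row of raw logits into (label, probability) pairs, highest first."""
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by probability, highest first, and keep the top_k.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked[:top_k]]

# Hypothetical mapping and logits, for illustration only.
example_id2label = {0: "angry", 1: "happy", 2: "sad"}
example_ranking = rank_emotions([2.0, 0.5, -1.0], example_id2label, top_k=2)
```

With the quick-start objects, the inputs would presumably be `outputs.logits[0].tolist()` and `model.config.id2label`.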

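✨ Key Features

Uses the Wav2Vec2 model for speech emotion recognition.
Classifies audio into multiple emotion categories.
Trained and evaluated on several public datasets.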

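📚 Detailed Documentation

🗂 Dataset

The training and evaluation data is drawn from several public datasets: RAVDESS, SAVEE, TESS, and URDU.

The data consists of recordings labeled with emotions. The emotions are distributed as follows: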

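Emotion	Count
Sad	752
Happy	752
Angry	752
Neutral	716
Disgust	652
Fear	652
Surprise	652
Calm	192

The classes are roughly but not perfectly balanced: some emotions have more samples than others. The "calm" class was excluded from training because it has too few samples.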
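To make the balance concrete, the counts above can be turned into per-class shares after dropping "calm". This is a small sketch using only the numbers from the table; the English label strings are illustrative renderings, not necessarily the strings used in the model config.

```python
# Sample counts per emotion, copied from the table above.
counts = {
    "sad": 752, "happy": 752, "angry": 752, "neutral": 716,
    "disgust": 652, "fear": 652, "surprise": 652, "calm": 192,
}

# Drop the under-represented "calm" class, as the text describes.
kept = {label: n for label, n in counts.items() if label != "calm"}
total_kept = sum(kept.values())  # 4928 recordings remain

# Share of the remaining data held by each class.
shares = {label: n / total_kept for label, n in kept.items()}

# Ratio between the largest and smallest remaining class.
imbalance_ratio = max(kept.values()) / min(kept.values())

for label, share in shares.items():
    print(f"{label:<9}{share:.1%}")
```

The largest remaining class is about 1.15 times the size of the smallest, so the seven-class problem is only mildly imbalanced.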

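🎤 Preprocessing

Audio loading: Librosa loads each audio file and converts it to a numpy array.
Feature extraction: the Wav2Vec2 feature extractor processes the audio and normalizes the features before they are fed to the model.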
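The normalization step can be sketched on its own. The function below assumes that "normalization" means per-recording zero-mean, unit-variance scaling, which is how the `do_normalize` option of the Wav2Vec2 feature extractor is generally understood to work. The function name, the `eps` value, and the synthetic sine wave are assumptions made for illustration.

```python
import numpy as np

def zero_mean_unit_variance(waveform, eps=1e-7):
    """Scale one waveform to roughly zero mean and unit variance."""
    waveform = np.asarray(waveform, dtype=np.float64)
    # eps keeps the division stable for silent (near-constant) clips.
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)

# Synthetic one-second clip at 16 kHz: a quiet 220 Hz tone with a DC offset.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
example_wave = 0.1 * np.sin(2 * np.pi * 220 * t) + 0.05
normalized = zero_mean_unit_variance(example_wave)
```

After scaling, loud and quiet recordings reach the model on a comparable numeric range, which matters when several datasets with different recording conditions are combined.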

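🔧 Model

The model is Wav2Vec2 Large XLSR-53, fine-tuned for the audio classification task:

Model: facebook/wav2vec2-large-xlsr-53
Output: emotion labels ('angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise')
The emotion labels are mapped to numeric IDs, which are used for training and evaluation.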
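A label-to-ID mapping of this kind can be sketched as two dictionaries. The label strings and their order below are assumptions that follow the order in which the emotions are listed in the text; the authoritative mapping is the one stored in `model.config.id2label`.

```python
# Assumed label strings and order; check model.config.id2label for the real mapping.
labels = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

# Forward mapping used when encoding training targets.
label2id = {label: idx for idx, label in enumerate(labels)}

# Reverse mapping used when decoding model predictions.
id2label = {idx: label for label, idx in label2id.items()}

print(label2id["happy"], id2label[3])
```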

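⚙️ Training

The model was trained with the following parameters:

Learning rate: 5e-05
Training batch size: 2
Evaluation batch size: 2
Random seed: 42
Gradient accumulation steps: 5
Total training batch size: 10 (the effective batch size after gradient accumulation)
Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
Learning rate scheduler: linear
Learning rate scheduler warmup ratio: 0.1
Epochs: 25
Mixed-precision training: native AMP (automatic mixed precision)

These settings aim for efficient and stable training, which matters when working with large datasets and deep models such as Wav2Vec2. Training was tracked and monitored with Wandb.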
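The interaction between the batch settings and the scheduler can be worked through numerically. The sketch below assumes the usual shape of a linear schedule with warmup (a linear ramp up to the base rate, then a linear decay to zero) and takes the total step count from the last row of the results table. It is not the author's training script, and `linear_lr` is a hypothetical helper.

```python
BASE_LR = 5e-5
TOTAL_STEPS = 9850                 # final step in the results table
WARMUP_STEPS = TOTAL_STEPS // 10   # warmup ratio of 0.1
EFFECTIVE_BATCH = 2 * 5            # per-device batch size x accumulation steps

def linear_lr(step):
    """Learning rate at an optimizer step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp from 0 up to the base learning rate.
        return BASE_LR * step / WARMUP_STEPS
    # Decay: fall linearly from the base rate to 0 at the final step.
    remaining = TOTAL_STEPS - step
    return BASE_LR * max(0.0, remaining / (TOTAL_STEPS - WARMUP_STEPS))

for step in (0, WARMUP_STEPS // 2, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at step 985 and each optimizer step sees 10 recordings.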

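📊 Metrics

The model obtained the following evaluation metrics after training:

Loss: 0.4989
Accuracy: 0.9168
Precision: 0.9209
Recall: 0.9168
F1 score: 0.9166

These metrics summarize the model's performance on the speech emotion recognition task. The high accuracy, precision, recall, and F1 score indicate that the model identifies emotional states from speech effectively.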
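Recall equals accuracy in the reported figures, which is what support-weighted averaging across classes would produce. Assuming that averaging scheme (the source does not state it), the four metrics can be computed as in the sketch below; the function name and the toy labels are illustrative only.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per label
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for label, actual in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / actual
        f_l = 2 * p_l * r_l / (p_l + r_l) if (p_l + r_l) else 0.0
        # Weight each class by its share of the true labels.
        weight = actual / n
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration; these are not model outputs.
toy_true = ["happy", "happy", "sad", "sad", "angry", "angry"]
toy_pred = ["happy", "sad", "sad", "sad", "angry", "happy"]
toy_scores = weighted_metrics(toy_true, toy_pred)
```

With this weighting, recall always reduces to the fraction of correctly classified samples, so it matches accuracy on any input.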

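🧪 Results

After training, the model was evaluated on the test dataset, and the results were tracked with Wandb.

Training Loss	Epoch	Step	Validation Loss	Accuracy	Precision	Recall	F1 Score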
1.9343	0.9995	394	1.9277	0.2505	0.1425	0.2505	0.1691
1.7944	1.9990	788	1.6446	0.4574	0.5759	0.4574	0.4213
1.4601	2.9985	1182	1.3242	0.5953	0.6183	0.5953	0.5709
1.0551	3.9980	1576	1.0764	0.6623	0.6659	0.6623	0.6447
0.8934	5.0	1971	0.9209	0.7059	0.7172	0.7059	0.6825
1.1156	5.9995	2365	0.8292	0.7465	0.7635	0.7465	0.7442
0.6307	6.9990	2759	0.6439	0.8043	0.8090	0.8043	0.8020
0.774	7.9985	3153	0.6666	0.7921	0.8117	0.7921	0.7916
0.5537	8.9980	3547	0.5111	0.8245	0.8268	0.8245	0.8205
0.3762	10.0	3942	0.5506	0.8306	0.8390	0.8306	0.8296
0.716	10.9995	4336	0.5499	0.8276	0.8465	0.8276	0.8268
0.5372	11.9990	4730	0.5463	0.8377	0.8606	0.8377	0.8404
0.3746	12.9985	5124	0.4758	0.8611	0.8714	0.8611	0.8597
0.4317	13.9980	5518	0.4438	0.8742	0.8843	0.8742	0.8756
0.2104	15.0	5913	0.4426	0.8803	0.8864	0.8803	0.8806
0.3193	15.9995	6307	0.4741	0.8671	0.8751	0.8671	0.8683
0.3445	16.9990	6701	0.3850	0.9037	0.9047	0.9037	0.9038
0.2777	17.9985	7095	0.4802	0.8834	0.8923	0.8834	0.8836
0.4406	18.9980	7489	0.4053	0.9047	0.9096	0.9047	0.9043
0.1707	20.0	7884	0.4434	0.9067	0.9129	0.9067	0.9069
0.2138	20.9995	8278	0.5051	0.9037	0.9155	0.9037	0.9053
0.1812	21.9990	8672	0.4238	0.8955	0.9007	0.8955	0.8953
0.3639	22.9985	9066	0.4021	0.9138	0.9182	0.9138	0.9143
0.3193	23.9980	9460	0.4989	0.9168	0.9209	0.9168	0.9166
0.2067	24.9873	9850	0.4959	0.8976	0.9032	0.8976	0.8975
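The table shows that the best checkpoint depends on the selection criterion. The sketch below copies the last nine rows of the table (step, validation loss, accuracy, F1) and picks the best row by accuracy and by validation loss; the `EvalRow` type is a hypothetical container used only for this example.

```python
from typing import NamedTuple

class EvalRow(NamedTuple):
    step: int
    val_loss: float
    accuracy: float
    f1: float

# Last nine evaluation rows, copied from the results table.
rows = [
    EvalRow(6701, 0.3850, 0.9037, 0.9038),
    EvalRow(7095, 0.4802, 0.8834, 0.8836),
    EvalRow(7489, 0.4053, 0.9047, 0.9043),
    EvalRow(7884, 0.4434, 0.9067, 0.9069),
    EvalRow(8278, 0.5051, 0.9037, 0.9053),
    EvalRow(8672, 0.4238, 0.8955, 0.8953),
    EvalRow(9066, 0.4021, 0.9138, 0.9143),
    EvalRow(9460, 0.4989, 0.9168, 0.9166),
    EvalRow(9850, 0.4959, 0.8976, 0.8975),
]

best_by_accuracy = max(rows, key=lambda r: r.accuracy)
best_by_loss = min(rows, key=lambda r: r.val_loss)

print("Highest accuracy:", best_by_accuracy)
print("Lowest validation loss:", best_by_loss)
```

The highest-accuracy row (step 9460) carries the same loss, accuracy, and F1 as the headline metrics, while the lowest validation loss occurs earlier, at step 6701.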