hubert-large-turkish-speech-emotion-recognition open-source model: accurately identifies four emotions in Turkish speech

Hubert Large Turkish Speech Emotion Recognition

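Developed by SeaBenSea

A Turkish speech emotion recognition model based on the HuBERT architecture. It was trained on the TurEV-DB dataset and distinguishes four emotions: angry, calm, happy, and sad.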

Audio classification

Transformers

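License: Apache-2.0 #Turkish speech emotion recognition #HuBERT large model #High-accuracy emotion classification

Downloads: 95

Release date: 6/25/2024

Model Overview

This model uses the HuBERT architecture for speech emotion recognition in Turkish. Its main function is to classify the emotion of a Turkish speech input into one of four basic emotions.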

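Model Features

High-accuracy emotion recognition

Achieves 95% overall accuracy on the TurEV-DB dataset, with an F1 score of 0.98 for the angry class.

Turkish-specific

An emotion recognition model optimized specifically for Turkish speech.

Multi-emotion classification

Distinguishes four basic emotions: angry, calm, happy, and sad.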
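
For readers unfamiliar with how an overall accuracy and a per-class F1 score like the ones quoted above are derived, here is a minimal sketch in pure Python. The label lists are invented toy data, not TurEV-DB results, and the helper names `accuracy` and `f1_for_class` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only.
y_true = ["Angry", "Angry", "Calm", "Happy", "Sad", "Sad"]
y_pred = ["Angry", "Angry", "Calm", "Sad", "Sad", "Sad"]

print(f"accuracy  = {accuracy(y_true, y_pred):.3f}")
print(f"F1(Angry) = {f1_for_class(y_true, y_pred, 'Angry'):.3f}")
print(f"F1(Sad)   = {f1_for_class(y_true, y_pred, 'Sad'):.3f}")
```

Reporting F1 per class alongside overall accuracy shows whether a high accuracy hides a weak class.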

Model Capabilities

Turkish speech emotion recognition

Speech emotion classification

Audio signal processing

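Use Cases

Emotion analysis

Customer service call analysis

Analyzes the customer's emotional state during customer service calls.

Detects customer anger, which can help improve service quality.

Mental health monitoring

Tracks a user's emotional state through speech analysis.

Can support early identification of mental health conditions such as depression.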
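
The customer-service use case above could be sketched roughly as follows: split a call into fixed-length windows, score each window with the model, and flag the windows where the angry probability is high. This is only an illustration under stated assumptions. The function name `flag_angry_segments`, the 5-second window, the 0.8 threshold, and the per-window scores are all hypothetical, and the scores here are hand-written rather than produced by the model.

```python
def flag_angry_segments(window_scores, window_seconds=5.0, threshold=0.8):
    """Return (start, end) spans in seconds whose 'Angry' probability meets the threshold.

    window_scores: one dict per consecutive fixed-length window of the call,
    mapping emotion label -> probability.
    """
    flagged = []
    for index, scores in enumerate(window_scores):
        if scores.get("Angry", 0.0) >= threshold:
            start = index * window_seconds
            flagged.append((start, start + window_seconds))
    return flagged


# Hand-written example scores for three consecutive 5-second windows.
call = [
    {"Angry": 0.05, "Calm": 0.90, "Happy": 0.03, "Sad": 0.02},
    {"Angry": 0.85, "Calm": 0.10, "Happy": 0.02, "Sad": 0.03},
    {"Angry": 0.92, "Calm": 0.04, "Happy": 0.01, "Sad": 0.03},
]

print(flag_angry_segments(call))
```

In practice the threshold and window length would need tuning on real calls; the values here are placeholders.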

🚀 HuBERT for Emotion Recognition in Turkish Speech

This HuBERT model was trained on TurEV-DB for Turkish speech emotion recognition (SER).

🚀 Quick Start

📦 Installation

Install the required packages

# Required packages
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
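
After installation, it can be useful to confirm that the packages are visible to the current Python environment before going further. The following is a small sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the check only confirms that each module can be located, not that its version is compatible.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


required = ["datasets", "transformers", "torchaudio", "librosa"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```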

Clone the repository

!git clone https://github.com/SeaBenSea/HuBERT-SER.git

💻 Usage Example

Basic usage

import sys

# Make the cloned HuBERT-SER repository importable (it provides src.models).
sys.path.insert(1, './HuBERT-SER/')

import torch
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor
from src.models import HubertForSpeechClassification

model_name_or_path = "SeaBenSea/hubert-large-turkish-speech-emotion-recognition"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate

model = HubertForSpeechClassification.from_pretrained(model_name_or_path).to(device)
model.eval()  # make sure dropout is disabled for inference


def speech_file_to_array_fn(path, sampling_rate):
    """Load an audio file and return a mono waveform at the target sampling rate."""
    speech_array, _sampling_rate = torchaudio.load(path)
    # Average the channels so that stereo files also yield a 1-D waveform.
    speech_array = speech_array.mean(dim=0)
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array).numpy()
    return speech


def predict(path, sampling_rate):
    """Return the score of every emotion label for a single audio file."""
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}

    with torch.no_grad():
        logits = model(**inputs).logits

    # Convert logits to probabilities and format them as percentages.
    scores = F.softmax(logits, dim=1).cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{score * 100:.1f}%"}
               for i, score in enumerate(scores)]
    return outputs


path = "../dataset/TurEV/Angry/1157_kz_acik.wav"
outputs = predict(path, sampling_rate)
outputs

[
  {'Emotion': 'Angry', 'Score': '99.8%'},
  {'Emotion': 'Calm', 'Score': '0.0%'},
  {'Emotion': 'Happy', 'Score': '0.1%'},
  {'Emotion': 'Sad', 'Score': '0.1%'}
]
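
To make the last step of `predict` more concrete, the sketch below shows in pure Python how raw logits become percentage scores through a softmax, and how the top emotion can be read back from a result list shaped like the one above. The logits are invented, the `id2label` order is an assumption that mirrors the output shown above, and `softmax` and `top_emotion` are hypothetical helper names rather than part of the HuBERT-SER repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_emotion(outputs):
    """Pick the entry with the highest score from a predict()-style result list."""
    return max(outputs, key=lambda item: float(item["Score"].rstrip("%")))


# Assumed label order, mirroring the output shown above.
id2label = {0: "Angry", 1: "Calm", 2: "Happy", 3: "Sad"}

# Invented logits for illustration only.
logits = [6.0, -1.0, 0.5, 0.2]
probs = softmax(logits)

outputs = [
    {"Emotion": id2label[i], "Score": f"{p * 100:.1f}%"}
    for i, p in enumerate(probs)
]
print(top_emotion(outputs)["Emotion"])
```

Parsing the formatted string is only needed because `predict` returns percentages as text; keeping the numeric probabilities alongside would avoid the round trip.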