wav2vec2-xls-r-300m-emotion-ru open-source model: deploy for free to recognize emotions in Russian speech


Wav2vec2 Xls R 300m Emotion Ru

Developed by KELONMYOSA

A Russian speech emotion recognition model fine-tuned from facebook/wav2vec2-xls-r-300m. It identifies emotions such as neutral, positive, angry, and sad.

Audio classification

Transformers

Other. Open-source license: Apache-2.0 #Russian speech emotion recognition #High-accuracy classification #Virtual assistant interaction analysis

Downloads: 61

Release date: 5/25/2023

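Model overview

This model performs speech emotion recognition (SER). It is optimized for Russian speech and classifies audio into five emotional states.

Model features

Multi-emotion recognition

Classifies speech into five classes: neutral, positive, angry, sad, and other

Optimized for Russian

Fine-tuned specifically on Russian speech data

High accuracy

Reaches 90.14% accuracy on the validation dataset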

Model capabilities

Speech emotion classification

Russian speech analysis

Real-time emotion recognition

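Use cases

Virtual assistants

Emotion-aware dialogue systems

Adjust the virtual assistant's response strategy based on the emotion in the user's voice

Improve the user experience and make interactions feel more natural

Customer support analytics

Customer emotion monitoring

Automatically analyze how customer emotion changes during a support call

Identify calls with a high risk of customer anger and raise an alert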
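As a rough illustration of the anger-alert use case, the sketch below flags a call when several of its segments score high for the `angry` label. It assumes each segment has already been classified and that results use the list-of-dicts format shown in the usage examples. The helper names, the 0.8 score threshold, and the two-segment minimum are hypothetical choices for illustration, not part of the model.

```python
from typing import Dict, List, Union

Prediction = List[Dict[str, Union[str, float]]]


def angry_score(result: Prediction) -> float:
    """Return the score of the 'angry' label, or 0.0 if it is missing."""
    for item in result:
        if item["label"] == "angry":
            return float(item["score"])
    return 0.0


def is_high_anger_call(
    segment_results: List[Prediction],
    score_threshold: float = 0.8,  # hypothetical threshold, tune on real data
    min_segments: int = 2,         # hypothetical minimum number of angry segments
) -> bool:
    """Flag a call when enough segments exceed the anger threshold."""
    hits = sum(1 for result in segment_results if angry_score(result) >= score_threshold)
    return hits >= min_segments


# Made-up per-segment predictions for one call (not real model output).
call = [
    [{"label": "neutral", "score": 0.90}, {"label": "angry", "score": 0.05}],
    [{"label": "neutral", "score": 0.10}, {"label": "angry", "score": 0.85}],
    [{"label": "neutral", "score": 0.03}, {"label": "angry", "score": 0.95}],
]
print(is_high_anger_call(call))
```

Requiring more than one high-scoring segment is one simple way to avoid alerting on a single misclassified segment. A production system would likely need thresholds calibrated on labeled calls.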

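🚀 Speech Emotion Recognition

This model is a version of facebook/wav2vec2-xls-r-300m fine-tuned for speech emotion recognition (SER).

The pretrained model was fine-tuned on the DUSHA dataset. It consists of about 125,000 Russian audio recordings covering four basic emotions that typically occur in dialogue with a virtual assistant: happiness (positive), sadness, anger, and neutral.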

emotions = ['neutral', 'positive', 'angry', 'sad', 'other']

🚀 Quick Start

💻 Usage Examples

Basic usage

from transformers import pipeline

pipe = pipeline(model="KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru", trust_remote_code=True)

# The pipeline input can be a file, path or link
result = pipe("speech.wav")
print(result)

[{'label': 'neutral', 'score': 0.00318}, {'label': 'positive', 'score': 0.00376}, {'label': 'sad', 'score': 0.00145}, {'label': 'angry', 'score': 0.98984}, {'label': 'other', 'score': 0.00176}]
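The printed list holds one entry per class, so most applications will want only the highest-scoring one. The sketch below shows one way to pick it. `top_emotion` is a hypothetical helper name, and `sample` is the output above copied in so the snippet runs without downloading the model.

```python
from typing import Dict, List, Tuple, Union

# Output copied from the basic usage example above.
sample = [
    {"label": "neutral", "score": 0.00318},
    {"label": "positive", "score": 0.00376},
    {"label": "sad", "score": 0.00145},
    {"label": "angry", "score": 0.98984},
    {"label": "other", "score": 0.00176},
]


def top_emotion(result: List[Dict[str, Union[str, float]]]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    best = max(result, key=lambda item: item["score"])
    return best["label"], best["score"]


label, score = top_emotion(sample)
print(label, score)
```

With a live pipeline, the same helper could be applied directly to the value returned by `pipe("speech.wav")`.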

Advanced usage

import librosa
import torch
import torch.nn.functional as F
from transformers import AutoConfig, Wav2Vec2Processor, AutoModelForAudioClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name_or_path = "KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru"
config = AutoConfig.from_pretrained(model_name_or_path)
processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
sampling_rate = processor.feature_extractor.sampling_rate
model = AutoModelForAudioClassification.from_pretrained(model_name_or_path, trust_remote_code=True).to(device)


def predict(path):
    speech, sr = librosa.load(path, sr=sampling_rate)
    features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    # Cast NumPy floats to Python floats so the scores print and serialize cleanly
    outputs = [{"label": config.id2label[i], "score": round(float(score), 5)} for i, score in
               enumerate(scores)]
    return outputs


print(predict("speech.wav"))

[{'label': 'neutral', 'score': 0.00318}, {'label': 'positive', 'score': 0.00376}, {'label': 'sad', 'score': 0.00145}, {'label': 'angry', 'score': 0.98984}, {'label': 'other', 'score': 0.00176}]
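The `F.softmax` call in `predict` is what turns the model's raw logits into scores that sum to 1. The dependency-free sketch below reproduces that step in plain Python to show the arithmetic. The logit values are invented for illustration and are not output from the model.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for the five classes (made up, not from the model).
example_logits = [0.2, 0.4, -0.5, 5.9, -0.3]
probs = softmax(example_logits)
print([round(p, 5) for p in probs])
```

The class with the largest logit always receives the largest probability, so softmax changes the scale of the scores but not their ranking.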

📚 Documentation

Evaluation

The model achieves the following results.

Training loss: 0.528700
Validation loss: 0.349617
Accuracy: 0.901369

Emotion	Precision	Recall	F1-score	Support
neutral	0.92	0.94	0.93	15886
positive	0.85	0.79	0.82	2481
sad	0.77	0.82	0.79	2506
angry	0.89	0.83	0.86	3072
other	0.99	0.74	0.85	226
accuracy			0.90	24171
macro avg	0.89	0.82	0.85	24171
weighted avg	0.90	0.90	0.90	24171
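To make the two average rows concrete, the sketch below recomputes them from the per-class rows of the table. The macro average weights every class equally, while the weighted average weights each class by its support. Because the per-class inputs are already rounded to two decimals, the recomputed values can differ from the published rows by up to about 0.01. The function names are illustrative only.

```python
# Per-class (precision, recall, F1, support) copied from the evaluation table.
per_class = {
    "neutral": (0.92, 0.94, 0.93, 15886),
    "positive": (0.85, 0.79, 0.82, 2481),
    "sad": (0.77, 0.82, 0.79, 2506),
    "angry": (0.89, 0.83, 0.86, 3072),
    "other": (0.99, 0.74, 0.85, 226),
}

total_support = sum(row[3] for row in per_class.values())


def macro_average(column: int) -> float:
    """Unweighted mean of one metric column across classes."""
    return sum(row[column] for row in per_class.values()) / len(per_class)


def weighted_average(column: int) -> float:
    """Mean of one metric column weighted by each class's support."""
    return sum(row[column] * row[3] for row in per_class.values()) / total_support


print("total support:", total_support)
for name, column in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: macro={macro_average(column):.3f} weighted={weighted_average(column):.3f}")
```

The gap between the macro and weighted recall reflects the class imbalance: the large neutral class has the highest recall, so it pulls the weighted figure up.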