Open-Source Multimodal Hate Speech Detection Model - Free to Deploy for Identifying Hate Speech in Tamil and Other Languages


Multimodal Hate Speech Detection In Dravidian Languages

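Developed by vasantharan

A multimodal hate speech detection model for Tamil, Malayalam, and Telugu that classifies text and audio input.

Multimodal Fusion | Other | #Multimodal hate detection #South Asian language support #Text-audio dual-modal

Downloads: 95

Release date: 2/11/2025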

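Model Overview

This model is a deep learning system for detecting hate speech in three Dravidian languages of South India: Tamil, Malayalam, and Telugu. It takes a multimodal approach, accepting both text and audio input and returning a classification for each.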

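Model Features

Multimodal support

Supports both text and audio input for more comprehensive hate speech detection

Multilingual support

Trained specifically for three major Dravidian languages of South India (Tamil, Malayalam, Telugu)

Classification performance

Macro F1 of 0.6438 for text classification and 0.88 for audio classification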

Model Capabilities

Text classification

Audio classification

Multilingual processing

Hate speech detection

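Use Cases

Content moderation

Social media content moderation

Automatically detect hate speech in Dravidian languages on social media platforms

Identifies hateful content in both text and audio

Security monitoring

Online community safety

Monitor online forums and communities for hate speech

Helps maintain a healthy environment for online interaction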

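🚀 Multimodal Classification Models (Tamil, Malayalam, Telugu)

This repository contains deep learning models for text and audio classification in three languages: Tamil, Malayalam, and Telugu.

🚀 Quick Start

The models take text and audio input and classify it into predefined categories. Each language has its own trained models and label encoders.

Text model: uses xlm-roberta-large for feature extraction, followed by a deep learning classifier.
Audio model: uses MFCC feature extraction with a CNN-based classifier.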

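✨ Key Features

Supported languages: Tamil, Malayalam, Telugu
Modalities: text classification and audio classification
Model metrics:
- Macro F1 for text classification: 0.6438
- Macro F1 for audio classification: 0.88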
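Macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal sketch of that calculation in plain Python. The function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the repository or its evaluation data.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels invented for illustration (not the model's real classes).
y_true = ["hate", "hate", "none", "none", "none", "none"]
y_pred = ["hate", "none", "none", "none", "none", "hate"]
print(round(macro_f1(y_true, y_pred), 4))  # per-class F1: 0.5 and 0.75 -> 0.625
```

In this toy case accuracy is 4/6, but macro F1 is lower because the minority class is only half right.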

📦 Installation

1. Clone the repository

git clone https://huggingface.co/vasantharan/Multimodal_Hate_Speech_Detection_in_Dravidian_languages
cd Multimodal_Hate_Speech_Detection_in_Dravidian_languages

2. Install the dependencies

Make sure Python is installed, then run the following command.

pip install -r requirements.txt
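After installing, it can help to confirm that the packages used in the examples are importable. This is a rough sketch: `requirements.txt` is not reproduced on this page, so the module list below is an assumption taken from the import statements in the usage examples, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# ASSUMPTION: taken from the imports in the usage examples, not from requirements.txt.
REQUIRED_MODULES = [
    "tensorflow", "torch", "transformers", "numpy",
    "librosa", "indicnlp", "advertools", "pandas",
]

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```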

💻 Usage Examples

Basic Usage

Loading the models

import tensorflow as tf
import pickle
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

# Load the label encoders
with open("text_label_encoders/tamil_label_encoder.pkl", "rb") as f:
    tamil_text_label_encoder = pickle.load(f)

with open("audio_label_encoders/tamil_audio_label_encoder.pkl", "rb") as f:
    tamil_audio_label_encoder = pickle.load(f)

# Load the models
text_model = tf.keras.models.load_model("text_models/tamil_text_model.h5")
audio_model = tf.keras.models.load_model("audio_models/tamil_audio_model.keras")

Text classification

from indicnlp.tokenize import indic_tokenize
import advertools as adv

# Tamil stopwords shipped with advertools; a set gives fast membership tests
stopwords = set(adv.stopwords["tamil"])

def preprocess_tamil_text(text):
    # Tokenize the Tamil text, drop stopwords, and rejoin with single spaces
    tokens = list(indic_tokenize.trivial_tokenize(text, lang="ta"))
    tokens = [token for token in tokens if token not in stopwords]
    return " ".join(tokens)

def extract_embeddings(model_name, texts):
    # Note: the tokenizer and encoder are reloaded on every call
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    embeddings = []
    batch_size = 16
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i + batch_size]
            encoded_inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
            outputs = model(**encoded_inputs)
            # Mean-pool the last hidden states over the token axis (padding positions included)
            batch_embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
            embeddings.extend(batch_embeddings)
    return np.array(embeddings)

feature_extractor = "xlm-roberta-large"
text = "உங்கள் உதவி மிகவும் பயனுள்ளதாக இருந்தது"
processed_text = preprocess_tamil_text(text)
text_embeddings = extract_embeddings(feature_extractor, [processed_text])

text_predictions = text_model.predict(text_embeddings)
predicted_label = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))
print("Predicted Label:", predicted_label[0])
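The last lines of the text example turn the classifier's per-class scores into a label by taking the argmax and passing it to `inverse_transform`. If the pickled encoders are scikit-learn `LabelEncoder` objects (an assumption; the source does not say), that call amounts to indexing into the encoder's `classes_` array. The sketch below shows the same step with plain lists and also returns the winning score. The class names, the scores, and `decode_predictions` are hypothetical.

```python
def decode_predictions(probabilities, classes):
    """Return (label, score) for each row of per-class scores."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        results.append((classes[best], row[best]))
    return results

# Hypothetical class names and scores, for illustration only.
classes = ["class_a", "class_b", "class_c"]
probabilities = [[0.10, 0.70, 0.20], [0.55, 0.25, 0.20]]
for label, score in decode_predictions(probabilities, classes):
    print(f"{label}: {score:.2f}")
```

Keeping the score next to the label makes it possible to route low-confidence predictions to human review instead of acting on them automatically.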

Audio classification

import librosa

def extract_audio_features(file_path, sr=22050, n_mfcc=40):
    # Load the audio resampled to `sr`, then average the MFCCs over time
    audio, _ = librosa.load(file_path, sr=sr)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfccs.T, axis=0)  # shape: (n_mfcc,)

def predict_audio(file_path):
    features = extract_audio_features(file_path)
    # Add batch and trailing axes; 40 matches the default n_mfcc above
    reshaped_features = features.reshape((1, 40, 1, 1))
    predicted_class = np.argmax(audio_model.predict(reshaped_features), axis=1)
    predicted_label = tamil_audio_label_encoder.inverse_transform(predicted_class)
    return predicted_label[0]

audio_file = "test_audio.wav"
predicted_audio_label = predict_audio(audio_file)
print("Predicted Audio Label:", predicted_audio_label)
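The audio pipeline reduces a variable-length clip to a fixed-size input: the MFCC matrix has one row per coefficient and one column per frame, `np.mean(mfccs.T, axis=0)` averages each row over the frames, and the result is reshaped to `(1, 40, 1, 1)`. The sketch below reproduces those two steps with nested lists on a toy 3 x 4 matrix. The helper names and the values are illustrative assumptions, not repository code.

```python
def mean_over_frames(mfccs):
    """Average an (n_mfcc x n_frames) matrix over frames -> n_mfcc values."""
    return [sum(row) / len(row) for row in mfccs]

def to_model_input(features):
    """Nest a flat feature vector into shape (1, n_mfcc, 1, 1)."""
    return [[[[value]] for value in features]]

# Toy matrix: 3 coefficients x 4 frames (the example above uses n_mfcc=40).
mfccs = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 0.0, 2.0, 2.0],
    [-1.0, 1.0, -1.0, 1.0],
]
features = mean_over_frames(mfccs)
batch = to_model_input(features)
print(features)  # one mean per coefficient
print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
```

Averaging over time discards temporal order, which keeps the input size constant regardless of clip length.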

Advanced Usage

Batch processing a dataset

import os
import pandas as pd

def load_dataset(base_dir='../test', lang='tamil'):
    dataset = []
    lang_dir = os.path.join(base_dir, lang)
    audio_dir = os.path.join(lang_dir, "audio")
    text_dir = os.path.join(lang_dir, "text")

    # Use the first .xlsx file found in the text directory
    text_file = os.path.join(text_dir, [file for file in os.listdir(text_dir) if file.endswith(".xlsx")][0])
    text_df = pd.read_excel(text_file)

    # List the audio directory once instead of once per row
    audio_files = set(os.listdir(audio_dir))

    for file in text_df["File Name"]:
        transcript_row = text_df.loc[text_df["File Name"] == file]
        transcript = transcript_row.iloc[0]["Transcript"] if not transcript_row.empty else ""
        wav_name = file + ".wav"
        # "Nil" marks rows that have no matching audio file
        audio_path = os.path.join(audio_dir, wav_name) if wav_name in audio_files else "Nil"
        dataset.append({"File Name": audio_path, "Transcript": transcript})

    return pd.DataFrame(dataset)

dataset_df = load_dataset()

dataset_df["Transcript"] = dataset_df["Transcript"].apply(preprocess_tamil_text)
text_embeddings = extract_embeddings(feature_extractor, dataset_df["Transcript"].tolist())
text_predictions = text_model.predict(text_embeddings)
text_labels = tamil_text_label_encoder.inverse_transform(np.argmax(text_predictions, axis=1))

dataset_df["Predicted Text Label"] = text_labels
dataset_df["Predicted Audio Label"] = dataset_df["File Name"].apply(lambda x: predict_audio(x) if x != "Nil" else "No Audio")
dataset_df.to_csv("predictions.tsv", sep="\t", index=False)
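The batch script writes `predictions.tsv` with the columns `File Name`, `Transcript`, `Predicted Text Label`, and `Predicted Audio Label`. As a quick sanity check, the label distribution per modality can be counted with the standard library. This is a sketch: `summarize_predictions` is a hypothetical helper, and the sample rows written below are invented so the example runs on its own.

```python
import csv
import os
import tempfile
from collections import Counter

def summarize_predictions(path):
    """Count predicted labels per modality in a predictions TSV."""
    text_counts, audio_counts = Counter(), Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            text_counts[row["Predicted Text Label"]] += 1
            audio_counts[row["Predicted Audio Label"]] += 1
    return text_counts, audio_counts

# Invented sample rows; real label names come from the label encoders.
rows = [
    ["File Name", "Transcript", "Predicted Text Label", "Predicted Audio Label"],
    ["a.wav", "t1", "label_x", "label_x"],
    ["Nil", "t2", "label_y", "No Audio"],
]
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "predictions.tsv")
    with open(sample, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
    text_counts, audio_counts = summarize_predictions(sample)

print(dict(text_counts))
print(dict(audio_counts))
```

A large `No Audio` count usually means the names in the spreadsheet do not match the `.wav` files on disk.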

Deploying to Hugging Face

pip install huggingface_hub
huggingface-cli login

from huggingface_hub import upload_file

upload_file(path_or_fileobj="text_models/tamil_text_model.h5", path_in_repo="text_models/tamil_text_model.h5", repo_id="<your-hf-repo>")

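📚 Documentation

Directory Structure

├── audio_label_encoders/       # Label encoders for the audio models
├── audio_models/               # Trained audio classification models
├── text_label_encoders/        # Label encoders for the text models
└── text_models/                # Trained text classification models

Each folder contains three files, one each for Tamil, Malayalam, and Telugu.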
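Only the Tamil file names appear in the loading example, so switching languages requires knowing the other file names. The sketch below builds the four paths for a given language, assuming the Malayalam and Telugu files follow the same naming pattern as the Tamil ones. That pattern is an assumption and should be checked against the actual repository listing. `artifact_paths` is a hypothetical helper.

```python
from pathlib import Path

LANGUAGES = ("tamil", "malayalam", "telugu")

def artifact_paths(lang, root="."):
    """Build the model and label-encoder paths for one language.

    ASSUMPTION: the other languages mirror the Tamil file names.
    """
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language: {lang!r}")
    root = Path(root)
    return {
        "text_model": root / "text_models" / f"{lang}_text_model.h5",
        "audio_model": root / "audio_models" / f"{lang}_audio_model.keras",
        "text_label_encoder": root / "text_label_encoders" / f"{lang}_label_encoder.pkl",
        "audio_label_encoder": root / "audio_label_encoders" / f"{lang}_audio_label_encoder.pkl",
    }

for name, path in artifact_paths("tamil").items():
    print(name, path.as_posix())
```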