xlm-roberta-Twitter-spam-classification open-source model: high-accuracy identification of spam content on the Twitter/X platform

Xlm Roberta Twitter Spam Classification

Developed by cja5553

A spam classification model for the Twitter/X platform, fine-tuned from xlm-roberta-large, that identifies whether a tweet is spam

Text Classification

Transformers

Language: English

Open-source license: MIT

Tags: #X platform spam detection #Multilingual tweet classification #High-accuracy text filtering

Downloads: 20

Release date: 11/9/2024

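Model Overview

This model classifies tweets on the X platform (formerly Twitter) as either "spam" or "quality" content. It was fine-tuned on the UtkMl Twitter spam detection dataset.

Model Features

High accuracy

Achieves a 97.4% F1 score on the test dataset

Multilingual support

Based on the xlm-roberta-large architecture, so it has the potential to handle multiple languages

Batch inference

Supports processing tweets in batches, which makes more efficient use of the GPU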
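
To make the idea of batch processing concrete, here is a minimal sketch of splitting a list of tweets into fixed-size batches. The helper name `iter_batches` and the sample tweets are hypothetical and used only for illustration. The default of 24 mirrors the `batch_size` default in the usage example on this page and would normally be tuned to the available GPU memory.

```python
from typing import Iterator, List


def iter_batches(texts: List[str], batch_size: int = 24) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Hypothetical sample data: 50 placeholder tweets.
tweets = [f"tweet {i}" for i in range(50)]
batch_sizes = [len(batch) for batch in iter_batches(tweets)]
print(batch_sizes)  # [24, 24, 2]
```

Each batch would then be tokenized and passed through the model in a single forward pass, so a larger batch size means fewer passes at the cost of more memory.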

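Model Capabilities

Text classification

Spam content identification

Social media content analysis

Use Cases

Content moderation

Automatic spam tweet filtering

Automatically identifies and filters spam content on social media platforms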

Reported test-set score: 97.4%

Data analysis

Social media content quality analysis

Analyzes the quality distribution of tweet content
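
As one possible reading of "quality distribution", the sketch below computes the share of each predicted label in a list of predictions. The function name `label_distribution` and the sample predictions are hypothetical. The "Quality"/"Spam" label strings are the ones used by the usage example on this page.

```python
from collections import Counter
from typing import Dict, List


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Return the share of each label as a fraction of all predictions."""
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical predictions for four tweets.
predicted = ["Quality", "Spam", "Quality", "Quality"]
distribution = label_distribution(predicted)
print(distribution)  # {'Quality': 0.75, 'Spam': 0.25}
```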

🚀 Tweet Spam Detection

This model classifies tweets from X (formerly Twitter) as "spam" (1) or "quality" (0).
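
The following minimal sketch shows how a pair of classifier logits could be turned into one of these two labels, together with a softmax confidence. It assumes index 0 is "quality" and index 1 is "spam", as stated above. The function name `label_from_logits` and the example logits are hypothetical; real logits come from the model.

```python
import math
from typing import List, Tuple

# Label ids as stated on this page: 0 = quality, 1 = spam.
ID2LABEL = {0: "Quality", 1: "Spam"}


def label_from_logits(logits: List[float]) -> Tuple[str, float]:
    """Return the argmax label and its softmax probability."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits for a single tweet.
label, confidence = label_from_logits([-1.2, 2.3])
print(label, round(confidence, 3))
```

Keeping the probability alongside the label makes it possible to apply a stricter threshold than a plain argmax, for example when false positives are costly.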

✨ Key Features

Classifies tweets from X (formerly Twitter) as spam or quality.
Reaches an accuracy of 0.974555 on the test set.
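
For readers who want to compute a comparable figure on their own labelled tweets, here is a minimal sketch of accuracy and of the F1 score for the spam class. It does not reproduce the 0.974555 reported above. The function name `accuracy_and_f1` and the toy labels are hypothetical.

```python
from typing import List, Tuple


def accuracy_and_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Compute accuracy and the F1 score of the positive class (1 = spam)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # quality flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical toy labels: two spam tweets, one of which is missed.
y_true = [1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(round(accuracy, 3), round(f1, 3))  # 0.875 0.667
```

The two metrics diverge when the classes are imbalanced, as in the toy data, so it helps to know which one a reported score refers to.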

💻 Usage Example

Basic usage

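import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm_notebook  # Jupyter progress bar; in a plain script, tqdm.auto.tqdm can be used instead
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def classify_texts(df, text_col, model_path="cja5553/xlm-roberta-Twitter-spam-classification", batch_size=24):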
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.
    
    text_col : str
        Name of the column in `df` that contains the text data to be classified.
    
    model_path : str, default="cja5553/xlm-roberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.
    
    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing the predicted labels ("Quality" or "Spam") for each text.

    '''
    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
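    # Use the GPU when one is available, otherwise fall back to the CPU
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)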
    model.eval()  # Set model to evaluation mode
    
    # Prepare the text data for classification
    df["text"] = df[text_col].astype(str)  # Ensure text is in string format

    # Convert the data to a Hugging Face Dataset and tokenize
    text_dataset = Dataset.from_pandas(df)
    
    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )
    
    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
    
    # DataLoader for the text data
    text_loader = DataLoader(text_dataset, batch_size=batch_size)
    
    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm_notebook(text_loader):
            # Move the batch to whichever device the model is on
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            
            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()  # Get predicted labels
            predictions.extend(preds)
    
    # Map predictions to labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[pred] for pred in predictions]
    
    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels
    
    return df

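# Example call: `df` is assumed to be a pandas DataFrame that is already loaded,
# and "text_col" stands in for the name of its tweet-text column.
spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)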