xlm-roberta-Twitter-spam-classification Open Source Model - Accurately Identify Spam Content on Twitter/X Platform

Xlm Roberta Twitter Spam Classification

Developed by cja5553

A spam classification model for the Twitter/X platform, fine-tuned from xlm-roberta-large to identify whether a tweet is spam

Text Classification

Transformers

Language: English
Open Source License: MIT
Tags: #X Platform Spam Detection #Multilingual Twitter Classification #High-Precision Text Filtering

Downloads 20

Release Time : 11/9/2024

Model Overview

This model classifies tweets on the X Platform (formerly Twitter) as 'spam' or 'quality content'. It was fine-tuned on the UtkMl Twitter Spam Detection dataset.

Model Features

High Accuracy

Achieves an F1 score of about 97.5% (0.97455) on the test set

Multilingual Support

Based on the xlm-roberta-large architecture, with potential for multilingual processing

Batch Inference

Supports batch processing of tweets, optimizing GPU usage efficiency
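
Batching here means grouping tweets into fixed-size chunks so that each forward pass processes several texts at once. The following is a minimal sketch of that chunking step in plain Python, with no model involved. The helper name `iter_batches` and the sample tweets are hypothetical; the value 24 mirrors the default `batch_size` of the usage example.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 24) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` texts (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


# Hypothetical input: 50 placeholder tweets
tweets = [f"tweet number {i}" for i in range(50)]
batch_sizes = [len(batch) for batch in iter_batches(tweets, batch_size=24)]
print(batch_sizes)  # [24, 24, 2]
```

A smaller `batch_size` lowers peak GPU memory at the cost of more forward passes.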

Model Capabilities

Text Classification

Spam Content Identification

Social Media Content Analysis

Use Cases

Content Moderation

Automatic Spam Tweet Filtering

Automatically identifies and filters spam content on social media platforms

Reported recall on the test set is about 97.5% (0.97455)

Data Analysis

Social Media Content Quality Analysis

Analyzes the distribution of tweet content quality
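
As a rough illustration of both use cases, the sketch below filters out tweets labeled as spam and then summarizes the label distribution. The rows are made-up placeholders standing in for classifier output; only the labels "Quality" and "Spam" come from the model card.

```python
from collections import Counter

# Hypothetical classifier output: (tweet text, predicted label) pairs
predictions = [
    ("Great thread on model evaluation", "Quality"),
    ("WIN A FREE PHONE, click here now", "Spam"),
    ("Conference slides are now online", "Quality"),
    ("Earn $$$ from home, limited offer", "Spam"),
    ("Reading group moved to Thursday", "Quality"),
]

# Content moderation: keep only tweets predicted as quality content
kept = [text for text, label in predictions if label == "Quality"]

# Data analysis: share of each predicted label
counts = Counter(label for _, label in predictions)
shares = {label: count / len(predictions) for label, count in counts.items()}

print(kept)
print(shares)  # {'Quality': 0.6, 'Spam': 0.4}
```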

🚀 Spam detection of Tweets

This model classifies Tweets from X (formerly known as Twitter) into 'Spam' (1) or 'Quality' (0), providing an effective solution for spam detection in the Twitter environment.

🚀 Quick Start

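Load the model from the Hugging Face Hub under the name cja5553/xlm-roberta-Twitter-spam-classification and run it on a DataFrame of tweets with the classify_texts function in the Usage Examples section.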

✨ Features

Classify Tweets from X (formerly Twitter) into 'Spam' or 'Quality'.
Fine-tuned on UtkMl's Twitter Spam Detection dataset with FacebookAI/xlm-roberta-large as the base model.

📦 Installation

The original model card lists no installation steps. The usage example relies on the torch, transformers, datasets, pandas, and tqdm Python packages.

💻 Usage Examples

Basic Usage

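import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm as tqdm_notebook  # progress bar for notebooks and terminals
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def classify_texts(df, text_col, model_path="cja5553/xlm-roberta-Twitter-spam-classification", batch_size=24):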
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.
    
    text_col : str
        Name of the column in `df` that contains the text data to be classified.
    
    model_path : str, default="cja5553/xlm-roberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.
    
    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing the predicted labels ("Quality" or "Spam") for each text.

    '''
    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    device = "cuda" if torch.cuda.is_available() else "cpu"  # Fall back to CPU when no GPU is present
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
    model.eval()  # Set model to evaluation mode
    
    # Prepare the text data for classification
    texts = df[text_col].astype(str).tolist()  # Ensure text is in string format, without modifying df

    # Convert the data to a Hugging Face Dataset and tokenize
    text_dataset = Dataset.from_dict({"text": texts})
    
    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )
    
    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
    
    # DataLoader for the text data
    text_loader = DataLoader(text_dataset, batch_size=batch_size)
    
    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm_notebook(text_loader):
            input_ids = batch['input_ids'].to(model.device)
            attention_mask = batch['attention_mask'].to(model.device)
            
            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().numpy()  # Get predicted labels
            predictions.extend(preds)
    
    # Map predictions to labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[pred] for pred in predictions]
    
    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels
    
    return df

# `df` is your own pandas DataFrame; replace "text_col" with the name of its text column
spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)
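
The function above takes the argmax of the logits and maps index 0 to "Quality" and index 1 to "Spam". The sketch below reproduces that step in plain Python and additionally applies a softmax to obtain a confidence score, which the original function does not return. The logits are invented for illustration, and `label_with_confidence` is a hypothetical helper.

```python
import math
from typing import List, Tuple

ID2LABEL = {0: "Quality", 1: "Spam"}  # same mapping as in the usage example


def label_with_confidence(logits: List[float]) -> Tuple[str, float]:
    """Return the argmax label and its softmax probability (hypothetical helper)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for two tweets, ordered as [Quality, Spam]
for logits in ([2.0, -1.0], [-0.5, 3.5]):
    label, confidence = label_with_confidence(logits)
    print(label, round(confidence, 3))
```

A confidence score like this could be used to route borderline tweets to manual review instead of filtering them automatically.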

📚 Documentation

Training Dataset

The model was fine-tuned on UtkMl's Twitter Spam Detection dataset, with FacebookAI/xlm-roberta-large as the base model.
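
The reported metrics come from an 80-10-10 train-val-test split of this dataset. The following is a minimal sketch of such a split on row indices, assuming a simple seeded shuffle. The author's actual procedure (seed, stratification) is not described here, and `split_indices` is a hypothetical helper.

```python
import random
from typing import List, Tuple


def split_indices(n_rows: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle row indices and cut them 80/10/10 (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n_rows * 8 // 10
    n_val = n_rows // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_indices(1000)
print(len(train), len(val), len(test))  # 800 100 100
```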

Metrics

Based on an 80-10-10 train-val-test split, the following results were obtained on the test set:

Accuracy: 0.974555
Precision: 0.97457
Recall: 0.97455
F1-Score: 0.97455
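
For reference, these four metrics can be derived from a confusion matrix. The sketch below uses invented counts, not the model's actual test results, and assumes class-weighted averaging for precision, recall, and F1. The averaging scheme is not stated here, although near-identical values across all four metrics are consistent with it.

```python
from typing import Dict, List

# Invented confusion matrix: rows = true class, columns = predicted class
# Index 0 = "Quality", index 1 = "Spam"
confusion: List[List[int]] = [
    [480, 20],  # true Quality: 480 correct, 20 flagged as spam
    [10, 490],  # true Spam: 10 missed, 490 caught
]


def weighted_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall, and F1."""
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(len(cm))) / total
    precision = recall = f1 = 0.0
    for i in range(len(cm)):
        support = sum(cm[i])                   # true instances of class i
        predicted = sum(row[i] for row in cm)  # predictions of class i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


for name, value in weighted_metrics(confusion).items():
    print(f"{name}: {value:.4f}")
```

With support-weighted averaging, recall always equals accuracy, which is why the two values coincide in this toy example.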

Code

The code used to train this model is available on GitHub at github.com/cja5553/Twitter_spam_detection

Questions?

Contact me at alba@wustl.edu

📄 License

The model is released under the MIT license.
