🚀 HateBERTimbau-YouTube-Twitter
HateBERTimbau-YouTube-Twitter is a transformer-based encoder model for identifying hate speech in Portuguese social media text. It is a fine-tuned version of the HateBERTimbau model, retrained on a dataset of 45,458 online messages (23,912 YouTube comments and 21,546 tweets) annotated for hate speech.
✨ Features
🚀 Quick Start
You can use this model in two ways: directly with a text-classification pipeline, or by fine-tuning it on your own dataset. Both are shown in the usage examples below.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for text classification:
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="knowhate/HateBERTimbau-yt-tt")

classifier("as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é deixar de ser humano 😂😂")
```

Output:

```
[{'label': 'Hate Speech', 'score': 0.9959186911582947}]
```
Advanced Usage
This model can also be fine-tuned for a specific task or dataset:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-yt-tt")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-yt-tt")

dataset = load_dataset("knowhate/youtube-train")

def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()
```
📚 Documentation
Model Description
Training
Data
23,912 YouTube comments and 21,546 tweets, a total of 45,458 online messages associated with offensive content, were used to fine-tune the base model.
Training Hyperparameters
- Batch Size: 32
- Epochs: 3
- Learning Rate: 2e-5 with Adam optimizer
- Maximum Sequence Length: 350 tokens
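Under the `transformers` Trainer API, the hyperparameters listed above would correspond to a configuration along these lines (a sketch for reference only; the original training script is not published here, so argument choices beyond the listed values are assumptions):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the training configuration from the
# hyperparameters listed above (batch size 32, 3 epochs, lr 2e-5).
training_args = TrainingArguments(
    output_dir="hatebertimbau",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,          # Adam is the Trainer's default optimizer family
    evaluation_strategy="epoch",
)

# The maximum sequence length (350 tokens) is applied at tokenization time:
# tokenizer(..., truncation=True, max_length=350)
```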
Testing
Data
The datasets used to test this model were: [knowhate/youtube-test](https://huggingface.co/datasets/knowhate/youtube-test) and [knowhate/twitter-test](https://huggingface.co/datasets/knowhate/twitter-test).
Results
| Dataset | Precision | Recall | F1-score |
|---------|-----------|--------|----------|
| knowhate/youtube-test | 0.867 | 0.892 | 0.874 |
| knowhate/twitter-test | 0.397 | 0.627 | 0.486 |
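For reference, precision, recall, and F1-score relate to confusion-matrix counts as follows. This is a minimal sketch using hypothetical counts, not the actual test-set figures; the table above may also use class-averaged variants of these metrics.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts for illustration (not from the actual evaluation):
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
# p == 0.8
```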
BibTeX Citation
Currently in peer review.

```bibtex
@article{
}
```
Acknowledgements
This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306). However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the Knowhate Project. Neither the European Union nor the Knowhate Project can be held responsible for them.
📄 License
This project is released under a Creative Commons (CC) license.