🚀 HateBERTimbau-YouTube-Twitter
HateBERTimbau-YouTube-Twitter is a transformer-based encoder model for identifying hate speech in Portuguese social media text. It is a fine-tuned version of the HateBERTimbau model, retrained on a dataset of 45,458 online messages (23,912 YouTube comments and 21,546 tweets) annotated for hate speech.
✨ Features
🚀 Quick Start
You can use this model in two ways: directly with a text-classification pipeline, or by fine-tuning it on your own dataset. Both are shown in the usage examples below.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for text classification:
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="knowhate/HateBERTimbau-yt-tt")

classifier("as pessoas tem que perceber que ser 'panasca' não é deixar de ser homem, é deixar de ser humano 😂😂")
```

Output:

```
[{'label': 'Hate Speech', 'score': 0.9959186911582947}]
```
Advanced Usage
This model can also be fine-tuned for a specific task or dataset:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("knowhate/HateBERTimbau-yt-tt")
model = AutoModelForSequenceClassification.from_pretrained("knowhate/HateBERTimbau-yt-tt")

dataset = load_dataset("knowhate/youtube-train")

def tokenize_function(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

training_args = TrainingArguments(output_dir="hatebertimbau", evaluation_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
)

trainer.train()
```
📚 Documentation
Model Description
Training
Data
23,912 YouTube comments and 21,546 tweets, a total of 45,458 online messages associated with offensive content, were used to fine-tune the base model.
Training Hyperparameters
- Batch Size: 32
- Epochs: 3
- Learning Rate: 2e-5 with Adam optimizer
- Maximum Sequence Length: 350 tokens
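Under the `transformers` Trainer API, the hyperparameters listed above would correspond to a configuration along these lines (a sketch for reference only; the original training script is not published here, so argument choices beyond the listed values are assumptions):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the training configuration from the
# hyperparameters listed above (batch size 32, 3 epochs, lr 2e-5).
training_args = TrainingArguments(
    output_dir="hatebertimbau",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,          # Adam is the Trainer's default optimizer family
    evaluation_strategy="epoch",
)

# The maximum sequence length (350 tokens) is applied at tokenization time:
# tokenizer(..., truncation=True, max_length=350)
```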
Testing
Data
The datasets used to test this model were: [knowhate/youtube-test](https://huggingface.co/datasets/knowhate/youtube-test) and [knowhate/twitter-test](https://huggingface.co/datasets/knowhate/twitter-test).
Results
| Dataset | Precision | Recall | F1-score |
|---------|-----------|--------|----------|
| knowhate/youtube-test | 0.867 | 0.892 | 0.874 |
| knowhate/twitter-test | 0.397 | 0.627 | 0.486 |
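For reference, precision, recall, and F1-score relate to confusion-matrix counts as follows. This is a minimal sketch using hypothetical counts, not the actual test-set figures; the table above may also use class-averaged variants of these metrics.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts for illustration (not from the actual evaluation):
p, r, f = precision_recall_f1(tp=80, fp=20, fn=10)
# p == 0.8
```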
BibTeX Citation
Currently in peer review.

```bibtex
@article{
}
```
Acknowledgements
This work was funded in part by the European Union under Grant CERV-2021-EQUAL (101049306). However, the views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or the Knowhate Project. Neither the European Union nor the Knowhate Project can be held responsible for them.
📄 License
This project is released under a Creative Commons (CC) license.