ByT5-base fine-tuned for Hate Speech Detection (on Tweets)
ByT5-base fine-tuned on the tweets_hate_speech_detection dataset for the Sequence Classification downstream task.
Features
Details of ByT5 - Base
ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of MT5. It was pre-trained only on mC4, without any supervised training, using an average span mask of 20 UTF-8 characters, so the model needs to be fine-tuned before it can be used for a downstream task. ByT5 performs particularly well on noisy text data: for example, google/byt5-base significantly outperforms mt5-base on TweetQA.
Details of the downstream task (Sequence Classification as Text Generation) - Dataset
The tweets_hate_speech_detection dataset aims to detect hate speech in tweets. For simplicity, a tweet is considered to contain hate speech if it has a racist or sexist sentiment, so the task is to distinguish racist or sexist tweets from all other tweets.
Formally, given a training sample of tweets and labels (where label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not), the objective is to predict the labels on the given test dataset.
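Since the classification is framed as text generation, each integer label must be rendered as a short target string for the seq2seq objective. A minimal sketch of that preprocessing step (the exact target strings used for this checkpoint are an assumption):

```python
# Hypothetical label-to-text mapping for classification-as-generation;
# the real checkpoint may have been trained with different target strings.
label2text = {0: "no-hate-speech", 1: "hate-speech"}

def to_seq2seq_example(tweet: str, label: int) -> dict:
    # The model is trained to read the tweet and *generate* the label text.
    return {"input_text": tweet, "target_text": label2text[label]}

ex = to_seq2seq_example("@user when a father is dysfunctional...", 0)
```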
- Data Instances:
The dataset contains a label indicating whether a tweet is hate speech or not.
{'label': 0,  # not hate speech
'tweet': ' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run'}
- Data Fields:
- label: 1 = hate speech, 0 = not hate speech
- tweet: content of the tweet as a string
- Data Splits:
The dataset provides a training split with 31,962 entries.
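A representative held-out test set can be carved out per label so that it preserves the class imbalance. A minimal stratified-split sketch (the function and field names are illustrative, not from the training code):

```python
import random

def stratified_split(examples, test_frac=0.05, seed=42):
    """Split examples into train/test, sampling test_frac within each label."""
    rng = random.Random(seed)
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex["label"], []).append(ex)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_test = max(1, int(len(items) * test_frac))  # keep at least one per class
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```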
Test set metrics
A representative test set was created from 5% of the entries. Because the dataset is imbalanced, the F1 score is reported: the model achieved an F1 of 79.8.
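F1 balances precision and recall, which is why it is a better metric than plain accuracy on an imbalanced dataset, where always predicting the majority class would already look deceptively accurate. Binary F1 can be computed by hand:

```python
def f1_binary(y_true, y_pred):
    # F1 = 2PR / (P + R), with precision P = tp/(tp+fp) and recall R = tp/(tp+fn).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

f1_binary([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])  # precision 2/3, recall 1.0 -> F1 = 0.8
```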
Installation
git clone https://github.com/huggingface/transformers.git
pip install -q ./transformers
Usage Examples
Basic Usage
from transformers import AutoTokenizer, T5ForConditionalGeneration

ckpt = 'Narrativa/byt5-base-tweet-hate-detection'

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt).to("cuda")  # requires a CUDA-capable GPU

def classify_tweet(tweet):
    # ByT5 consumes raw bytes, so no subword tokenization is involved.
    inputs = tokenizer([tweet], padding='max_length', truncation=True, max_length=512, return_tensors='pt')
    input_ids = inputs.input_ids.to('cuda')
    attention_mask = inputs.attention_mask.to('cuda')
    # The label is generated as text rather than predicted as a class logit.
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

classify_tweet('here goes your tweet...')
About Narrativa
This model was created by Narrativa. Narrativa focuses on Natural Language Generation (NLG); Gabriele, its machine-learning-based platform, builds and deploys natural language solutions. #NLG #AI