# Gender Prediction from Text
This model predicts the likely gender of an anonymous speaker or writer based solely on the content of an English text. It is built on DeBERTa-v3-large and fine-tuned on a diverse, multi-domain dataset of formal and informal texts drawn from multilingual sources (non-English data was translated to English).
## Quick Start
### Features
- Predicts the gender of an anonymous speaker or writer from English text.
- Built on DeBERTa-v3-large and fine-tuned on a diverse, multi-domain dataset.
### Installation
There is no dedicated package to install; the model is loaded directly through the Hugging Face `transformers` library.
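A typical environment setup, assuming a pip-based install (`sentencepiece` is needed because the usage example below loads the slow DeBERTa-v3 tokenizer via `use_fast=False`):

```bash
pip install torch transformers sentencepiece
```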
### Usage Examples
#### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "fc63/gender_prediction_model_from_text"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)

def predict(text):
    # Tokenize and move the inputs to the same device as the model
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Convert logits to class probabilities
    probs = F.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    confidence = round(probs[0][pred].item() * 100, 1)
    gender = "Female" if pred == 0 else "Male"  # label 0 = Female, 1 = Male
    return f"{gender} (Confidence: {confidence}%)"

sample_text = "I love writing in my journal every night. It helps me reflect on the day and plan for tomorrow."
print(predict(sample_text))
```
The output for this sample:

```
Female (Confidence: 84.1%)
```
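Since the card's pipeline tag is `text-classification`, the model should also work with the generic `pipeline` API. A brief sketch; note that the returned labels may surface as `LABEL_0`/`LABEL_1` unless the model config maps them (per the helper above, 0 is Female and 1 is Male):

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="fc63/gender_prediction_model_from_text",
    use_fast=False,  # the card loads the slow (SentencePiece) tokenizer
)
print(clf("I love writing in my journal every night."))
# e.g. [{'label': 'LABEL_0', 'score': ...}] -> LABEL_0 = Female per the mapping above
```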
## Documentation
### Model Information
| Property | Details |
|----------|---------|
| Model Type | text-classification |
| Base Model | microsoft/deberta-v3-large |
| Pipeline Tag | text-classification |
| Training Datasets | samzirbo/europarl.en-es.gendered, czyzi0/luna-speech-dataset, czyzi0/pwr-azon-speech-dataset, sagteam/author_profiling, kaushalgawri/nptel-en-tags-and-gender-v0 |
| Evaluation Metrics | accuracy, f1, precision, recall |
### Model Results
The model `gender_prediction_model_from_text` achieves the following results on text classification:

| Metric | Value |
|--------|-------|
| f1 | 0.69 |
| accuracy | 0.69 |
### Datasets Used
All datasets were normalized, translated if necessary, deduplicated, and balanced via random undersampling to ensure equal representation of both genders.
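The preprocessing code itself is not published with the card; the following is a minimal sketch of the deduplication and balancing steps, assuming a pandas DataFrame with hypothetical `text` and `label` columns:

```python
import pandas as pd

# Toy stand-in for the merged corpus; the column names are assumptions.
df = pd.DataFrame({
    "text": ["sample a", "sample a", "sample b", "sample c", "sample d"],
    "label": [0, 0, 0, 1, 1],  # 0 = female, 1 = male
})

df = df.drop_duplicates(subset="text")      # deduplication
n_min = df["label"].value_counts().min()    # minority-class size
balanced = (
    df.groupby("label")
      .sample(n=n_min, random_state=42)     # random undersampling
      .reset_index(drop=True)
)
print(balanced["label"].value_counts())     # equal counts per gender
```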
### Preprocessing & Training
- Normalization: Cleaned quotes, dashes, placeholders, noise, and HTML/code from all datasets.
- Translation: Used `Helsinki-NLP/opus-mt-*` models for Polish and Russian data.
- Undersampling: Random undersampling to balance male and female samples.
- Training Strategy (a configuration sketch follows this list):
  - LR Finder used to optimize the learning rate (`2.66e-6`)
  - Fine-tuned using early stopping on both F1 and loss
  - Step-based evaluation every 250 steps
  - Best checkpoint at step 24,750 saved and evaluated
- Second Phase Fine-tuning:
  - Performed on the full merged dataset for 2 epochs
  - Used a cosine learning rate scheduler and warm-up steps
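The training scripts are not included in the card. As a rough sketch, the reported settings map onto Hugging Face `TrainingArguments` as shown below; everything except the quoted hyperparameters (learning rate `2.66e-6`, 250-step evaluation, 2 epochs, cosine schedule with warm-up) is an assumption:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gender-clf",
    learning_rate=2.66e-6,        # value found with the LR Finder
    eval_strategy="steps",        # step-based evaluation ("evaluation_strategy" in older versions)
    eval_steps=250,               # every 250 steps
    save_steps=250,
    load_best_model_at_end=True,  # the best checkpoint (step 24,750 in the card) is kept
    metric_for_best_model="f1",   # early stopping tracked F1 (and loss)
    num_train_epochs=2,           # second-phase fine-tuning length
    lr_scheduler_type="cosine",   # cosine schedule with warm-up
    warmup_steps=500,             # warm-up step count is an assumption
)

# These arguments would then be passed to a Trainer together with the tokenized
# datasets, a compute_metrics function returning {"f1": ...}, and an
# EarlyStoppingCallback to reproduce the early-stopping behavior described above.
```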
### Performance (on the full merged test set)
| Class | Precision | Recall | F1-Score | Accuracy | Support |
|-------|-----------|--------|----------|----------|---------|
| Female | 0.70 | 0.65 | 0.68 | | 591,027 |
| Male | 0.68 | 0.72 | 0.70 | | 591,027 |
| Macro Avg | 0.69 | 0.69 | 0.69 | | 1,182,054 |
| Accuracy | | | | 0.69 | 1,182,054 |
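A per-class table like the one above is the standard output of `sklearn.metrics.classification_report`, which is one common way to compute the listed metrics (whether the author used it is an assumption). A minimal sketch with toy labels:

```python
from sklearn.metrics import classification_report

# Toy ground truth and predictions; 0 = female, 1 = male.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(classification_report(y_true, y_pred, target_names=["Female", "Male"]))
```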
### Future Work & Limitations
The current model reaches an accuracy and F1 score of 0.69. Its predictions are biased: emotional, psychological, and introspective texts are often classified as female, while more direct, result-oriented writing is often classified as male. A large, carefully labeled dataset that counteracts this pattern would be needed to correct it.

The datasets were obtained from open-source platforms, which limited the range of available data. Further progress would require building and labeling a larger dataset, which is time-consuming, labor-intensive, and costly.

Before creating a new dataset, the author plans to try further approaches with the current data. If none of them work, building a new dataset will be the next step, which may also mark the end of development due to the high cost.
## Technical Details
The model is based on the DeBERTa-v3-large architecture. Training used several preprocessing steps (normalization, translation, and undersampling) together with a training strategy built around an LR Finder, early stopping, step-based evaluation, and a cosine learning rate scheduler in the second fine-tuning phase.
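The exact translation checkpoints are not listed beyond the `Helsinki-NLP/opus-mt-*` family. A minimal sketch of the translation step for the Polish→English case, assuming the standard `opus-mt-pl-en` checkpoint (Russian data would use `opus-mt-ru-en` analogously):

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed checkpoint from the opus-mt-* family named in the card.
mt_name = "Helsinki-NLP/opus-mt-pl-en"
mt_tokenizer = MarianTokenizer.from_pretrained(mt_name)
mt_model = MarianMTModel.from_pretrained(mt_name)

batch = mt_tokenizer(["Codziennie piszę w swoim dzienniku."], return_tensors="pt", padding=True)
translated = mt_model.generate(**batch)
print(mt_tokenizer.batch_decode(translated, skip_special_tokens=True))
```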
## License
This project is licensed under the MIT license.
Author: Furkan Şoban
Project: CENG-481 Gender Prediction Model
## Citations
- "@misc{fc63_gender1_2025,\n title = {Gender Prediction from Text},\n author = {ลoban, Furkan},\n year = {2025},\n howpublished = {\url{https://doi.org/10.5281/zenodo.15619489}},\n note = {DeBERTa-v3-large model fine-tuned on multi-domain gender-labeled texts}\n}"