# Gender Prediction from Text
This model predicts the likely gender of an anonymous speaker or writer based solely on the content of an English text. It is built on DeBERTa-v3-large and fine-tuned on a diverse, multi-domain dataset of formal and informal texts drawn from multilingual sources (non-English data was translated to English).
## Quick Start
### Features
- Predicts the gender of an anonymous speaker or writer from English text.
- Built on DeBERTa-v3-large and fine-tuned on a diverse, multi-domain dataset.
### Installation
There is no dedicated package to install; the model is loaded directly through the Hugging Face `transformers` library.
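A typical environment setup, assuming a pip-based install (`sentencepiece` is needed because the usage example below loads the slow DeBERTa-v3 tokenizer via `use_fast=False`):

```bash
pip install torch transformers sentencepiece
```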
### Usage Examples
#### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "fc63/gender_prediction_model_from_text"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)

def predict(text):
    # Tokenize and move the inputs to the same device as the model
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Convert logits to class probabilities
    probs = F.softmax(outputs.logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    confidence = round(probs[0][pred].item() * 100, 1)
    gender = "Female" if pred == 0 else "Male"  # label 0 = Female, 1 = Male
    return f"{gender} (Confidence: {confidence}%)"

sample_text = "I love writing in my journal every night. It helps me reflect on the day and plan for tomorrow."
print(predict(sample_text))
```
The output for this sample:

```
Female (Confidence: 84.1%)
```
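Since the card's pipeline tag is `text-classification`, the model should also work with the generic `pipeline` API. A brief sketch; note that the returned labels may surface as `LABEL_0`/`LABEL_1` unless the model config maps them (per the helper above, 0 is Female and 1 is Male):

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="fc63/gender_prediction_model_from_text",
    use_fast=False,  # the card loads the slow (SentencePiece) tokenizer
)
print(clf("I love writing in my journal every night."))
# e.g. [{'label': 'LABEL_0', 'score': ...}] -> LABEL_0 = Female per the mapping above
```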
## Documentation
### Model Information
| Property | Details |
|----------|---------|
| Model Type | text-classification |
| Base Model | microsoft/deberta-v3-large |
| Pipeline Tag | text-classification |
| Training Datasets | samzirbo/europarl.en-es.gendered, czyzi0/luna-speech-dataset, czyzi0/pwr-azon-speech-dataset, sagteam/author_profiling, kaushalgawri/nptel-en-tags-and-gender-v0 |
| Evaluation Metrics | accuracy, f1, precision, recall |
### Model Results
The model `gender_prediction_model_from_text` achieves the following results on text classification:

| Metric | Value |
|--------|-------|
| f1 | 0.69 |
| accuracy | 0.69 |
### Datasets Used
All datasets were normalized, translated if necessary, deduplicated, and balanced via random undersampling to ensure equal representation of both genders.
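The preprocessing code itself is not published with the card; the following is a minimal sketch of the deduplication and balancing steps, assuming a pandas DataFrame with hypothetical `text` and `label` columns:

```python
import pandas as pd

# Toy stand-in for the merged corpus; the column names are assumptions.
df = pd.DataFrame({
    "text": ["sample a", "sample a", "sample b", "sample c", "sample d"],
    "label": [0, 0, 0, 1, 1],  # 0 = female, 1 = male
})

df = df.drop_duplicates(subset="text")      # deduplication
n_min = df["label"].value_counts().min()    # minority-class size
balanced = (
    df.groupby("label")
      .sample(n=n_min, random_state=42)     # random undersampling
      .reset_index(drop=True)
)
print(balanced["label"].value_counts())     # equal counts per gender
```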
### Preprocessing & Training
- Normalization: Cleaned quotes, dashes, placeholders, noise, and HTML/code from all datasets.
- Translation: Used `Helsinki-NLP/opus-mt-*` models for Polish and Russian data.
- Undersampling: Random undersampling to balance male and female samples.
- Training Strategy (a configuration sketch follows this list):
  - LR Finder used to optimize the learning rate (`2.66e-6`)
  - Fine-tuned using early stopping on both F1 and loss
  - Step-based evaluation every 250 steps
  - Best checkpoint at step 24,750 saved and evaluated
- Second Phase Fine-tuning:
  - Performed on the full merged dataset for 2 epochs
  - Used a cosine learning rate scheduler and warm-up steps
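The training scripts are not included in the card. As a rough sketch, the reported settings map onto Hugging Face `TrainingArguments` as shown below; everything except the quoted hyperparameters (learning rate `2.66e-6`, 250-step evaluation, 2 epochs, cosine schedule with warm-up) is an assumption:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gender-clf",
    learning_rate=2.66e-6,        # value found with the LR Finder
    eval_strategy="steps",        # step-based evaluation ("evaluation_strategy" in older versions)
    eval_steps=250,               # every 250 steps
    save_steps=250,
    load_best_model_at_end=True,  # the best checkpoint (step 24,750 in the card) is kept
    metric_for_best_model="f1",   # early stopping tracked F1 (and loss)
    num_train_epochs=2,           # second-phase fine-tuning length
    lr_scheduler_type="cosine",   # cosine schedule with warm-up
    warmup_steps=500,             # warm-up step count is an assumption
)

# These arguments would then be passed to a Trainer together with the tokenized
# datasets, a compute_metrics function returning {"f1": ...}, and an
# EarlyStoppingCallback to reproduce the early-stopping behavior described above.
```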
### Performance (on the full merged test set)
| Class | Precision | Recall | F1-Score | Accuracy | Support |
|-------|-----------|--------|----------|----------|---------|
| Female | 0.70 | 0.65 | 0.68 | | 591,027 |
| Male | 0.68 | 0.72 | 0.70 | | 591,027 |
| Macro Avg | 0.69 | 0.69 | 0.69 | | 1,182,054 |
| Accuracy | | | | 0.69 | 1,182,054 |
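A per-class table like the one above is the standard output of `sklearn.metrics.classification_report`, which is one common way to compute the listed metrics (whether the author used it is an assumption). A minimal sketch with toy labels:

```python
from sklearn.metrics import classification_report

# Toy ground truth and predictions; 0 = female, 1 = male.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
print(classification_report(y_true, y_pred, target_names=["Female", "Male"]))
```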
### Future Work & Limitations
The current model reaches an accuracy and F1 score of 0.69. Its predictions are biased: emotional, psychological, and introspective texts are often classified as female, while more direct, result-oriented writing is often classified as male. A large, carefully labeled dataset that counteracts this pattern would be needed to correct it.

The datasets were obtained from open-source platforms, which limited the range of available data. Further progress would require building and labeling a larger dataset, which is time-consuming, labor-intensive, and costly.

Before creating a new dataset, the author plans to try further approaches with the current data. If none of them work, building a new dataset will be the next step, which may also mark the end of development due to the high cost.
## Technical Details
The model is based on the DeBERTa-v3-large architecture. Training used several preprocessing steps (normalization, translation, and undersampling) together with a training strategy built around an LR Finder, early stopping, step-based evaluation, and a cosine learning rate scheduler in the second fine-tuning phase.
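The exact translation checkpoints are not listed beyond the `Helsinki-NLP/opus-mt-*` family. A minimal sketch of the translation step for the Polish→English case, assuming the standard `opus-mt-pl-en` checkpoint (Russian data would use `opus-mt-ru-en` analogously):

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed checkpoint from the opus-mt-* family named in the card.
mt_name = "Helsinki-NLP/opus-mt-pl-en"
mt_tokenizer = MarianTokenizer.from_pretrained(mt_name)
mt_model = MarianMTModel.from_pretrained(mt_name)

batch = mt_tokenizer(["Codziennie piszę w swoim dzienniku."], return_tensors="pt", padding=True)
translated = mt_model.generate(**batch)
print(mt_tokenizer.batch_decode(translated, skip_special_tokens=True))
```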
## License
This project is licensed under the MIT license.
Author: Furkan Şoban
Project: CENG-481 Gender Prediction Model
## Citations
- "@misc{fc63_gender1_2025,\n title = {Gender Prediction from Text},\n author = {ลoban, Furkan},\n year = {2025},\n howpublished = {\url{https://doi.org/10.5281/zenodo.15619489}},\n note = {DeBERTa-v3-large model fine-tuned on multi-domain gender-labeled texts}\n}"