TopicClassifier Open-Source Topic Classification Model - Free Deployment, Accurately Classify Web Page Content into 24 Categories

TopicClassifier

Developed by WebOrganizer

A topic classification model fine-tuned from gte-base-en-v1.5 that classifies web content into 24 categories

Text Classification

Transformers

Other: #Web Content Classification #Dual Input (URL+Text) #24 Fine-grained Categories

Downloads 2,288

Release Date: 2/10/2025

Model Overview

This model automatically categorizes web content into 24 predefined topic categories based on the page URL and text content. It is suitable for content filtering, information organization, and similar scenarios.

Model Features

Two-stage Training

First trained on 1 million documents annotated by Llama-3.1-8B, then fine-tuned on 100,000 documents annotated by Llama-3.1-405B-FP8

Dual Input (URL+Text)

Simultaneously considers both webpage URL and text content for comprehensive classification

Efficient Inference Support

Supports unpadding and memory-efficient attention mechanisms, with optional xformers acceleration

Model Capabilities

Web Content Classification

Multi-category Probability Prediction

Text Understanding

Use Cases

Content Management

Automatic Webpage Classification

Automatically categorizes scraped webpage content by topic

Assigns each page to one of 24 topic categories

Content Filtering

Adult Content Filtering

Identifies and filters inappropriate content

Flags such pages through the "Adult" topic category

🚀 WebOrganizer/TopicClassifier

The TopicClassifier organizes web content into 24 categories based on the URL and text contents of web pages.

[Paper] [Website] [GitHub]

🚀 Quick Start

The model is a gte-base-en-v1.5 with 140M parameters, fine-tuned in two stages on the following training data:

WebOrganizer/TopicAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

All Domain Classifiers

WebOrganizer/FormatClassifier
WebOrganizer/FormatClassifier-NoURL
WebOrganizer/TopicClassifier ← you are here!
WebOrganizer/TopicClassifier-NoURL

✨ Features

The TopicClassifier classifies web content into 24 topic categories, giving web data a clear topical organization. It is fine-tuned in two stages on documents annotated by Llama-3.1-8B and Llama-3.1-405B-FP8.

💻 Usage Examples

Basic Usage

This classifier expects input in the following format:

{url}

{text}
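
As a minimal sketch of the template above, the helper below joins a URL and page text with a single blank line. The function name `format_web_page` is hypothetical, and stripping surrounding whitespace is an assumption rather than something the model card specifies.

```python
def format_web_page(url: str, text: str) -> str:
    """Build the "{url}", blank line, "{text}" input described above.

    Stripping whitespace around both parts is an assumption made here
    so that exactly one blank line separates the URL from the text.
    """
    return f"{url.strip()}\n\n{text.strip()}"


if __name__ == "__main__":
    page = format_web_page(
        "http://www.example.com",
        "How to build a computer from scratch? Here are the components you need...",
    )
    print(page)
```

The resulting string can be passed to the tokenizer in the same way as the hand-written `web_page` string in the example that follows.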

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# trust_remote_code=True is needed to load the custom gte-base-en-v1.5 model code
tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/TopicClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

# Input: the URL, a blank line, then the page text
web_page = """http://www.example.com

How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# Softmax over the logits gives a probability for each of the 24 topics
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 5 ("Hardware" topic)

You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see id2label and label2id in the model config):

Adult
Art & Design
Software Dev.
Crime & Law
Education & Jobs
Hardware
Entertainment
Social Life
Fashion & Beauty
Finance & Business
Food & Dining
Games
Health
History
Home & Hobbies
Industrial
Literature
Politics
Religion
Science & Tech.
Software
Sports & Fitness
Transportation
Travel

The full definitions of the categories can be found in the taxonomy config.
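
To illustrate how logits map onto the category list above, here is a small pure-Python sketch that applies a softmax and returns the most probable topics by name. The names `LABELS`, `softmax`, and `top_k_topics` are hypothetical helpers; the label order is copied from the list above, and the `id2label` mapping in the model config remains the authoritative source.

```python
import math
from typing import List, Sequence, Tuple

# Label order copied from the category list in this document (assumption:
# it matches id2label in the model config, which should be preferred).
LABELS = [
    "Adult", "Art & Design", "Software Dev.", "Crime & Law",
    "Education & Jobs", "Hardware", "Entertainment", "Social Life",
    "Fashion & Beauty", "Finance & Business", "Food & Dining", "Games",
    "Health", "History", "Home & Hobbies", "Industrial", "Literature",
    "Politics", "Religion", "Science & Tech.", "Software",
    "Sports & Fitness", "Transportation", "Travel",
]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_topics(logits: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = softmax(logits)
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up logits for illustration only: a strong score at index 5.
    example_logits = [0.0] * len(LABELS)
    example_logits[5] = 10.0
    for label, prob in top_k_topics(example_logits):
        print(f"{label}: {prob:.3f}")
```

With the real model, the logits for one page could be passed in as `outputs.logits[0].float().tolist()`.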

Advanced Usage

Efficient Inference

We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing xformers (see more here) and loading the model as follows:

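import torch
from transformers import AutoModelForSequenceClassification

# Memory-efficient attention requires xformers to be installed
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16
)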
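
Since the efficient path depends on xformers, one possible approach is to pick the loading flags based on whether the package is importable. The sketch below is a hypothetical helper (`efficient_load_kwargs` is not part of the model or of transformers). It conservatively disables both flags when xformers is missing, which is an assumption: the source only states that the efficient configuration requires xformers.

```python
import importlib.util
from typing import Any, Dict, Optional


def efficient_load_kwargs(xformers_available: Optional[bool] = None) -> Dict[str, Any]:
    """Choose from_pretrained keyword arguments for this classifier.

    If xformers_available is None, detect the package with importlib.
    Assumption: both efficiency flags are switched off together when
    xformers is not installed.
    """
    if xformers_available is None:
        xformers_available = importlib.util.find_spec("xformers") is not None
    return {
        "trust_remote_code": True,
        "unpad_inputs": xformers_available,
        "use_memory_efficient_attention": xformers_available,
    }


if __name__ == "__main__":
    print(efficient_load_kwargs())
```

The returned dictionary could then be unpacked into the call, for example `AutoModelForSequenceClassification.from_pretrained("WebOrganizer/TopicClassifier", **efficient_load_kwargs())`, with `torch_dtype` left for the caller to set.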

📚 Documentation

The model takes a web page's URL and text as input, formatted as the URL, a blank line, and then the text, and outputs logits over the 24 predefined categories. Applying a softmax to the logits yields a probability distribution over the categories.
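
As one example of such further processing, the sketch below flags pages whose probability for a blocked category exceeds a threshold, as might be done for content filtering. The function name `flag_blocked`, the default threshold of 0.5, and the toy three-label mapping are all illustrative assumptions; in practice the mapping would come from `id2label` in the model config.

```python
from typing import Dict, Iterable, List, Sequence


def flag_blocked(
    probs_per_page: Iterable[Sequence[float]],
    id2label: Dict[int, str],
    blocked_labels: Sequence[str] = ("Adult",),
    threshold: float = 0.5,
) -> List[bool]:
    """Return True for each page whose blocked-category probability
    mass is at or above the threshold (threshold is a hypothetical default)."""
    blocked_ids = [i for i, name in id2label.items() if name in blocked_labels]
    flags = []
    for probs in probs_per_page:
        blocked_mass = sum(probs[i] for i in blocked_ids)
        flags.append(blocked_mass >= threshold)
    return flags


if __name__ == "__main__":
    # Toy mapping and made-up probabilities, for illustration only.
    toy_id2label = {0: "Adult", 1: "Art & Design", 2: "Software Dev."}
    toy_probs = [[0.9, 0.05, 0.05], [0.1, 0.6, 0.3]]
    print(flag_blocked(toy_probs, toy_id2label))
```

A suitable threshold depends on the application and would need to be chosen on held-out data.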

📄 License

No license information is provided in the original document.

📚 Citation

@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
