WebOrganizer/TopicClassifier
The TopicClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
[Paper] [Website] [GitHub]
Quick Start
The model is a gte-base-en-v1.5 with 140M parameters, fine-tuned in two stages on the following training data:
- WebOrganizer/TopicAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
Features
The TopicClassifier assigns each web page to one of 24 topic categories, giving a consistent way to organize web content. It is fine-tuned on LLM-annotated topic datasets, first on a large set of Llama-3.1-8B annotations and then on a smaller set of Llama-3.1-405B-FP8 annotations.
Usage Examples
Basic Usage
This classifier expects input in the following format:
{url}
{text}
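The input format above can be assembled with a small helper. This is a minimal sketch: `format_input` is a hypothetical name and not part of the model's code, and the single-newline separator is an assumption based on the layout shown here, so it is exposed as a parameter.

```python
def format_input(url: str, text: str, separator: str = "\n") -> str:
    """Join a page URL and its text into one classifier input string.

    Hypothetical helper: the separator default is an assumption taken
    from the format shown above (URL on one line, then the text).
    """
    return f"{url.strip()}{separator}{text.strip()}"


# Example: build the input string for one page
page = format_input(
    "http://www.example.com",
    "How to build a computer from scratch? Here are the components you need...",
)
```

The resulting string can be passed to the tokenizer in place of a hand-written one.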
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/TopicClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

# The input is the page URL followed by the page text
web_page = """http://www.example.com
How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# A softmax over the logits gives one probability per category
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))  # index of the most likely category
You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see id2label and label2id in the model config):
- Adult
- Art & Design
- Software Dev.
- Crime & Law
- Education & Jobs
- Hardware
- Entertainment
- Social Life
- Fashion & Beauty
- Finance & Business
- Food & Dining
- Games
- Health
- History
- Home & Hobbies
- Industrial
- Literature
- Politics
- Religion
- Science & Tech.
- Software
- Sports & Fitness
- Transportation
- Travel
The full definitions of the categories can be found in the taxonomy config.
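To turn logits into readable predictions, the softmax and label lookup can be sketched in plain Python. The names `softmax`, `top_k_labels`, and `TOPIC_LABELS` are illustrative and not part of the model's code. The label list is copied from the order given above, which is assumed to match the model's label ids; the id2label mapping in the model config remains the authoritative source.

```python
import math

# Labels in the order listed above (assumed to match the model's label ids)
TOPIC_LABELS = [
    "Adult", "Art & Design", "Software Dev.", "Crime & Law",
    "Education & Jobs", "Hardware", "Entertainment", "Social Life",
    "Fashion & Beauty", "Finance & Business", "Food & Dining", "Games",
    "Health", "History", "Home & Hobbies", "Industrial", "Literature",
    "Politics", "Religion", "Science & Tech.", "Software",
    "Sports & Fitness", "Transportation", "Travel",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, k=3, labels=TOPIC_LABELS):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(labels[i], probs[i]) for i in ranked[:k]]


# Example with made-up logits that favour the label at index 5
example_logits = [0.0] * len(TOPIC_LABELS)
example_logits[5] = 4.0
print(top_k_labels(example_logits, k=3))
```

With the real model, a single row of `outputs.logits` converted with `.tolist()` could be passed as `logits`.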
Advanced Usage
Efficient Inference
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing xformers (see more here) and loading the model like:
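import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,  # requires xformers
    torch_dtype=torch.bfloat16
)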
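Because the efficient path depends on xformers being installed, one option is to pick the loading arguments at runtime. The sketch below is an assumption about how this could be wired up, not something the model card prescribes: `xformers_available` and `loading_kwargs` are hypothetical helpers, and they only reuse the keyword arguments shown in the two loading examples in this document.

```python
import importlib.util


def xformers_available() -> bool:
    """Check whether the xformers package can be imported."""
    return importlib.util.find_spec("xformers") is not None


def loading_kwargs(use_efficient=None) -> dict:
    """Build from_pretrained keyword arguments for the classifier.

    If use_efficient is None, the choice follows xformers availability.
    """
    if use_efficient is None:
        use_efficient = xformers_available()
    kwargs = {"trust_remote_code": True}
    if use_efficient:
        # Settings from the efficient-inference example
        kwargs["unpad_inputs"] = True
        kwargs["use_memory_efficient_attention"] = True
    else:
        # Setting from the basic usage example
        kwargs["use_memory_efficient_attention"] = False
    return kwargs


print(loading_kwargs())
```

The result can be unpacked into the loading call, for example `from_pretrained("WebOrganizer/TopicClassifier", **loading_kwargs())`, with `torch_dtype=torch.bfloat16` added by the caller when the efficient path is used.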
Documentation
The model takes a web page's URL and text as input, in the format shown under Basic Usage, and classifies the content into one of the 24 predefined categories. Applying a softmax to the output logits gives the probability of each category.
License
No license information is provided in the original document.
Citation
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}