WebOrganizer/TopicClassifier-NoURL
The TopicClassifier-NoURL organizes web content into 24 categories based on the text contents of web pages (without using URL information).
[Paper] [Website] [GitHub]
Quick Start
The model is a gte-base-en-v1.5 encoder with 140M parameters, fine-tuned on the training data described below.
Features
All Domain Classifiers
Training Data
The model is fine-tuned on the following training data:
- WebOrganizer/TopicAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
Model Information
| Property | Details |
|----------|---------|
| Library Name | transformers |
| Datasets | WebOrganizer/TopicAnnotations-Llama-3.1-8B, WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8 |
| Base Model | Alibaba-NLP/gte-base-en-v1.5 |
Usage Examples
Basic Usage
This classifier expects input in the following format:
{text}
Example:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/TopicClassifier-NoURL")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier-NoURL",
    trust_remote_code=True,
    use_memory_efficient_attention=False,
)

web_page = """How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# Softmax over the logits gives a probability distribution over the 24 categories
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))  # index of the most likely category
You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels; see also id2label and label2id in the model config):
- Adult
- Art & Design
- Software Dev.
- Crime & Law
- Education & Jobs
- Hardware
- Entertainment
- Social Life
- Fashion & Beauty
- Finance & Business
- Food & Dining
- Games
- Health
- History
- Home & Hobbies
- Industrial
- Literature
- Politics
- Religion
- Science & Tech.
- Software
- Sports & Fitness
- Transportation
- Travel
The full definitions of the categories can be found in the taxonomy config.
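As a plain-Python sketch (no model download required), the logits-to-label step can be illustrated as follows. The label list is assumed to match the 24 categories above in id2label order; the logits are made-up example values, not real model output.

```python
import math

# The 24 topic labels, copied from the list above in id2label order.
LABELS = [
    "Adult", "Art & Design", "Software Dev.", "Crime & Law",
    "Education & Jobs", "Hardware", "Entertainment", "Social Life",
    "Fashion & Beauty", "Finance & Business", "Food & Dining", "Games",
    "Health", "History", "Home & Hobbies", "Industrial", "Literature",
    "Politics", "Religion", "Science & Tech.", "Software",
    "Sports & Fitness", "Transportation", "Travel",
]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_labels(logits, k=3):
    # Rank (label, probability) pairs by probability, descending.
    probs = softmax(logits)
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits where the "Hardware" category (index 5) dominates.
example_logits = [0.0] * 24
example_logits[5] = 5.0
print(top_k_labels(example_logits, k=3))
```

In practice you would pass `outputs.logits[0].tolist()` from the model into `top_k_labels`, or read the label names directly from `model.config.id2label`.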
Advanced Usage
Efficient Inference
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory efficient attention. This requires installing xformers
(see more here) and loading the model like:
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier-NoURL",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
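Since memory-efficient attention only works when xformers is installed (and, typically, when a CUDA-capable torch build is available), a small guard before loading can fall back gracefully. This helper is a sketch, not part of the model's API; the function name is hypothetical.

```python
import importlib.util

def can_use_memory_efficient_attention() -> bool:
    """Hypothetical helper: check whether the optional dependencies for
    memory-efficient attention appear to be installed.

    This only checks importability; xformers' memory-efficient kernels
    additionally require a CUDA-capable GPU at runtime.
    """
    has_torch = importlib.util.find_spec("torch") is not None
    has_xformers = importlib.util.find_spec("xformers") is not None
    return has_torch and has_xformers

# Example: choose the attention flag based on the environment.
use_mea = can_use_memory_efficient_attention()
print(f"use_memory_efficient_attention={use_mea}")
```

You could then pass `use_memory_efficient_attention=use_mea` to `from_pretrained` so the same script runs on machines without xformers.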
Documentation
Citation
@article{wettig2025organize,
title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
journal={arXiv preprint arXiv:2502.10341},
year={2025}
}