WebOrganizer/TopicClassifier-NoURL
The TopicClassifier-NoURL organizes web content into 24 categories based on the text contents of web pages (without using URL information).
[Paper] [Website] [GitHub]
Quick Start
The model is a gte-base-en-v1.5 encoder with 140M parameters, fine-tuned on the training data described below.
Features
All Domain Classifiers
Training Data
The model is fine-tuned on the following training data:
- WebOrganizer/TopicAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
Model Information
| Property | Details |
|----------|---------|
| Library Name | transformers |
| Datasets | WebOrganizer/TopicAnnotations-Llama-3.1-8B, WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8 |
| Base Model | Alibaba-NLP/gte-base-en-v1.5 |
Usage Examples
Basic Usage
This classifier expects input in the following format:
{text}
Example:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/TopicClassifier-NoURL")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier-NoURL",
    trust_remote_code=True,
    use_memory_efficient_attention=False,
)

web_page = """How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# Softmax over the logits gives a probability distribution over the 24 categories
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))  # index of the most likely category
You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels; see also id2label and label2id in the model config):
- Adult
- Art & Design
- Software Dev.
- Crime & Law
- Education & Jobs
- Hardware
- Entertainment
- Social Life
- Fashion & Beauty
- Finance & Business
- Food & Dining
- Games
- Health
- History
- Home & Hobbies
- Industrial
- Literature
- Politics
- Religion
- Science & Tech.
- Software
- Sports & Fitness
- Transportation
- Travel
The full definitions of the categories can be found in the taxonomy config.
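As a plain-Python sketch (no model download required), the logits-to-label step can be illustrated as follows. The label list is assumed to match the 24 categories above in id2label order; the logits are made-up example values, not real model output.

```python
import math

# The 24 topic labels, copied from the list above in id2label order.
LABELS = [
    "Adult", "Art & Design", "Software Dev.", "Crime & Law",
    "Education & Jobs", "Hardware", "Entertainment", "Social Life",
    "Fashion & Beauty", "Finance & Business", "Food & Dining", "Games",
    "Health", "History", "Home & Hobbies", "Industrial", "Literature",
    "Politics", "Religion", "Science & Tech.", "Software",
    "Sports & Fitness", "Transportation", "Travel",
]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_labels(logits, k=3):
    # Rank (label, probability) pairs by probability, descending.
    probs = softmax(logits)
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits where the "Hardware" category (index 5) dominates.
example_logits = [0.0] * 24
example_logits[5] = 5.0
print(top_k_labels(example_logits, k=3))
```

In practice you would pass `outputs.logits[0].tolist()` from the model into `top_k_labels`, or read the label names directly from `model.config.id2label`.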
Advanced Usage
Efficient Inference
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory efficient attention. This requires installing xformers
(see more here) and loading the model like:
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier-NoURL",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
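Since memory-efficient attention only works when xformers is installed (and, typically, when a CUDA-capable torch build is available), a small guard before loading can fall back gracefully. This helper is a sketch, not part of the model's API; the function name is hypothetical.

```python
import importlib.util

def can_use_memory_efficient_attention() -> bool:
    """Hypothetical helper: check whether the optional dependencies for
    memory-efficient attention appear to be installed.

    This only checks importability; xformers' memory-efficient kernels
    additionally require a CUDA-capable GPU at runtime.
    """
    has_torch = importlib.util.find_spec("torch") is not None
    has_xformers = importlib.util.find_spec("xformers") is not None
    return has_torch and has_xformers

# Example: choose the attention flag based on the environment.
use_mea = can_use_memory_efficient_attention()
print(f"use_memory_efficient_attention={use_mea}")
```

You could then pass `use_memory_efficient_attention=use_mea` to `from_pretrained` so the same script runs on machines without xformers.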
Documentation
Citation
@article{wettig2025organize,
title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
journal={arXiv preprint arXiv:2502.10341},
year={2025}
}