🚀 WebOrganizer/FormatClassifier
The FormatClassifier categorizes web content into 24 distinct format categories, using both the URL and text content of a web page to make its prediction. The model is based on gte-base-en-v1.5 (140M parameters) and fine-tuned for this classification task.
[Paper] [Website] [GitHub]
✨ Features
- Categorization: Organizes web content into 24 categories using URL and text information.
- Fine-Tuned Model: Based on gte-base-en-v1.5 and fine-tuned on dedicated training datasets.
📦 Installation
The installation details are not provided in the original README. To install the necessary dependencies for this model, refer to the official documentation of the transformers library and other related components.
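As a minimal starting point, the commands below install the packages that the usage examples import; treat them as an assumption inferred from the code rather than an official requirement, since the original README pins nothing:
```bash
# Assumed dependencies, inferred from the usage examples below
pip install torch transformers
# Optional: only needed for the memory-efficient attention path described under "Efficient Inference"
pip install xformers
```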
💻 Usage Examples
Basic Usage
This classifier expects input in the following format:
```
{url}
{text}
```
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False,
)

# Web page in the "{url}\n{text}" input format described above
web_page = """http://www.example.com
How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# Convert logits to a probability distribution over the 24 categories
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
```
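The argmax index can be mapped to a human-readable category name through the `id2label` mapping in the model config (documented below). The following short sketch reuses `probs` and `model` from the example above:
```python
# Look up the predicted category name in the config's id2label mapping
pred = probs.argmax(dim=-1).item()
print(model.config.id2label[pred])
```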
Advanced Usage
You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels; see also `id2label` and `label2id` in the model config):
- Academic Writing
- Content Listing
- Creative Writing
- Customer Support
- Comment Section
- FAQ
- Truncated
- Knowledge Article
- Legal Notices
- Listicle
- News Article
- Nonfiction Writing
- About (Org.)
- News (Org.)
- About (Pers.)
- Personal Blog
- Product Page
- Q&A Forum
- Spam / Ads
- Structured Data
- Documentation
- Audio Transcript
- Tutorial
- User Review
The full definitions of the categories can be found in the taxonomy config.
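To inspect more of the probability distribution than just the top label, you can combine the softmax output with `id2label`. The snippet below is a small sketch that reuses `probs` and `model` from the basic usage example and assumes `torch` is available:
```python
import torch

# Print the three most probable categories with their probabilities
top = torch.topk(probs[0], k=3)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```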
Efficient Inference
We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing xformers (see more here) and loading the model like:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
```
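Batched inference is where unpadding pays off. The sketch below classifies several pages at once, reusing the tokenizer from the basic usage example; the padding/truncation settings are standard tokenizer options rather than requirements stated in the original README, and a GPU is assumed:
```python
device = "cuda"  # assumption: xformers memory-efficient attention targets CUDA devices
model = model.to(device)

pages = [
    "http://www.example.com\nHow to make a good sandwich? [Click here to read article]",
    "http://www.example.org\nTerms of Service for example.org users.",
]

# Pad to the longest page in the batch and truncate to the model's maximum length
inputs = tokenizer(pages, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print([model.config.id2label[i] for i in probs.argmax(dim=-1).tolist()])
```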
📚 Documentation
- Model Information
- All Domain Classifiers
📄 License
The license information is not provided in the original README.
📖 Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```