🚀 WebOrganizer/FormatClassifier
The FormatClassifier categorizes web content into 24 distinct format categories, using both the URL and text content of a web page to make its prediction. The model is based on gte-base-en-v1.5 (140M parameters) and fine-tuned for this classification task.
[Paper] [Website] [GitHub]
✨ Features
- Categorization: Organizes web content into 24 categories using URL and text information.
- Fine-Tuned Model: Based on gte-base-en-v1.5 and fine-tuned on dedicated training datasets.
📦 Installation
The installation details are not provided in the original README. To install the necessary dependencies for this model, refer to the official documentation of the transformers library and other related components.
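As a minimal starting point, the commands below install the packages that the usage examples import; treat them as an assumption inferred from the code rather than an official requirement, since the original README pins nothing:
```bash
# Assumed dependencies, inferred from the usage examples below
pip install torch transformers
# Optional: only needed for the memory-efficient attention path described under "Efficient Inference"
pip install xformers
```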
💻 Usage Examples
Basic Usage
This classifier expects input in the following format:
```
{url}
{text}
```
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False,
)

# Web page in the "{url}\n{text}" input format described above
web_page = """http://www.example.com
How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

# Convert logits to a probability distribution over the 24 categories
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
```
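The argmax index can be mapped to a human-readable category name through the `id2label` mapping in the model config (documented below). The following short sketch reuses `probs` and `model` from the example above:
```python
# Look up the predicted category name in the config's id2label mapping
pred = probs.argmax(dim=-1).item()
print(model.config.id2label[pred])
```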
Advanced Usage
You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels; see also `id2label` and `label2id` in the model config):
- Academic Writing
- Content Listing
- Creative Writing
- Customer Support
- Comment Section
- FAQ
- Truncated
- Knowledge Article
- Legal Notices
- Listicle
- News Article
- Nonfiction Writing
- About (Org.)
- News (Org.)
- About (Pers.)
- Personal Blog
- Product Page
- Q&A Forum
- Spam / Ads
- Structured Data
- Documentation
- Audio Transcript
- Tutorial
- User Review
The full definitions of the categories can be found in the taxonomy config.
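To inspect more of the probability distribution than just the top label, you can combine the softmax output with `id2label`. The snippet below is a small sketch that reuses `probs` and `model` from the basic usage example and assumes `torch` is available:
```python
import torch

# Print the three most probable categories with their probabilities
top = torch.topk(probs[0], k=3)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```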
Efficient Inference
We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing xformers (see more here) and loading the model like:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
```
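Batched inference is where unpadding pays off. The sketch below classifies several pages at once, reusing the tokenizer from the basic usage example; the padding/truncation settings are standard tokenizer options rather than requirements stated in the original README, and a GPU is assumed:
```python
device = "cuda"  # assumption: xformers memory-efficient attention targets CUDA devices
model = model.to(device)

pages = [
    "http://www.example.com\nHow to make a good sandwich? [Click here to read article]",
    "http://www.example.org\nTerms of Service for example.org users.",
]

# Pad to the longest page in the batch and truncate to the model's maximum length
inputs = tokenizer(pages, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print([model.config.id2label[i] for i in probs.argmax(dim=-1).tolist()])
```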
📚 Documentation
- Model Information
- All Domain Classifiers
📄 License
The license information is not provided in the original README.
📖 Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```