🧹 scandi-fine-web-cleaner
This model is a demo classifier for identifying problematic content (incorrect language, garbled text) in Danish and Swedish web text, intended to improve annotation efficiency by pre-filtering web data. It was created as part of a blog post exploring how to filter web data using community annotations, by fine-tuning FacebookAI/xlm-roberta-base on the data-is-better-together/fineweb-c dataset.
It achieves the following results on the evaluation set:
- Precision: 0.9524 (95.2%)
- Recall: 0.7018 (70.2%)
- F1: 0.8081
- AUC-ROC: 0.9648
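For reference, metrics like these can be computed from raw predictions with scikit-learn. The sketch below uses toy labels and scores purely for illustration; it does not reproduce the actual evaluation data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Toy example only: 1 = problematic, 0 = clean.
# y_prob is the model's probability for the positive (problematic) class.
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.3, 0.6, 0.4, 0.8, 0.9]
y_pred = [int(p >= 0.5) for p in y_prob]  # 0.5 decision threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("auc-roc:  ", roc_auc_score(y_true, y_prob))  # uses probabilities, not hard labels
```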
🚀 Quick Start
This model can be used directly as a preliminary filter for Danish and Swedish web text to improve annotation efficiency, or further fine-tuned on the data-is-better-together/fineweb-c dataset. A minimal inference sketch is shown below.
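The snippet uses the transformers `pipeline` API. The repository path is assumed from the model name, and the label strings come from the model's own `id2label` mapping, so check `config.json` for the actual values.

```python
from transformers import pipeline

# Repository path is an assumption -- replace with this model's actual Hub id.
classifier = pipeline("text-classification", model="davanstrien/scandi-fine-web-cleaner")

texts = [
    "Dette er en almindelig dansk sætning.",  # ordinary Danish sentence
    "xjq zzkw 0x3F ### kfjd aaaa",            # garbled text
]

# truncation=True keeps long web documents within the model's max length.
for text, pred in zip(texts, classifier(texts, truncation=True)):
    print(f"{pred['label']} ({pred['score']:.3f}): {text}")
```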
✨ Features
- High Precision: at 95.2% precision, false positives are rare.
- Good Recall: at 70.2% recall, it catches most problematic content.
- Multilingual Base: built on XLM-RoBERTa, though it has only been tested on Danish and Swedish content so far.
📖 Documentation
Intended uses & limitations
The model is intended as a preliminary filter for web text to help improve annotation efficiency. It has only been tested on Danish and Swedish content. Its high precision (95.2%) means false positives are rare, while its recall (70.2%) indicates it catches most problematic content. A dataset-level filtering sketch follows.
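The sketch below filters a split of fineweb-c with the classifier. The repository path, config name (`dan_Latn`), column name (`text`), and label string are assumptions; check the dataset card and this model's `config.json` for the real values.

```python
from datasets import load_dataset
from transformers import pipeline

# Both ids below are assumptions -- verify against the Hub.
clf = pipeline("text-classification", model="davanstrien/scandi-fine-web-cleaner")
ds = load_dataset("data-is-better-together/fineweb-c", "dan_Latn", split="train")

# Keep only documents the classifier does not flag as problematic.
clean = ds.filter(
    lambda batch: [p["label"] == "clean" for p in clf(batch["text"], truncation=True)],
    batched=True,
    batch_size=32,
)
print(f"kept {len(clean)} of {len(ds)} documents")
```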
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 10
- mixed_precision_training: Native AMP
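The following is a minimal training sketch under these hyperparameters. It substitutes a two-example toy dataset for the prepared fineweb-c annotations (deriving binary labels from the dataset's annotations is left out), so treat it as the shape of the setup rather than the exact training script.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base", num_labels=2
)

# Toy stand-in for the prepared fineweb-c annotations (0 = clean, 1 = problematic).
train_ds = Dataset.from_dict({
    "text": ["Dette er en almindelig dansk sætning.", "xjq zzkw 0x3F ###"],
    "label": [0, 1],
}).map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="scandi-fine-web-cleaner",
    learning_rate=2e-05,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,  # Native AMP; requires a GPU
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
)
trainer.train()
```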
Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | AUC-ROC | Balanced Accuracy | Average Precision |
|---|---|---|---|---|---|---|---|---|---|
| 0.3165 | 1.0 | 100 | 0.2333 | 0.95 | 0.6667 | 0.7835 | 0.8099 | 0.8304 | 0.7721 |
| 0.1929 | 2.0 | 200 | 0.1359 | 0.9130 | 0.7368 | 0.8155 | 0.9778 | 0.8626 | 0.9105 |
| 0.1775 | 3.0 | 300 | 0.2245 | 0.9268 | 0.6667 | 0.7755 | 0.9481 | 0.8290 | 0.8721 |
| 0.1553 | 4.0 | 400 | 0.1816 | 0.9524 | 0.7018 | 0.8081 | 0.9648 | 0.8480 | 0.8906 |
Framework versions
- Transformers 4.48.0
- Pytorch 2.5.1+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
📄 License
This project is licensed under the MIT license.