# FineWeb-Edu FastText classifier
This is a FastText classifier that assesses the educational value of web pages. It is trained on the fineweb-edu-llama3-annotations dataset and has two aims: optimizing throughput and providing a point of comparison with transformer-based models.
## Quick Start
The FineWeb-Edu FastText classifier can classify more than 2,000 examples per second on a CPU, making it suitable for processing large datasets during pretraining. It also provides an interesting comparison with the original model, HuggingFaceFW/fineweb-edu-classifier.
## Features
- **Throughput optimisation**: classifies over 2,000 examples per second on a CPU, enabling fast processing of large datasets during pretraining (see the timing sketch after this list).
- **FastText vs transformer-based model**: compares this lightweight, limited-capacity model with the original transformer-based model, HuggingFaceFW/fineweb-edu-classifier.
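The throughput figure can be sanity-checked with a simple timing loop. This is a minimal sketch, not a rigorous benchmark; it assumes the model has already been loaded as `model_hf` (see Usage Examples below), and the document text is an arbitrary placeholder:

```python
import time

# Placeholder corpus: any list of newline-free strings works.
docs = ["An introduction to photosynthesis for middle school students."] * 10_000

start = time.perf_counter()
model_hf.predict(docs)  # batch prediction over the whole list
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:.0f} examples/second")
```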
## Installation
The usage code below assumes that `fasttext` and `huggingface_hub` are installed. You can install both with `pip`:

```bash
pip install fasttext huggingface_hub
```
## Usage Examples
### Basic Usage
```python
from typing import List
import re

import fasttext
from huggingface_hub import hf_hub_download

# Download the model from the Hugging Face Hub and load it.
model_hf = fasttext.load_model(
    hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin")
)


def replace_newlines(text: str) -> str:
    # FastText treats "\n" as a document separator, so collapse newlines.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    # Strip the "__label__" prefix (safe here because labels are digits)
    # and return the top label with its probability.
    return [{"label": int(l[0].lstrip("__label__")), "score": s[0]}
            for l, s in zip(*pred)]


predict(["Hi"])
```
## Documentation
### Model Summary
This FastText classifier is trained on the fineweb-edu-llama3-annotations dataset. It has two main objectives: throughput optimization and comparison with the transformer-based HuggingFaceFW/fineweb-edu-classifier. The approach is inspired by an independently developed educational classifier.
### Evaluation
The last 46,867 samples of the dataset are held out as test data. A classification report and confusion matrix are provided below to evaluate the model's performance. The model reaches an accuracy of 68% and shows a conservative tendency (it under-predicts scores more often than it over-predicts them), which is beneficial when filtering large amounts of data. The Spearman rank-order correlation coefficient indicates a moderately strong monotonic relationship with the original model.
#### Classification Report

```
              precision    recall  f1-score   support

           0       0.72      0.44      0.55      5704
           1       0.73      0.87      0.80     26595
           2       0.52      0.49      0.50     10350
           3       0.48      0.33      0.39      3397
           4       0.69      0.03      0.06       819
           5       0.00      0.00      0.00         2

    accuracy                           0.68     46867
   macro avg       0.52      0.36      0.38     46867
weighted avg       0.67      0.68      0.66     46867
```
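For reference, both the report above and the confusion matrix below can be reproduced with scikit-learn. The variable names and toy values here are illustrative stand-ins; in practice `y_true` would hold the 46,867 gold annotations and `y_pred` the classifier's predictions on the test split:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative stand-ins for the gold labels and model predictions.
y_true = [0, 1, 1, 2, 3, 4]
y_pred = [0, 1, 2, 2, 3, 3]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4, 5]))
```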
### Comparison with Transformer-Based Model
#### Confusion Matrix

```
            y_pred
        [ 2537  3098    65     4     0     0]
        [  944 23037  2491   123     0     0]
y_true  [   26  4742  5048   533     1     0]
        [    4   434  1846  1105     8     0]
        [    0    38   213   544    24     0]
        [    0     0     0     0     2     0]
```
#### Rating Frequency

| Predicted - Actual Rating | Frequency | % |
|---------------------------|-----------|-------|
| 0 | 31751 | 67.7% |
| -1 | 8078 | 17.2% |
| +1 | 6130 | 13.1% |
| -2 | 673 | 1.4% |
| +2 | 189 | 0.4% |
| -3 | 42 | 0.1% |
| +3 | 4 | 0.0% |
The Spearman rank-order correlation coefficient with the original model's ratings is 0.5881 on the MiniPile train split and 0.5832 on the test split, indicating a moderately strong monotonic relationship across more than one million documents representative of web data.
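A sketch of how both the rating-difference frequencies and the Spearman coefficient above could be computed, assuming two aligned lists of integer ratings; the variable names and toy values are illustrative:

```python
from collections import Counter
from scipy.stats import spearmanr

# Illustrative stand-ins for per-document ratings from the two models.
pred_ratings = [1, 2, 0, 3, 1, 2]  # this FastText classifier
ref_ratings = [1, 1, 0, 3, 2, 2]   # reference ratings it is compared against

# Frequency of (predicted - reference) rating differences, as in the table above.
diffs = Counter(p - r for p, r in zip(pred_ratings, ref_ratings))
total = sum(diffs.values())
for diff, count in diffs.most_common():
    print(f"{diff:+d}: {count} ({count / total:.1%})")

# Spearman rank-order correlation between the two sets of ratings.
rho, _ = spearmanr(pred_ratings, ref_ratings)
print(f"Spearman rho = {rho:.4f}")
```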
## License
This project is licensed under `odc-by`.