# FineWeb-Edu FastText classifier
This is a FastText classifier that assesses the educational value of web pages. It is trained on the fineweb-edu-llama3-annotations dataset and has two aims: optimizing throughput and providing a point of comparison with transformer-based models.
## Quick Start
The FineWeb-Edu FastText classifier can classify more than 2,000 examples per second on a CPU, making it suitable for processing large datasets during pretraining. It also provides an interesting comparison with the original model, HuggingFaceFW/fineweb-edu-classifier.
## Features
- **Throughput optimisation**: classifies over 2,000 examples per second on a CPU, enabling fast processing of large datasets during pretraining (see the timing sketch after this list).
- **FastText vs transformer-based model**: compares this lightweight, limited-capacity model with the original transformer-based model, HuggingFaceFW/fineweb-edu-classifier.
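The throughput figure can be sanity-checked with a simple timing loop. This is a minimal sketch, not a rigorous benchmark; it assumes the model has already been loaded as `model_hf` (see Usage Examples below), and the document text is an arbitrary placeholder:

```python
import time

# Placeholder corpus: any list of newline-free strings works.
docs = ["An introduction to photosynthesis for middle school students."] * 10_000

start = time.perf_counter()
model_hf.predict(docs)  # batch prediction over the whole list
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:.0f} examples/second")
```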
## Installation
The usage code below assumes that `fasttext` and `huggingface_hub` are installed. You can install both with `pip`:

```bash
pip install fasttext huggingface_hub
```
## Usage Examples
### Basic Usage
```python
from typing import List
import re

import fasttext
from huggingface_hub import hf_hub_download

# Download the model from the Hugging Face Hub and load it.
model_hf = fasttext.load_model(
    hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin")
)


def replace_newlines(text: str) -> str:
    # FastText treats "\n" as a document separator, so collapse newlines.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    # Strip the "__label__" prefix (safe here because labels are digits)
    # and return the top label with its probability.
    return [{"label": int(l[0].lstrip("__label__")), "score": s[0]}
            for l, s in zip(*pred)]


predict(["Hi"])
```
## Documentation
### Model Summary
This FastText classifier is trained on the fineweb-edu-llama3-annotations dataset. It has two main objectives: throughput optimization and comparison with the transformer-based HuggingFaceFW/fineweb-edu-classifier. The approach is inspired by an independently developed educational classifier.
### Evaluation
The last 46,867 samples of the dataset are held out as test data. A classification report and confusion matrix are provided below to evaluate the model's performance. The model reaches an accuracy of 68% and shows a conservative tendency (it under-predicts scores more often than it over-predicts them), which is beneficial when filtering large amounts of data. The Spearman rank-order correlation coefficient indicates a moderately strong monotonic relationship with the original model.
#### Classification Report

```
              precision    recall  f1-score   support

           0       0.72      0.44      0.55      5704
           1       0.73      0.87      0.80     26595
           2       0.52      0.49      0.50     10350
           3       0.48      0.33      0.39      3397
           4       0.69      0.03      0.06       819
           5       0.00      0.00      0.00         2

    accuracy                           0.68     46867
   macro avg       0.52      0.36      0.38     46867
weighted avg       0.67      0.68      0.66     46867
```
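For reference, both the report above and the confusion matrix below can be reproduced with scikit-learn. The variable names and toy values here are illustrative stand-ins; in practice `y_true` would hold the 46,867 gold annotations and `y_pred` the classifier's predictions on the test split:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative stand-ins for the gold labels and model predictions.
y_true = [0, 1, 1, 2, 3, 4]
y_pred = [0, 1, 2, 2, 3, 3]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3, 4, 5]))
```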
### Comparison with Transformer-Based Model
#### Confusion Matrix

```
            y_pred
        [ 2537  3098    65     4     0     0]
        [  944 23037  2491   123     0     0]
y_true  [   26  4742  5048   533     1     0]
        [    4   434  1846  1105     8     0]
        [    0    38   213   544    24     0]
        [    0     0     0     0     2     0]
```
#### Rating Frequency

| Predicted - Actual Rating | Frequency | % |
|---------------------------|-----------|-------|
| 0 | 31751 | 67.7% |
| -1 | 8078 | 17.2% |
| +1 | 6130 | 13.1% |
| -2 | 673 | 1.4% |
| +2 | 189 | 0.4% |
| -3 | 42 | 0.1% |
| +3 | 4 | 0.0% |
The Spearman rank-order correlation coefficient with the original model's ratings is 0.5881 on the MiniPile train split and 0.5832 on the test split, indicating a moderately strong monotonic relationship across more than one million documents representative of web data.
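A sketch of how both the rating-difference frequencies and the Spearman coefficient above could be computed, assuming two aligned lists of integer ratings; the variable names and toy values are illustrative:

```python
from collections import Counter
from scipy.stats import spearmanr

# Illustrative stand-ins for per-document ratings from the two models.
pred_ratings = [1, 2, 0, 3, 1, 2]  # this FastText classifier
ref_ratings = [1, 1, 0, 3, 2, 2]   # reference ratings it is compared against

# Frequency of (predicted - reference) rating differences, as in the table above.
diffs = Counter(p - r for p, r in zip(pred_ratings, ref_ratings))
total = sum(diffs.values())
for diff, count in diffs.most_common():
    print(f"{diff:+d}: {count} ({count / total:.1%})")

# Spearman rank-order correlation between the two sets of ratings.
rho, _ = spearmanr(pred_ratings, ref_ratings)
print(f"Spearman rho = {rho:.4f}")
```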
## License
This project is licensed under `odc-by`.