DistilBERT Reuters-21578
A DistilBERT-based multi-label news classification model for Reuters-21578, fine-tuned on the ModApte split of the dataset and suitable for English news topic classification.
Model Overview
This model is a DistilBERT variant fine-tuned on the Reuters-21578 dataset, specifically designed for multi-label text classification tasks, capable of identifying multiple relevant topics in news articles.
Model Features
Efficient and Lightweight
Based on the DistilBERT architecture, it significantly reduces model size while maintaining high performance.
Multi-label Classification
Supports predicting multiple relevant topic labels for news articles simultaneously.
Precision Priority
The model prioritizes precision over recall, making it suitable for applications where false positives are costly.
Model Capabilities
English News Classification
Multi-label Prediction
Topic Identification
Use Cases
News Analysis
News Topic Tagging
Automatically tag news articles with relevant topic labels
Achieves a micro-averaged F1 score of 0.86 on the Reuters-21578 test set
Content Classification System
Build an automatic classification module for news content management systems
distilbert-finetuned-reuters21578-multilabel
This is a fine-tuned text-classification model based on DistilBERT, trained on the Reuters-21578 dataset for multi-label news classification.
Quick Start
This model is a fine-tuned version of [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) on the reuters21578
dataset. You can use it for text classification tasks as shown in the inference example below.
Usage Examples
Basic Usage
```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="lxyuan/distilbert-finetuned-reuters21578-multilabel",
    return_all_scores=True,
)

# dataset["test"]["text"][2]
news_article = (
    "JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWARDS The Ministry of International Trade and "
    "Industry (MITI) will revise its long-term energy supply/demand "
    "outlook by August to meet a forecast downtrend in Japanese "
    "energy demand, ministry officials said. "
    "MITI is expected to lower the projection for primary energy "
    "supplies in the year 2000 to 550 mln kilolitres (kl) from 600 "
    "mln, they said. "
    "The decision follows the emergence of structural changes in "
    "Japanese industry following the rise in the value of the yen "
    "and a decline in domestic electric power demand. "
    "MITI is planning to work out a revised energy supply/demand "
    "outlook through deliberations of committee meetings of the "
    "Agency of Natural Resources and Energy, the officials said. "
    "They said MITI will also review the breakdown of energy "
    "supply sources, including oil, nuclear, coal and natural gas. "
    "Nuclear energy provided the bulk of Japan's electric power "
    "in the fiscal year ended March 31, supplying an estimated 27 "
    "pct on a kilowatt/hour basis, followed by oil (23 pct) and "
    "liquefied natural gas (21 pct), they noted. "
    "REUTER"
)

# dataset["test"]["topics"][2]
target_topics = ['crude', 'nat-gas']

fn_kwargs = {"padding": "max_length", "truncation": True, "max_length": 512}
output = pipe(news_article, function_to_apply="sigmoid", **fn_kwargs)

for item in output[0]:
    if item["score"] >= 0.5:
        print(item["label"], item["score"])

>>> crude 0.7355073690414429
nat-gas 0.8600426316261292
```
Documentation
Overall Summary and Comparison Table
| Property | Baseline (Scikit-learn) | Transformer Model |
|---|---|---|
| Micro-Averaged F1 | 0.77 | 0.86 |
| Macro-Averaged F1 | 0.29 | 0.33 |
| Weighted Average F1 | 0.70 | 0.84 |
| Samples Average F1 | 0.75 | 0.80 |
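For reference, the four averaged F1 variants in the table can be reproduced with scikit-learn once predictions are available as binary indicator matrices. The following is a minimal sketch, assuming `y_true` and `y_pred` are multi-hot arrays of shape `(n_samples, n_labels)`; it is illustrative and not taken from the original notebooks:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-hot matrices: rows are documents, columns are topic labels.
# In practice these come from the test set and the model's thresholded sigmoid outputs.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

for avg in ["micro", "macro", "weighted", "samples"]:
    # Each averaging mode aggregates per-label F1 differently, which is why
    # the micro and macro scores in the table diverge so strongly.
    print(avg, f1_score(y_true, y_pred, average=avg, zero_division=0))
```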
Important Note
- Precision vs Recall: Both models prioritize high precision over recall. In our client-facing news classification model, precision takes precedence over recall because the repercussions of false positives are more severe and harder to justify to clients than false negatives.
- Class Imbalance Handling: Both models perform poorly on minority classes, as reflected in the low macro-averaged F1 scores. However, the transformer model shows a slight improvement in macro-averaged F1 (0.33 vs 0.29).
- Zero-Support Labels: Both models have zero support for several labels, which can skew the performance metrics and may suggest that the models are not well tuned for minority classes, or that the dataset lacks sufficient examples of these classes.
- General Performance: The transformer model surpasses the scikit-learn baseline in weighted and samples average F1 scores, indicating better overall performance and better handling of label imbalance.
- Conclusion: While both models exhibit high precision, the transformer model slightly outperforms the scikit-learn baseline on all metrics considered. It provides a better trade-off between precision and recall and some improvement in handling minority classes.
Training and evaluation data
We remove single-appearance labels from both the training and test sets using the following code:
```python
from collections import Counter
from itertools import chain

from datasets import load_dataset

# Find Single Appearance Labels
def find_single_appearance_labels(y):
    """Find labels that appear only once in the dataset."""
    all_labels = list(chain.from_iterable(y))
    label_count = Counter(all_labels)
    single_appearance_labels = [label for label, count in label_count.items() if count == 1]
    return single_appearance_labels

# Remove Single Appearance Labels from Dataset
def remove_single_appearance_labels(dataset, single_appearance_labels):
    """Remove samples with single-appearance labels from both train and test sets."""
    for split in ['train', 'test']:
        dataset[split] = dataset[split].filter(
            lambda x: all(label not in single_appearance_labels for label in x['topics'])
        )
    return dataset

dataset = load_dataset("reuters21578", "ModApte")

# Find and Remove Single Appearance Labels
y_train = [item['topics'] for item in dataset['train']]
single_appearance_labels = find_single_appearance_labels(y_train)
print(f"Single appearance labels: {single_appearance_labels}")
>>> Single appearance labels: ['lin-oil', 'rye', 'red-bean', 'groundnut-oil', 'citruspulp', 'rape-meal', 'corn-oil', 'peseta', 'cotton-oil', 'ringgit', 'castorseed', 'castor-oil', 'lit', 'rupiah', 'skr', 'nkr', 'dkr', 'sun-meal', 'lin-meal', 'cruzado']

print("Removing samples with single-appearance labels...")
dataset = remove_single_appearance_labels(dataset, single_appearance_labels)

unique_labels = set(chain.from_iterable(dataset['train']["topics"]))
print(f"We have {len(unique_labels)} unique labels:\n{unique_labels}")
>>> We have 95 unique labels:
{'veg-oil', 'gold', 'platinum', 'ipi', 'acq', 'carcass', 'wool', 'coconut-oil', 'linseed', 'copper', 'soy-meal', 'jet', 'dlr', 'copra-cake', 'hog', 'rand', 'strategic-metal', 'can', 'tea', 'sorghum', 'livestock', 'barley', 'lumber', 'earn', 'wheat', 'trade', 'soy-oil', 'cocoa', 'inventories', 'income', 'rubber', 'tin', 'iron-steel', 'ship', 'rapeseed', 'wpi', 'sun-oil', 'pet-chem', 'palmkernel', 'nat-gas', 'gnp', 'l-cattle', 'propane', 'rice', 'lead', 'alum', 'instal-debt', 'saudriyal', 'cpu', 'jobs', 'meal-feed', 'oilseed', 'dmk', 'plywood', 'zinc', 'retail', 'dfl', 'cpi', 'crude', 'pork-belly', 'gas', 'money-fx', 'corn', 'tapioca', 'palladium', 'lei', 'cornglutenfeed', 'sunseed', 'potato', 'silver', 'sugar', 'grain', 'groundnut', 'naphtha', 'orange', 'soybean', 'coconut', 'stg', 'cotton', 'yen', 'rape-oil', 'palm-oil', 'oat', 'reserves', 'housing', 'interest', 'coffee', 'fuel', 'austdlr', 'money-supply', 'heat', 'fishmeal', 'bop', 'nickel', 'nzdlr'}
```
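To train a multi-label classifier on these topics, each example's topic list must be converted into a multi-hot vector over the 95 remaining labels. The exact preprocessing used for this model is in the training notebooks; the snippet below is only a minimal sketch that continues from the code above and uses scikit-learn's `MultiLabelBinarizer` as one way to do it:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Fix the label order once so it can be reused at inference time.
mlb = MultiLabelBinarizer(classes=sorted(unique_labels))
y_train_mh = mlb.fit_transform(dataset['train']['topics'])
y_test_mh = mlb.transform(dataset['test']['topics'])

print(y_train_mh.shape)   # (n_train_examples, 95) multi-hot matrix
print(mlb.classes_[:5])   # first few label names in column order
```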
Training procedure
- EDA on Reuters-21578 dataset: This notebook provides an Exploratory Data Analysis (EDA) of the Reuters-21578 dataset, including visualizations and statistical summaries.
- Reuters Baseline Scikit-Learn Model: This notebook establishes a baseline model for text classification on the Reuters-21578 dataset using scikit-learn.
- Reuters Transformer Model: This notebook delves into advanced text classification using a Transformer model on the Reuters-21578 dataset.
- Multilabel Stratified Sampling & Hyperparameter Search on Reuters Dataset: This notebook explores advanced machine learning techniques through the Hugging Face Trainer API, including Multilabel Iterative Stratified Splitting and Hyperparameter Search (a minimal splitting sketch follows this list).
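The exact splitting code lives in the notebook above. Purely as an illustration, the `iterative-stratification` package (an assumed dependency, not confirmed by this card) provides a multilabel-aware train/validation split:

```python
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

# Toy multi-hot label matrix; in practice this is the binarized training topics.
Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 1]])
X = np.arange(len(Y)).reshape(-1, 1)  # placeholder features (row indices)

msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
for train_idx, valid_idx in msss.split(X, Y):
    # The split keeps each label's positive rate as balanced as possible across folds,
    # which matters for the many rare Reuters topics.
    print("train:", train_idx, "valid:", valid_idx)
```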
Evaluation results
Transformer Model Evaluation Result
```text
Classification Report:
                precision    recall  f1-score   support

           acq       0.97      0.93      0.95       719
          alum       1.00      0.70      0.82        23
       austdlr       0.00      0.00      0.00         0
        barley       1.00      0.50      0.67        12
           bop       0.79      0.50      0.61        30
           can       0.00      0.00      0.00         0
       carcass       0.67      0.67      0.67        18
         cocoa       1.00      1.00      1.00        18
       coconut       0.00      0.00      0.00         2
   coconut-oil       0.00      0.00      0.00         2
        coffee       0.86      0.89      0.87        27
        copper       1.00      0.78      0.88        18
    copra-cake       0.00      0.00      0.00         1
          corn       0.84      0.87      0.86        55
cornglutenfeed       0.00      0.00      0.00         0
        cotton       0.92      0.67      0.77        18
           cpi       0.86      0.43      0.57        28
           cpu       0.00      0.00      0.00         1
         crude       0.87      0.93      0.90       189
           dfl       0.00
```
Technical Details
Model Information
| Property | Details |
|---|---|
| Model Type | distilbert-finetuned-reuters21578-multilabel |
| Training Data | reuters21578 |
| Metrics | F1, Accuracy |
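The card lists F1 and accuracy as evaluation metrics, but does not show how they were computed during training. A plausible `compute_metrics` hook for the Hugging Face Trainer in a multi-label setup is sketched below; it is an assumption for illustration, not the author's exact code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Sigmoid + 0.5 threshold turns each logit into an independent yes/no per label.
    probs = 1 / (1 + np.exp(-logits))
    preds = (probs >= 0.5).astype(int)
    return {
        "f1": f1_score(labels, preds, average="micro", zero_division=0),
        # Subset accuracy: an example counts as correct only if every label matches.
        "accuracy": accuracy_score(labels, preds),
    }
```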
License
This model is licensed under the Apache-2.0 license.