DistilBERT Reuters-21578
A DistilBERT-based multi-label news classification model for Reuters-21578, fine-tuned on the ModApte split of the dataset and suitable for English news topic classification.
Model Overview
This model is a DistilBERT variant fine-tuned on the Reuters-21578 dataset, specifically designed for multi-label text classification tasks, capable of identifying multiple relevant topics in news articles.
Model Features
Efficient and Lightweight
Based on the DistilBERT architecture, it significantly reduces model size while maintaining high performance.
Multi-label Classification
Supports predicting multiple relevant topic labels for news articles simultaneously.
Precision Priority
The model prioritizes precision over recall, making it suitable for applications where false positives are costly.
Model Capabilities
English News Classification
Multi-label Prediction
Topic Identification
Use Cases
News Analysis
News Topic Tagging
Automatically tag news articles with relevant topic labels
Achieves a micro-averaged F1 score of 0.86 on the Reuters-21578 test set
Content Classification System
Build an automatic classification module for news content management systems
distilbert-finetuned-reuters21578-multilabel
This is a fine-tuned text-classification model based on DistilBERT, trained on the Reuters-21578 dataset for multi-label news classification.
Quick Start
This model is a fine-tuned version of [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) on the reuters21578
dataset. You can use it for text classification tasks as shown in the inference example below.
Usage Examples
Basic Usage
```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="lxyuan/distilbert-finetuned-reuters21578-multilabel",
    return_all_scores=True,
)

# dataset["test"]["text"][2]
news_article = (
    "JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWARDS The Ministry of International Trade and "
    "Industry (MITI) will revise its long-term energy supply/demand "
    "outlook by August to meet a forecast downtrend in Japanese "
    "energy demand, ministry officials said. "
    "MITI is expected to lower the projection for primary energy "
    "supplies in the year 2000 to 550 mln kilolitres (kl) from 600 "
    "mln, they said. "
    "The decision follows the emergence of structural changes in "
    "Japanese industry following the rise in the value of the yen "
    "and a decline in domestic electric power demand. "
    "MITI is planning to work out a revised energy supply/demand "
    "outlook through deliberations of committee meetings of the "
    "Agency of Natural Resources and Energy, the officials said. "
    "They said MITI will also review the breakdown of energy "
    "supply sources, including oil, nuclear, coal and natural gas. "
    "Nuclear energy provided the bulk of Japan's electric power "
    "in the fiscal year ended March 31, supplying an estimated 27 "
    "pct on a kilowatt/hour basis, followed by oil (23 pct) and "
    "liquefied natural gas (21 pct), they noted. "
    "REUTER"
)

# dataset["test"]["topics"][2]
target_topics = ['crude', 'nat-gas']

fn_kwargs = {"padding": "max_length", "truncation": True, "max_length": 512}
output = pipe(news_article, function_to_apply="sigmoid", **fn_kwargs)

for item in output[0]:
    if item["score"] >= 0.5:
        print(item["label"], item["score"])

>>> crude 0.7355073690414429
nat-gas 0.8600426316261292
```
Documentation
Overall Summary and Comparison Table
| Property | Baseline (Scikit-learn) | Transformer Model |
|---|---|---|
| Micro-Averaged F1 | 0.77 | 0.86 |
| Macro-Averaged F1 | 0.29 | 0.33 |
| Weighted Average F1 | 0.70 | 0.84 |
| Samples Average F1 | 0.75 | 0.80 |
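For reference, the four averaged F1 variants in the table can be reproduced with scikit-learn once predictions are available as binary indicator matrices. The following is a minimal sketch, assuming `y_true` and `y_pred` are multi-hot arrays of shape `(n_samples, n_labels)`; it is illustrative and not taken from the original notebooks:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-hot matrices: rows are documents, columns are topic labels.
# In practice these come from the test set and the model's thresholded sigmoid outputs.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

for avg in ["micro", "macro", "weighted", "samples"]:
    # Each averaging mode aggregates per-label F1 differently, which is why
    # the micro and macro scores in the table diverge so strongly.
    print(avg, f1_score(y_true, y_pred, average=avg, zero_division=0))
```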
Important Note
- Precision vs Recall: Both models prioritize high precision over recall. In our client-facing news classification model, precision takes precedence over recall because the repercussions of false positives are more severe and harder to justify to clients than false negatives.
- Class Imbalance Handling: Both models perform poorly on minority classes, as reflected in the low macro-averaged F1 scores. However, the transformer model shows a slight improvement in macro-averaged F1 (0.33 vs 0.29).
- Zero-Support Labels: Both models have zero support for several labels, which can skew the performance metrics and may suggest that the models are not well tuned for minority classes, or that the dataset lacks sufficient examples of these classes.
- General Performance: The transformer model surpasses the scikit-learn baseline in weighted and samples average F1 scores, indicating better overall performance and better handling of label imbalance.
- Conclusion: While both models exhibit high precision, the transformer model slightly outperforms the scikit-learn baseline on all metrics considered. It provides a better trade-off between precision and recall and some improvement in handling minority classes.
Training and evaluation data
We remove single-appearance labels from both the training and test sets using the following code:
```python
from collections import Counter
from itertools import chain

from datasets import load_dataset

# Find Single Appearance Labels
def find_single_appearance_labels(y):
    """Find labels that appear only once in the dataset."""
    all_labels = list(chain.from_iterable(y))
    label_count = Counter(all_labels)
    single_appearance_labels = [label for label, count in label_count.items() if count == 1]
    return single_appearance_labels

# Remove Single Appearance Labels from Dataset
def remove_single_appearance_labels(dataset, single_appearance_labels):
    """Remove samples with single-appearance labels from both train and test sets."""
    for split in ['train', 'test']:
        dataset[split] = dataset[split].filter(
            lambda x: all(label not in single_appearance_labels for label in x['topics'])
        )
    return dataset

dataset = load_dataset("reuters21578", "ModApte")

# Find and Remove Single Appearance Labels
y_train = [item['topics'] for item in dataset['train']]
single_appearance_labels = find_single_appearance_labels(y_train)
print(f"Single appearance labels: {single_appearance_labels}")
>>> Single appearance labels: ['lin-oil', 'rye', 'red-bean', 'groundnut-oil', 'citruspulp', 'rape-meal', 'corn-oil', 'peseta', 'cotton-oil', 'ringgit', 'castorseed', 'castor-oil', 'lit', 'rupiah', 'skr', 'nkr', 'dkr', 'sun-meal', 'lin-meal', 'cruzado']

print("Removing samples with single-appearance labels...")
dataset = remove_single_appearance_labels(dataset, single_appearance_labels)

unique_labels = set(chain.from_iterable(dataset['train']["topics"]))
print(f"We have {len(unique_labels)} unique labels:\n{unique_labels}")
>>> We have 95 unique labels:
{'veg-oil', 'gold', 'platinum', 'ipi', 'acq', 'carcass', 'wool', 'coconut-oil', 'linseed', 'copper', 'soy-meal', 'jet', 'dlr', 'copra-cake', 'hog', 'rand', 'strategic-metal', 'can', 'tea', 'sorghum', 'livestock', 'barley', 'lumber', 'earn', 'wheat', 'trade', 'soy-oil', 'cocoa', 'inventories', 'income', 'rubber', 'tin', 'iron-steel', 'ship', 'rapeseed', 'wpi', 'sun-oil', 'pet-chem', 'palmkernel', 'nat-gas', 'gnp', 'l-cattle', 'propane', 'rice', 'lead', 'alum', 'instal-debt', 'saudriyal', 'cpu', 'jobs', 'meal-feed', 'oilseed', 'dmk', 'plywood', 'zinc', 'retail', 'dfl', 'cpi', 'crude', 'pork-belly', 'gas', 'money-fx', 'corn', 'tapioca', 'palladium', 'lei', 'cornglutenfeed', 'sunseed', 'potato', 'silver', 'sugar', 'grain', 'groundnut', 'naphtha', 'orange', 'soybean', 'coconut', 'stg', 'cotton', 'yen', 'rape-oil', 'palm-oil', 'oat', 'reserves', 'housing', 'interest', 'coffee', 'fuel', 'austdlr', 'money-supply', 'heat', 'fishmeal', 'bop', 'nickel', 'nzdlr'}
```
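To train a multi-label classifier on these topics, each example's topic list must be converted into a multi-hot vector over the 95 remaining labels. The exact preprocessing used for this model is in the training notebooks; the snippet below is only a minimal sketch that continues from the code above and uses scikit-learn's `MultiLabelBinarizer` as one way to do it:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Fix the label order once so it can be reused at inference time.
mlb = MultiLabelBinarizer(classes=sorted(unique_labels))
y_train_mh = mlb.fit_transform(dataset['train']['topics'])
y_test_mh = mlb.transform(dataset['test']['topics'])

print(y_train_mh.shape)   # (n_train_examples, 95) multi-hot matrix
print(mlb.classes_[:5])   # first few label names in column order
```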
Training procedure
- EDA on Reuters-21578 dataset: This notebook provides an Exploratory Data Analysis (EDA) of the Reuters-21578 dataset, including visualizations and statistical summaries.
- Reuters Baseline Scikit-Learn Model: This notebook establishes a baseline model for text classification on the Reuters-21578 dataset using scikit-learn.
- Reuters Transformer Model: This notebook delves into advanced text classification using a Transformer model on the Reuters-21578 dataset.
- Multilabel Stratified Sampling & Hyperparameter Search on Reuters Dataset: This notebook explores advanced machine learning techniques through the Hugging Face Trainer API, including Multilabel Iterative Stratified Splitting and Hyperparameter Search (a minimal splitting sketch follows this list).
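The exact splitting code lives in the notebook above. Purely as an illustration, the `iterative-stratification` package (an assumed dependency, not confirmed by this card) provides a multilabel-aware train/validation split:

```python
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

# Toy multi-hot label matrix; in practice this is the binarized training topics.
Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 1]])
X = np.arange(len(Y)).reshape(-1, 1)  # placeholder features (row indices)

msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
for train_idx, valid_idx in msss.split(X, Y):
    # The split keeps each label's positive rate as balanced as possible across folds,
    # which matters for the many rare Reuters topics.
    print("train:", train_idx, "valid:", valid_idx)
```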
Evaluation results
Transformer Model Evaluation Result
```text
Classification Report:
                precision    recall  f1-score   support

           acq       0.97      0.93      0.95       719
          alum       1.00      0.70      0.82        23
       austdlr       0.00      0.00      0.00         0
        barley       1.00      0.50      0.67        12
           bop       0.79      0.50      0.61        30
           can       0.00      0.00      0.00         0
       carcass       0.67      0.67      0.67        18
         cocoa       1.00      1.00      1.00        18
       coconut       0.00      0.00      0.00         2
   coconut-oil       0.00      0.00      0.00         2
        coffee       0.86      0.89      0.87        27
        copper       1.00      0.78      0.88        18
    copra-cake       0.00      0.00      0.00         1
          corn       0.84      0.87      0.86        55
cornglutenfeed       0.00      0.00      0.00         0
        cotton       0.92      0.67      0.77        18
           cpi       0.86      0.43      0.57        28
           cpu       0.00      0.00      0.00         1
         crude       0.87      0.93      0.90       189
           dfl       0.00
```
Technical Details
Model Information
| Property | Details |
|---|---|
| Model Type | distilbert-finetuned-reuters21578-multilabel |
| Training Data | reuters21578 |
| Metrics | F1, Accuracy |
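The card lists F1 and accuracy as evaluation metrics, but does not show how they were computed during training. A plausible `compute_metrics` hook for the Hugging Face Trainer in a multi-label setup is sketched below; it is an assumption for illustration, not the author's exact code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Sigmoid + 0.5 threshold turns each logit into an independent yes/no per label.
    probs = 1 / (1 + np.exp(-logits))
    preds = (probs >= 0.5).astype(int)
    return {
        "f1": f1_score(labels, preds, average="micro", zero_division=0),
        # Subset accuracy: an example counts as correct only if every label matches.
        "accuracy": accuracy_score(labels, preds),
    }
```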
License
This model is licensed under the Apache-2.0 license.