roberta-base_topic_classification_nyt_news
This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) on the NYT News dataset, which contains 256,000 news titles from articles published from 2000 to the present (https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present). It classifies news headlines into topics with high accuracy.
Quick Start
Using the Model with Hugging Face
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")

# Build a text-classification pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing."
pipe(text)
# [{'label': 'Sports', 'score': 0.9989326596260071}]
```
Features
- High-accuracy Classification: Achieves high scores in accuracy, F1, precision, and recall on the test set.
- Multiple Topic Coverage: Can classify news into multiple topics such as sports, arts, business, health, etc.
Installation
Install the Hugging Face `transformers` library and a PyTorch backend (for example, `pip install transformers torch`); the versions used for training are listed under Framework Versions below.
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a single news headline
text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing."
result = pipe(text)
print(result)  # [{'label': 'Sports', 'score': ...}]
```
Advanced Usage
```python
# Classify several headlines in one call; the pipeline returns one result per input
texts = [
    "Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing.",
    "Although many individuals are doing fever checks to screen for Covid-19, many Covid-19 patients never have a fever.",
    "Twelve myths about Russia's War in Ukraine exposed",
]

results = pipe(texts)
for res in results:
    print(res)
```
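If you need scores for every topic rather than only the top label, recent versions of the text-classification pipeline accept a `top_k` argument (older releases used `return_all_scores=True` instead); a minimal sketch, continuing from the examples above:

```python
# Return scores for all classes instead of only the best one
all_scores = pipe(text, top_k=None)
print(all_scores)
# e.g. [{'label': 'Sports', 'score': 0.99...}, {'label': 'Politics', 'score': ...}, ...]
```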
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) for text classification |
| Training Data | NYT News dataset (https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present) |
Training Data Classification
| Class | Description |
|-------|-------------|
| 0 | Sports |
| 1 | Arts, Culture, and Entertainment |
| 2 | Business and Finance |
| 3 | Health and Wellness |
| 4 | Lifestyle and Fashion |
| 5 | Science and Technology |
| 6 | Politics |
| 7 | Crime |
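The checkpoint's configuration should expose this id-to-label mapping, so predicted class ids can be translated back to topic names without hard-coding the table above. A minimal sketch, assuming the standard `id2label` field is populated (the example headline is illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "dstefa/roberta-base_topic_classification_nyt_news"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Run a forward pass and map the argmax class id back to its topic name
inputs = tokenizer("Stocks rally as inflation cools", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = logits.argmax(dim=-1).item()
print(pred_id, model.config.id2label[pred_id])
```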
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 5
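These settings map roughly onto Hugging Face `TrainingArguments` as in the sketch below; the output directory is an illustrative assumption, and dataset loading and `Trainer` wiring from the original training script are omitted:

```python
from transformers import TrainingArguments

# Approximate reconstruction of the reported hyperparameters (illustrative only)
training_args = TrainingArguments(
    output_dir="roberta-base_topic_classification_nyt_news",  # assumed path
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=5,
)
```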
Training Results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
|---------------|-------|--------|-----------------|----------|--------|-----------|--------|
| 0.3192 | 1.0 | 20480 | 0.4078 | 0.8865 | 0.8859 | 0.8892 | 0.8865 |
| 0.2863 | 2.0 | 40960 | 0.4271 | 0.8972 | 0.8970 | 0.8982 | 0.8972 |
| 0.1979 | 3.0 | 61440 | 0.3797 | 0.9094 | 0.9092 | 0.9098 | 0.9094 |
| 0.1239 | 4.0 | 81920 | 0.3981 | 0.9117 | 0.9113 | 0.9114 | 0.9117 |
| 0.1472 | 5.0 | 102400 | 0.4033 | 0.9137 | 0.9135 | 0.9134 | 0.9137 |
Model Performance
| Topic | Precision | Recall | F1 | Support |
|-------|-----------|--------|------|---------|
| Sports | 0.97 | 0.98 | 0.97 | 6400 |
| Arts, Culture, and Entertainment | 0.94 | 0.95 | 0.94 | 6400 |
| Business and Finance | 0.85 | 0.84 | 0.84 | 6400 |
| Health and Wellness | 0.90 | 0.93 | 0.91 | 6400 |
| Lifestyle and Fashion | 0.95 | 0.95 | 0.95 | 6400 |
| Science and Technology | 0.89 | 0.83 | 0.86 | 6400 |
| Politics | 0.93 | 0.88 | 0.90 | 6400 |
| Crime | 0.85 | 0.93 | 0.89 | 6400 |
| Accuracy | | | 0.91 | 51200 |
| Macro Avg | 0.91 | 0.91 | 0.91 | 51200 |
| Weighted Avg | 0.91 | 0.91 | 0.91 | 51200 |
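A per-class report in this format can be reproduced on a held-out test split with scikit-learn, assuming scikit-learn is installed and you have gold labels and the pipeline's predicted topic names; a minimal sketch with placeholder data:

```python
from sklearn.metrics import classification_report

# y_true and y_pred are lists of topic names for the test set (placeholder values shown)
y_true = ["Sports", "Politics", "Crime"]
y_pred = ["Sports", "Politics", "Health and Wellness"]

print(classification_report(y_true, y_pred, zero_division=0))
```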
Framework Versions
- Transformers 4.32.1
- Pytorch 2.1.0+cu121
- Datasets 2.12.0
- Tokenizers 0.13.2
Technical Details
The model is fine-tuned from roberta-base. Training uses the Adam optimizer with a linear learning-rate scheduler and 500 warmup steps, which helps the model converge and reach the classification accuracy reported above on the NYT News dataset.
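To reproduce this schedule outside of the `Trainer`, the setup described above corresponds roughly to the sketch below; note that the Hugging Face `Trainer` defaults to AdamW rather than plain Adam, and the total step count is taken from the results table:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Assumes `model` is the loaded sequence-classification model (illustrative)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-05, betas=(0.9, 0.999), eps=1e-08)
num_training_steps = 102400  # 5 epochs x 20480 steps per epoch, per the training results
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=num_training_steps
)
```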
License
This project is licensed under the MIT license.