roberta-base_topic_classification_nyt_news
This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) on the NYT News dataset, which contains 256,000 news titles from articles published from 2000 to the present (https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present). It classifies news headlines into topics with high accuracy.
Quick Start
Using the Model with Hugging Face
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")

# Build a text-classification pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing."
pipe(text)
# [{'label': 'Sports', 'score': 0.9989326596260071}]
```
Features
- High-accuracy Classification: Achieves high scores in accuracy, F1, precision, and recall on the test set.
- Multiple Topic Coverage: Can classify news into multiple topics such as sports, arts, business, health, etc.
Installation
Install the Hugging Face `transformers` library and a PyTorch backend (for example, `pip install transformers torch`); the versions used for training are listed under Framework Versions below.
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Classify a single news headline
text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing."
result = pipe(text)
print(result)  # [{'label': 'Sports', 'score': ...}]
```
Advanced Usage
```python
# Classify several headlines in one call; the pipeline returns one result per input
texts = [
    "Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing.",
    "Although many individuals are doing fever checks to screen for Covid-19, many Covid-19 patients never have a fever.",
    "Twelve myths about Russia's War in Ukraine exposed",
]

results = pipe(texts)
for res in results:
    print(res)
```
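If you need scores for every topic rather than only the top label, recent versions of the text-classification pipeline accept a `top_k` argument (older releases used `return_all_scores=True` instead); a minimal sketch, continuing from the examples above:

```python
# Return scores for all classes instead of only the best one
all_scores = pipe(text, top_k=None)
print(all_scores)
# e.g. [{'label': 'Sports', 'score': 0.99...}, {'label': 'Politics', 'score': ...}, ...]
```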
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) for text classification |
| Training Data | NYT News dataset (https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present) |
Training Data Classification
| Class | Description |
|-------|-------------|
| 0 | Sports |
| 1 | Arts, Culture, and Entertainment |
| 2 | Business and Finance |
| 3 | Health and Wellness |
| 4 | Lifestyle and Fashion |
| 5 | Science and Technology |
| 6 | Politics |
| 7 | Crime |
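The checkpoint's configuration should expose this id-to-label mapping, so predicted class ids can be translated back to topic names without hard-coding the table above. A minimal sketch, assuming the standard `id2label` field is populated (the example headline is illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "dstefa/roberta-base_topic_classification_nyt_news"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Run a forward pass and map the argmax class id back to its topic name
inputs = tokenizer("Stocks rally as inflation cools", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = logits.argmax(dim=-1).item()
print(pred_id, model.config.id2label[pred_id])
```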
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 5
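These settings map roughly onto Hugging Face `TrainingArguments` as in the sketch below; the output directory is an illustrative assumption, and dataset loading and `Trainer` wiring from the original training script are omitted:

```python
from transformers import TrainingArguments

# Approximate reconstruction of the reported hyperparameters (illustrative only)
training_args = TrainingArguments(
    output_dir="roberta-base_topic_classification_nyt_news",  # assumed path
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=5,
)
```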
Training Results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
|---------------|-------|--------|-----------------|----------|--------|-----------|--------|
| 0.3192 | 1.0 | 20480 | 0.4078 | 0.8865 | 0.8859 | 0.8892 | 0.8865 |
| 0.2863 | 2.0 | 40960 | 0.4271 | 0.8972 | 0.8970 | 0.8982 | 0.8972 |
| 0.1979 | 3.0 | 61440 | 0.3797 | 0.9094 | 0.9092 | 0.9098 | 0.9094 |
| 0.1239 | 4.0 | 81920 | 0.3981 | 0.9117 | 0.9113 | 0.9114 | 0.9117 |
| 0.1472 | 5.0 | 102400 | 0.4033 | 0.9137 | 0.9135 | 0.9134 | 0.9137 |
Model Performance
| Topic | Precision | Recall | F1 | Support |
|-------|-----------|--------|------|---------|
| Sports | 0.97 | 0.98 | 0.97 | 6400 |
| Arts, Culture, and Entertainment | 0.94 | 0.95 | 0.94 | 6400 |
| Business and Finance | 0.85 | 0.84 | 0.84 | 6400 |
| Health and Wellness | 0.90 | 0.93 | 0.91 | 6400 |
| Lifestyle and Fashion | 0.95 | 0.95 | 0.95 | 6400 |
| Science and Technology | 0.89 | 0.83 | 0.86 | 6400 |
| Politics | 0.93 | 0.88 | 0.90 | 6400 |
| Crime | 0.85 | 0.93 | 0.89 | 6400 |
| Accuracy | | | 0.91 | 51200 |
| Macro Avg | 0.91 | 0.91 | 0.91 | 51200 |
| Weighted Avg | 0.91 | 0.91 | 0.91 | 51200 |
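A per-class report in this format can be reproduced on a held-out test split with scikit-learn, assuming scikit-learn is installed and you have gold labels and the pipeline's predicted topic names; a minimal sketch with placeholder data:

```python
from sklearn.metrics import classification_report

# y_true and y_pred are lists of topic names for the test set (placeholder values shown)
y_true = ["Sports", "Politics", "Crime"]
y_pred = ["Sports", "Politics", "Health and Wellness"]

print(classification_report(y_true, y_pred, zero_division=0))
```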
Framework Versions
- Transformers 4.32.1
- Pytorch 2.1.0+cu121
- Datasets 2.12.0
- Tokenizers 0.13.2
Technical Details
The model is fine-tuned from roberta-base. Training uses the Adam optimizer with a linear learning-rate scheduler and 500 warmup steps, which helps the model converge and reach the classification accuracy reported above on the NYT News dataset.
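To reproduce this schedule outside of the `Trainer`, the setup described above corresponds roughly to the sketch below; note that the Hugging Face `Trainer` defaults to AdamW rather than plain Adam, and the total step count is taken from the results table:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Assumes `model` is the loaded sequence-classification model (illustrative)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-05, betas=(0.9, 0.999), eps=1e-08)
num_training_steps = 102400  # 5 epochs x 20480 steps per epoch, per the training results
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=num_training_steps
)
```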
License
This project is licensed under the MIT license.