t5-small-machine-articles-tag-generation
This is a machine learning model that generates tags for Machine Learning related articles. It is a fine-tuned version of t5-small, trained on a refined version of the 190k Medium Articles dataset, and it takes the textual content of an article as input to generate tags. Instead of treating this as a multi-label classification problem, it approaches tag generation as a text2text generation task (inspired by fabiochiu/t5-base-tag-generation).
Finetuning notebook reference: Hugging Face summarization notebook.
Quick Start
Installation
pip install transformers nltk
Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')
tokenizer = AutoTokenizer.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
article_text = """
Paige, AI in pathology and genomics
Fundamentally transforming the diagnosis and treatment of cancer
Paige has raised $25M in total. We talked with Leo Grady, its CEO.
How would you describe Paige in a single tweet?
AI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.
How did it all start and why?
Paige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.
What have you achieved so far?
TEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.
PRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.
CUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.
What do you plan to achieve in the next 2 or 3 years?
Commercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer.
"""
inputs = tokenizer([article_text], max_length=1024, truncation=True, return_tensors="pt")
# Beam-sample a comma-separated string of tags for the article.
output = model.generate(
    **inputs,
    num_beams=8,
    do_sample=True,
    min_length=10,
    max_length=128,
)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# The model emits the tags as a single comma-separated string; split it into a list.
tags = [tag.strip() for tag in decoded_output.split(",")]
print(tags)
Documentation
Dataset Preparation
Of the roughly 190k articles in the Kaggle dataset, around 12k are Machine Learning articles, and they carry only high-level tags. A tagging system for technical blog platforms needs more specific tags, so the ML articles were filtered and around 1000 of them were sampled. These were tagged using the GPT-3 API, and the generated tags were then preprocessed. Articles that ended up with 4 or 5 tags were selected, giving a final dataset of about 940 articles.
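The exact preprocessing code is not published with this card; the following is a minimal sketch of the tag-cleaning and filtering step, assuming the GPT-3 output is stored as a comma-separated string in a hypothetical gpt3_tags column of a hypothetical CSV file:

import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("ml_articles_with_gpt3_tags.csv")

# Normalize the GPT-3 generated tags: lowercase, strip whitespace, drop empties.
df["tags"] = df["gpt3_tags"].apply(
    lambda s: [t.strip().lower() for t in str(s).split(",") if t.strip()]
)

# Keep only articles that ended up with 4 or 5 tags, as described above.
df = df[df["tags"].apply(len).isin([4, 5])].reset_index(drop=True)

# Training target: the tags joined back into a single comma-separated string.
df["target"] = df["tags"].apply(", ".join)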
Intended uses & limitations
This model is intended primarily for generating tags for Machine Learning articles. It can also be used on other technical articles, but with lower accuracy and less detailed tags. The results may contain duplicate tags, which need to be handled during post-processing.
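The de-duplication is not built into the model; below is a minimal post-processing sketch that removes duplicates from the `tags` list produced by the usage example above, case-insensitively and preserving order:

def postprocess_tags(tags):
    """Drop duplicate tags (case-insensitive) while preserving their order."""
    seen = set()
    unique_tags = []
    for tag in tags:
        key = tag.lower()
        if key and key not in seen:
            seen.add(key)
            unique_tags.append(tag)
    return unique_tags

tags = postprocess_tags(tags)  # e.g. applied to the list from the usage example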
Results
It achieves the following results on the evaluation set (see the metric sketch after the list):
- Loss: 1.8786
- Rouge1: 35.5143
- Rouge2: 18.6656
- RougeL: 32.7292
- RougeLsum: 32.6493
- Gen Len: 17.5745
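These numbers come from the evaluation run; as a reference, here is a minimal sketch of how such ROUGE scores can be computed for generated tag strings with the Hugging Face evaluate library (the example strings below are made up):

# Requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
predictions = ["machine learning, pathology, cancer diagnostics, healthcare ai"]   # model outputs
references = ["artificial intelligence, pathology, cancer diagnosis, healthcare"]  # ground-truth tags

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # keys: rouge1, rouge2, rougeL, rougeLsum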
Training and evaluation data
The dataset of about 940 articles was split into train, validation, and test sets in an 80:10:10 ratio.
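A split along these lines can be reproduced as follows; this is a sketch assuming the prepared articles are stored in a hypothetical ml_articles_tagged.csv with text and target columns:

from datasets import load_dataset

# Illustrative only: hypothetical CSV produced by the preprocessing step above.
dataset = load_dataset("csv", data_files="ml_articles_tagged.csv")["train"]

# 80:10:10 split: carve off 20% for validation + test, then halve that portion.
split = dataset.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))  # roughly 80% / 10% / 10%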
Training hyperparameters
The following hyperparameters were used during training (see the sketch after the list):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
- mixed_precision_training: Native AMP
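For reference, a sketch of Seq2SeqTrainingArguments matching the hyperparameters listed above (the output directory name is arbitrary; Adam with the listed betas/epsilon and the linear schedule are the Transformers defaults):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-small-machine-articles-tag-generation",  # arbitrary name
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",   # default linear schedule
    num_train_epochs=10,
    fp16=True,                    # mixed precision training (native AMP)
    predict_with_generate=True,   # generate tags during evaluation for ROUGE
)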
Framework versions
- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2
License
This project is licensed under the Apache 2.0 license.