t5-small-machine-articles-tag-generation open-source model: automatically turns machine learning article content into relevant tags

T5 Small Machine Articles Tag Generation

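Developed by nandakishormpai

A tag generation model for machine learning articles, fine-tuned from T5-small, that automatically converts article content into relevant tags

Text generation

Transformers

Language: English  License: Apache-2.0  #ArticleTagGeneration #TechBlogTagging #FineTunedT5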

Downloads: 2,262

Published: 2/18/2023

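Model overview

This model generates tags for machine-learning-related articles and treats tag generation as a text-to-text generation task. It was fine-tuned on the machine-learning articles in the 190k Medium Articles dataset and can give technical blog platforms more specific tag suggestions.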

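Model features

Text-to-text generation

Treats tag generation as a generation task rather than a classification task, which allows more flexible tag combinations

Domain focus

Optimized specifically for machine learning articles, giving more relevant tags

Multi-tag output

Generates 4 to 5 related tags in one pass, covering several aspects of an article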

Model capabilities

Article tag generation

Technical content analysis

Multi-tag output

Machine learning domain understanding

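Use cases

Content management

Technical blog tag generation

Automatically generates tags for machine-learning-related blog posts

Produces 4 to 5 related tags, such as ['Paige', 'AI in pathology and genomics', 'AI in pathology', 'genomics']

Knowledge organization

Article classification systems

Helps build tag-based article classification and retrieval systems

Provides consistent, relevant tag suggestions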

🚀 t5-small-machine-articles-tag-generation

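This is a machine learning model that generates tags for machine-learning-related articles. It is a fine-tuned version of t5-small, trained on a refined subset of the 190k Medium Articles dataset, and it takes an article's text as input to generate tags. Tag generation is usually formulated as a multi-label classification problem, but this model treats it as a text-to-text generation task (inspiration and reference: fabiochiu/t5-base-tag-generation).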
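
To make the text-to-text framing concrete, here is a minimal sketch of how a tag list could be serialized into a single target string and parsed back. The comma separator matches the splitting done in the usage example; the helper names `tags_to_target` and `target_to_tags` are hypothetical and are not part of the model's published code.

```python
def tags_to_target(tags):
    """Join a list of tags into one comma-separated target string."""
    return ", ".join(tag.strip() for tag in tags)


def target_to_tags(text):
    """Split a generated string back into tags, dropping empty pieces."""
    return [piece.strip() for piece in text.split(",") if piece.strip()]


# Illustrative tags taken from the usage example in this document.
example = ["Paige", "AI in pathology and genomics", "AI in pathology", "genomics"]
target = tags_to_target(example)
print(target)
print(target_to_tags(target) == example)
```

Framing the target as plain text is what lets a sequence-to-sequence model emit tags outside any fixed label set, at the cost of needing a parsing step like this one.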

Fine-tuning notebook reference: Hugging Face summarization notebook.

🚀 Quick start

📦 Installation

pip install transformers nltk

💻 Usage example

Basic usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')

tokenizer = AutoTokenizer.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")

article_text = """
Paige, AI in pathology and genomics

Fundamentally transforming the diagnosis and treatment of cancer
Paige has raised $25M in total. We talked with Leo Grady, its CEO.
How would you describe Paige in a single tweet?
AI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.
How did it all start and why? 
Paige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.
What have you achieved so far?
TEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.
PRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.
CUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.
What do you plan to achieve in the next 2 or 3 years?
Commercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer.
"""

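# Tokenize the article, truncating anything beyond 1024 tokens.
inputs = tokenizer([article_text], max_length=1024, truncation=True, return_tensors="pt")

# Beam search with sampling enabled, so the output can differ between runs.
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
                        max_length=128)

decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# The model emits one comma-separated string; split it into individual tags.
tags = [tag.strip() for tag in decoded_output.split(",")]

print(tags)

# Example output (may vary between runs because do_sample=True):
# ['Paige', 'AI in pathology and genomics', 'AI in pathology', 'genomics']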

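📚 Documentation

Dataset preparation

Of the 190k articles in the Kaggle dataset, roughly 12k are machine-learning related, and their tags are fairly broad. When building a system for a technical blog platform, more specific tags are helpful. We therefore selected the machine-learning articles and sampled about 1,000 of them. These articles were tagged using the GPT-3 API, the generated tags were preprocessed, and only articles with 4 or 5 tags were kept, giving a final dataset of about 940 articles.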
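
The final filtering step can be sketched as follows. This is only an illustration: the record layout (a dict with `title` and `tags` keys), the sample records, and the function name `keep_four_or_five_tags` are assumptions, and the source does not say what the tag preprocessing involved, so that step is not modelled here.

```python
def keep_four_or_five_tags(records):
    """Keep only records whose tag list has exactly 4 or 5 entries."""
    return [record for record in records if len(record["tags"]) in (4, 5)]


# Hypothetical records standing in for GPT-3-tagged articles.
records = [
    {"title": "A", "tags": ["nlp", "transformers", "t5"]},
    {"title": "B", "tags": ["cnn", "vision", "pytorch", "augmentation"]},
    {"title": "C", "tags": ["rl", "policy", "reward", "gym", "ppo"]},
    {"title": "D", "tags": ["a", "b", "c", "d", "e", "f"]},
]

kept = keep_four_or_five_tags(records)
print([record["title"] for record in kept])
```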

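Intended uses and limitations

The model is intended mainly for generating tags for machine learning articles. It can also be used on other technical articles, but accuracy and level of detail may drop. The output may contain duplicate tags, which need to be handled in post-processing.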
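
One possible post-processing step for the duplicate tags mentioned above is an order-preserving de-duplication. The sketch below is an assumption about what such a step might look like, not the author's method; in particular, treating tags that differ only in case or surrounding whitespace as duplicates is a design choice made for this example, and `dedupe_tags` is a hypothetical helper name.

```python
def dedupe_tags(tags):
    """Remove duplicate tags, keeping the first occurrence and its order.

    Tags are compared case-insensitively after stripping whitespace
    (an assumption made for this sketch).
    """
    seen = set()
    unique = []
    for tag in tags:
        cleaned = tag.strip()
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


print(dedupe_tags(["Paige", "AI in pathology", "paige", " AI in pathology ", "genomics"]))
# ['Paige', 'AI in pathology', 'genomics']
```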

Results

The model achieves the following results on the evaluation set:

Metric	Value
Loss	1.8786
Rouge1	35.5143
Rouge2	18.6656
RougeL	32.7292
RougeLsum	32.6493
Gen Len	17.5745
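
For intuition about what Rouge1 measures, here is a simplified unigram-overlap F1 between a generated tag string and a reference tag string. It is a sketch under stated assumptions: lowercasing and whitespace tokenization are choices made here, real ROUGE implementations apply their own tokenization and optional stemming, and the table's values appear to be on a 0 to 100 scale while this function returns a value between 0 and 1. It will not reproduce the reported numbers.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("AI in pathology, genomics", "AI in pathology, cancer"))
```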

Training and evaluation data

The dataset of just over 940 articles was split into training, validation and test sets in an 80:10:10 ratio.
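
An 80:10:10 split can be sketched as below. The source does not say how the split was performed, so the shuffling, the use of seed 42 for it, the round figure of 940 items, and the function name `split_80_10_10` are all assumptions for illustration.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle items reproducibly and split them 80/10/10."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# 940 placeholder article ids; the real dataset is "about 940" articles.
train, val, test = split_80_10_10(range(940))
print(len(train), len(val), len(test))  # 752 94 94
```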

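Training hyperparameters

The following hyperparameters were used during training:

Hyperparameter	Value
Learning rate	2e-05
Train batch size	16
Eval batch size	16
Seed	42
Optimizer	Adam (betas=(0.9, 0.999), epsilon=1e-08)
LR scheduler type	linear
Epochs	10
Mixed-precision training	Native AMP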
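
To show what a linear scheduler implies for these settings, the sketch below computes the learning rate at a given optimizer step. Several things here are assumptions not stated in the source: no warmup, decay to zero, a single device with no gradient accumulation, and a training set of 752 examples (80% of roughly 940). `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, total_steps, base_lr=2e-5):
    """Learning rate after `step` updates under linear decay from base_lr to zero."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


train_examples = 752  # assumed: 80% of roughly 940 articles
batch_size = 16
epochs = 10

# Ceiling division: a partial final batch still counts as one step.
steps_per_epoch = -(-train_examples // batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 470

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the run is short, a few hundred updates, which is consistent with fine-tuning a small model on fewer than a thousand articles.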