🚀 t5-small-machine-articles-tag-generation
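This is a machine learning model that generates tags for machine learning articles. It is a fine-tuned version of t5-small, trained on a refined subset of the 190k Medium Articles dataset, and it takes the text of an article as input and generates tags for it. Tag generation is usually formulated as a multi-label classification problem, but this model treats it as a text-to-text generation task (inspiration and reference: fabiochiu/t5-base-tag-generation).
Fine-tuning notebook reference: Hugging Face summarization notebook.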
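To make the text-to-text framing concrete, the sketch below shows how a list of tags could be turned into a single target string and parsed back. The helper names `tags_to_target` and `target_to_tags` and the `", "` separator are assumptions for illustration; the only detail taken from this card is that the generated output is split on commas.

```python
def tags_to_target(tags):
    """Join a list of tags into one string a seq2seq model could be trained to emit.

    The ", " separator is an assumption made for this sketch.
    """
    return ", ".join(tag.strip() for tag in tags if tag.strip())


def target_to_tags(text):
    """Split a generated comma-separated string back into individual tags."""
    return [tag.strip() for tag in text.split(",") if tag.strip()]


example_tags = ["Machine Learning", "Healthcare", "Pathology", "Startups"]
target = tags_to_target(example_tags)
print(target)                  # Machine Learning, Healthcare, Pathology, Startups
print(target_to_tags(target))  # the original list, recovered from the string
```

Framing the task this way lets the model emit tags it never saw as fixed classes, which a multi-label classifier with a closed label set cannot do.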
🚀 Quick Start
📦 Installation
pip install transformers nltk
💻 Usage Example
Basic Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
# Download the NLTK sentence tokenizer data (not called directly in this snippet)
nltk.download('punkt')
# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
article_text = """
Paige, AI in pathology and genomics
Fundamentally transforming the diagnosis and treatment of cancer
Paige has raised $25M in total. We talked with Leo Grady, its CEO.
How would you describe Paige in a single tweet?
AI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.
How did it all start and why?
Paige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.
What have you achieved so far?
TEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.
PRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.
CUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.
What do you plan to achieve in the next 2 or 3 years?
Commercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer.
"""
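# Tokenize the article, truncating anything beyond 1024 tokens
inputs = tokenizer([article_text], max_length=1024, truncation=True, return_tensors="pt")
# Beam search combined with sampling, so the generated tags can differ between runs
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
                        max_length=128)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
# The model emits one comma-separated string; split it into individual tags
tags = [tag.strip() for tag in decoded_output.split(",")]
print(tags)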
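📚 Documentation
Dataset Preparation
Of the 190k articles in the Kaggle dataset, around 12k are about machine learning, and their tags are fairly high level. More specific tags are useful when building a system for a technical blogging platform, so the machine learning articles were filtered out and around 1,000 of them were sampled. The GPT-3 API was used to tag these articles, and the generated tags were then preprocessed so that only articles with 4 or 5 tags were kept, giving a final dataset of around 940 articles.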
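The final selection step described above (keeping only articles with 4 or 5 tags) might look roughly like the following. This is a minimal sketch, not the author's preprocessing code: the record layout with `text` and `raw_tags` keys, the helper names, and the sample records are all hypothetical.

```python
def parse_tags(raw):
    """Split a raw comma-separated tag string and drop empty entries."""
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


def select_articles(records, allowed_counts=(4, 5)):
    """Keep only records whose parsed tag list has an allowed length."""
    selected = []
    for record in records:
        tags = parse_tags(record["raw_tags"])
        if len(tags) in allowed_counts:
            selected.append({"text": record["text"], "tags": tags})
    return selected


# Hypothetical records standing in for GPT-3-tagged articles
records = [
    {"text": "article a", "raw_tags": "NLP, Transformers, BERT, Fine-tuning"},
    {"text": "article b", "raw_tags": "Deep Learning, CNN, Vision"},
    {"text": "article c", "raw_tags": "MLOps, Docker, CI/CD, Monitoring, Kubernetes"},
]
dataset = select_articles(records)
print(len(dataset))  # 2: article b has only 3 tags and is dropped
```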
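Intended Uses & Limitations
The model is intended mainly for generating tags for machine learning articles. It can also be used for other technical articles, but the accuracy and level of detail may be lower. The output may contain duplicate tags, which need to be handled when post-processing the results.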
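One way to handle the duplicate tags mentioned above is an order-preserving, case-insensitive de-duplication pass over the decoded string. The function name `dedupe_tags` and the choice to treat tags that differ only in case as duplicates are assumptions for this sketch, not part of the model's own pipeline.

```python
def dedupe_tags(decoded_output):
    """Split a comma-separated string into tags, dropping blanks and repeats.

    Tags are compared case-insensitively; the first spelling seen is kept.
    """
    seen = set()
    unique = []
    for tag in decoded_output.split(","):
        tag = tag.strip()
        key = tag.casefold()
        if tag and key not in seen:
            seen.add(key)
            unique.append(tag)
    return unique


print(dedupe_tags("AI, Pathology, ai, Cancer Diagnosis, Pathology"))
# ['AI', 'Pathology', 'Cancer Diagnosis']
```

Applied to the usage example, this would replace the final list comprehension with `tags = dedupe_tags(decoded_output)`.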
Results
The model achieves the following results on the evaluation set:
| Metric | Value |
|---|---|
| Loss | 1.8786 |
| Rouge1 | 35.5143 |
| Rouge2 | 18.6656 |
| Rougel | 32.7292 |
| Rougelsum | 32.6493 |
| Gen Len | 17.5745 |
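For intuition about what the Rouge1 figure measures, here is a simplified ROUGE-1 F1 computed from unigram overlap between a generated tag string and a reference. It is only a sketch: it skips the stemming and other details of a full ROUGE implementation, so it will not reproduce the reported numbers, and the example strings are made up. The values in the table appear to be on a 0 to 100 scale, so the score is multiplied by 100 when printed.

```python
import re
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "machine learning, pathology, cancer",
    "machine learning, healthcare, cancer",
)
print(round(score * 100, 2))  # 75.0 (3 of 4 unigrams match on each side)
```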
Training and Evaluation Data
The dataset of just over 940 articles was split into training, validation, and test sets in an 80:10:10 ratio.
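An 80:10:10 split like the one described above can be sketched in plain Python as follows. The shuffling strategy, the seed of 42 (borrowed from the training seed listed on this card), and the exact count of 940 items are assumptions; the card does not say how the split was actually produced.

```python
import random


def split_dataset(items, seed=42, percents=(80, 10, 10)):
    """Shuffle items and cut them into train/validation/test by percentage."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percents[0] // 100
    n_val = len(shuffled) * percents[1] // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = split_dataset(range(940))
print(len(train), len(val), len(test))  # 752 94 94
```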
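Training Hyperparameters
The following hyperparameters were used during training:
| Hyperparameter | Value |
|---|---|
| Learning rate | 2e-05 |
| Train batch size | 16 |
| Eval batch size | 16 |
| Seed | 42 |
| Optimizer | Adam (betas=(0.9, 0.999), epsilon=1e-08) |
| LR scheduler type | linear |
| Number of epochs | 10 |
| Mixed precision training | Native AMP |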
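As a worked example of how these hyperparameters interact, the sketch below estimates the number of optimizer steps and the learning rate under a linear schedule. It assumes roughly 752 training examples (80% of about 940), a single device, no gradient accumulation, and no warmup; none of these are stated on this card, so treat the numbers as illustrative.

```python
import math


def linear_lr(step, total_steps, base_lr=2e-5):
    """Learning rate after `step` updates, decaying linearly from base_lr to 0."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


train_examples = 752  # assumption: 80% of ~940 articles
batch_size = 16
epochs = 10

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 47 470

# Learning rate at the start, midpoint, and end of training
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the learning rate starts at 2e-05, is about half that value midway through training, and reaches zero at the final step.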
Framework Versions
- Transformers 4.26.1
- PyTorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2
📄 License
This project is licensed under the Apache-2.0 license.