🚀 t5-small-machine-articles-tag-generation
This is a machine learning model for generating tags for machine-learning-related articles. It is a fine-tuned version of t5-small, trained on a curated subset of the 190k Medium Articles dataset, taking an article's text content as input and generating tags for it. Tag generation is usually framed as a multi-label classification problem; this model instead treats it as a text-to-text generation task (inspired by and referencing fabiochiu/t5-base-tag-generation).
Fine-tuning notebook reference: Hugging Face summarization notebook.
🚀 Quick Start
📦 Installation
pip install transformers nltk
💻 Usage Example
Basic usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')
tokenizer = AutoTokenizer.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
article_text = """
Paige, AI in pathology and genomics
Fundamentally transforming the diagnosis and treatment of cancer
Paige has raised $25M in total. We talked with Leo Grady, its CEO.
How would you describe Paige in a single tweet?
AI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.
How did it all start and why?
Paige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.
What have you achieved so far?
TEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.
PRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.
CUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.
What do you plan to achieve in the next 2 or 3 years?
Commercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer.
"""
# Tokenize the article, truncating to the model's maximum input length
inputs = tokenizer([article_text], max_length=1024, truncation=True, return_tensors="pt")

# Beam-sample up to 128 tokens of comma-separated tags
output = model.generate(
    **inputs,
    num_beams=8,
    do_sample=True,
    min_length=10,
    max_length=128,
)

# Decode the generated sequence and split it into individual tags
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
tags = [tag.strip() for tag in decoded_output.split(",")]
print(tags)
📚 Documentation
Dataset Preparation
Of the roughly 190k articles in the Kaggle dataset, about 12k are machine-learning-based, and their tags are fairly broad. For a system serving a technical blogging platform, more specific tags are useful. The machine-learning articles were therefore filtered out and about 1,000 of them were sampled. These articles were tagged using the GPT-3 API, the generated tags were preprocessed, and only articles with 4 or 5 tags were kept, yielding a final dataset of about 940 articles.
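The tag-count filter described above could be sketched as follows. The field names (`text`, `tags`) and the comma-separated tag format are assumptions for illustration, not details from the released dataset.

```python
def clean_tags(raw_tags):
    """Normalize a comma-separated tag string into a list of
    unique, lowercased, whitespace-stripped tags."""
    seen = []
    for tag in raw_tags.split(","):
        tag = tag.strip().lower()
        if tag and tag not in seen:
            seen.append(tag)
    return seen

def filter_articles(articles, min_tags=4, max_tags=5):
    """Keep only articles whose cleaned tag list has 4 or 5 entries,
    mirroring the selection step described above."""
    kept = []
    for article in articles:
        tags = clean_tags(article["tags"])
        if min_tags <= len(tags) <= max_tags:
            kept.append({"text": article["text"], "tags": tags})
    return kept
```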
Intended Uses and Limitations
The model is intended primarily for generating tags for machine learning articles. It can also be applied to other technical articles, though with reduced accuracy and specificity. The generated output may contain duplicate tags, which should be handled in post-processing.
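Since generation can repeat tags, a minimal order-preserving de-duplication pass (a generic Python idiom, not part of the model's own API) might look like:

```python
def postprocess_tags(decoded_output):
    """Split the generated string on commas, strip whitespace, and
    drop duplicates case-insensitively while preserving order."""
    tags, seen = [], set()
    for tag in decoded_output.split(","):
        tag = tag.strip()
        key = tag.lower()
        if tag and key not in seen:
            seen.add(key)
            tags.append(tag)
    return tags
```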
Results
The model achieves the following results on the evaluation set:
| Metric | Value |
| --- | --- |
| Loss | 1.8786 |
| Rouge1 | 35.5143 |
| Rouge2 | 18.6656 |
| RougeL | 32.7292 |
| RougeLsum | 32.6493 |
| Gen Len | 17.5745 |
Training and Evaluation Data
The dataset of over 940 articles was split into training, validation, and test sets in an 80:10:10 ratio.
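An 80:10:10 split like the one above can be sketched in plain Python. The seed value 42 matches the training seed reported below, but the exact splitting code used by the author is an assumption.

```python
import random

def split_dataset(articles, seed=42):
    """Shuffle a list of articles deterministically and split it into
    80% train, 10% validation, 10% test."""
    articles = list(articles)
    random.Random(seed).shuffle(articles)
    n = len(articles)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = articles[:n_train]
    val = articles[n_train:n_train + n_val]
    test = articles[n_train + n_val:]
    return train, val, test
```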
Training Hyperparameters
The following hyperparameters were used during training:
| Hyperparameter | Value |
| --- | --- |
| Learning rate | 2e-05 |
| Train batch size | 16 |
| Eval batch size | 16 |
| Seed | 42 |
| Optimizer | Adam(betas=(0.9, 0.999), epsilon=1e-08) |
| LR scheduler type | linear |
| Number of epochs | 10 |
| Mixed precision training | Native AMP |
Framework Versions
- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Datasets 2.9.0
- Tokenizers 0.13.2
📄 License
This project is licensed under the Apache-2.0 License.