t5-small-machine-articles-tag-generation open-source model: automatically converts machine learning article content into relevant tags

T5 Small Machine Articles Tag Generation

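Developed by nandakishormpai

A tag generation model for machine learning articles, fine-tuned from T5-small, that automatically converts article content into relevant tags.

Text generation

Transformers

Language: English · Open-source license: Apache-2.0 · #article-tag-generation #tech-blog-tagging #T5-fine-tuned-model

Downloads: 2,262

Release date: 2/18/2023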

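Model overview

This model specializes in generating tags for machine learning articles and treats tag generation as a text-to-text generation task. It was fine-tuned on machine learning articles extracted from a dataset of 190k Medium articles, and it can offer more specific tag suggestions for technical blogging platforms.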

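Model features

Text-to-text generation

Treats tag generation as a generation task rather than a classification task, which allows more flexible tag combinations.

Domain-specific

Optimized for articles in the machine learning domain, so the generated tags are highly relevant.

Multi-tag output

Generates 4-5 related tags at a time, covering multiple aspects of an article.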

Model capabilities

Article tag generation

Technical content analysis

Multi-tag output

Understanding of the machine learning domain

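Use cases

Content management

Tag generation for technical blogs

Automatically generates tags for machine learning blog posts.

Generates 4-5 related tags (e.g., ['Paige', 'AI in pathology and genomics', 'AI in pathology', 'genomics']).

Knowledge organization

Article classification systems

Supports building tag-based article classification and search systems.

Provides consistent, relevant tag suggestions.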
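A tag-based classification and search system of the kind mentioned above could, for instance, rest on an inverted index from tags to articles. The following is only a minimal sketch: the function names `build_tag_index` and `search_by_tags`, the article IDs, and the lowercase normalization are all assumptions made for illustration and are not part of the model or its documentation.

```python
from collections import defaultdict


def build_tag_index(article_tags):
    """Map each normalized tag to the set of article IDs that carry it."""
    index = defaultdict(set)
    for article_id, tags in article_tags.items():
        for tag in tags:
            # Assumption: tags are matched case-insensitively.
            index[tag.strip().lower()].add(article_id)
    return index


def search_by_tags(index, query_tags):
    """Return the sorted IDs of articles that carry every tag in query_tags."""
    tag_sets = [index.get(tag.strip().lower(), set()) for tag in query_tags]
    if not tag_sets:
        return []
    return sorted(set.intersection(*tag_sets))


# Hypothetical article IDs and tags, for illustration only.
article_tags = {
    "post-1": ["Paige", "AI in pathology", "genomics"],
    "post-2": ["genomics", "biomarkers"],
    "post-3": ["AI in pathology", "FDA clearance"],
}
index = build_tag_index(article_tags)
print(search_by_tags(index, ["genomics"]))
print(search_by_tags(index, ["AI in pathology", "Genomics"]))
```

In practice the tag lists would come from the model's output for each article; normalizing tags before indexing keeps near-identical tags from splitting into separate entries.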

🚀 t5-small-machine-articles-tag-generation

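This is a machine learning model that generates tags for machine learning articles. It is a fine-tuned version of t5-small, trained on a refined version of the 190k Medium Articles dataset to generate tags for machine learning articles from the article text. Tag generation is usually formulated as a multi-label classification problem, but this model treats it as a text generation task (reference: fabiochiu/t5-base-tag-generation).

Fine-tuning notebook reference: Hugging Face summarization notebook.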
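To make the text-generation framing concrete: instead of predicting a fixed label vector, a sequence-to-sequence model is trained to emit the tags as one string. The sketch below shows one plausible way to serialize a tag list into a target string and parse it back. The comma separator is an assumption (it matches how the usage example splits the model output), and the helper names `tags_to_target` and `target_to_tags` are hypothetical, not taken from the model's training code.

```python
# Hypothetical helpers, not taken from the model's training code.
SEPARATOR = ", "  # assumed delimiter between tags in the target text


def tags_to_target(tags):
    """Serialize a list of tags into one target string for a seq2seq model."""
    return SEPARATOR.join(tag.strip() for tag in tags)


def target_to_tags(text):
    """Parse generated text back into a list of tags, dropping empty pieces."""
    return [piece.strip() for piece in text.split(",") if piece.strip()]


example_tags = ["Paige", "AI in pathology and genomics", "AI in pathology", "genomics"]
target = tags_to_target(example_tags)
print(target)
print(target_to_tags(target))
```

Because the output is free text, the model is not limited to a predefined label set, at the cost of needing a parsing step like the one above.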

🚀 Quick start

📦 Installation

pip install transformers nltk

💻 Usage example

Basic usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
nltk.download('punkt')

tokenizer = AutoTokenizer.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")
model = AutoModelForSeq2SeqLM.from_pretrained("nandakishormpai/t5-small-machine-articles-tag-generation")

article_text = """
Paige, AI in pathology and genomics

Fundamentally transforming the diagnosis and treatment of cancer
Paige has raised $25M in total. We talked with Leo Grady, its CEO.
How would you describe Paige in a single tweet?
AI in pathology and genomics will fundamentally transform the diagnosis and treatment of cancer.
How did it all start and why? 
Paige was founded out of Memorial Sloan Kettering to bring technology that was developed there to doctors and patients worldwide. For over a decade, Thomas Fuchs and his colleagues have developed a new, powerful technology for pathology. This technology can improve cancer diagnostics, driving better patient care at lower cost. Paige is building clinical products from this technology and extending the technology to the development of new biomarkers for the biopharma industry.
What have you achieved so far?
TEAM: In the past year and a half, Paige has built a team with members experienced in AI, entrepreneurship, design and commercialization of clinical software.
PRODUCT: We have achieved FDA breakthrough designation for the first product we plan to launch, a testament to the impact our technology will have in this market.
CUSTOMERS: None yet, as we are working on CE and FDA regulatory clearances. We are working with several biopharma companies.
What do you plan to achieve in the next 2 or 3 years?
Commercialization of multiple clinical products for pathologists, as well as the development of novel biomarkers that can help speed up and better inform the diagnosis and treatment selection for patients with cancer.
"""

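# Tokenize the article, truncating inputs longer than 1024 tokens
inputs = tokenizer([article_text], max_length=1024, truncation=True, return_tensors="pt")
# Beam search combined with sampling, so the generated tags can vary between runs
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10,
                        max_length=128)

# Decode the generated token IDs into a single comma-separated string
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# Split on commas and trim whitespace to get the list of tags
tags = [tag.strip() for tag in decoded_output.split(",")]

print(tags)

# Example output (sampling is enabled, so results may differ):
# ['Paige', 'AI in pathology and genomics', 'AI in pathology', 'genomics']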

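📚 Documentation

Dataset preparation

Of the 190k articles in the Kaggle dataset, about 12k are about machine learning, and their tags were fairly broad. When building a system for a technical blogging platform, it is useful to generate more specific tags. The ML articles were extracted and about 1,000 of them were sampled. These were tagged using the GPT-3 API, the generated tags were preprocessed, and only articles with 4 or 5 tags were kept, yielding a final dataset of about 940 articles.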
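The final selection step described above (preprocess the generated tags, then keep articles with 4 or 5 tags) might look roughly like the sketch below. The source does not say what the preprocessing consisted of, so the cleaning shown here (trimming whitespace, dropping empty tags, removing case-insensitive duplicates) is an assumption, as are the function names, the record layout, and the sample records.

```python
def clean_tags(raw_tags):
    """Trim whitespace, drop empty tags, and remove case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for tag in raw_tags:
        tag = tag.strip()
        key = tag.lower()
        if tag and key not in seen:
            seen.add(key)
            cleaned.append(tag)
    return cleaned


def select_articles(articles, min_tags=4, max_tags=5):
    """Keep only articles whose cleaned tag list has min_tags..max_tags entries."""
    selected = []
    for article in articles:
        tags = clean_tags(article["tags"])
        if min_tags <= len(tags) <= max_tags:
            selected.append({"text": article["text"], "tags": tags})
    return selected


# Hypothetical records standing in for GPT-3-tagged articles.
articles = [
    {"text": "Article A", "tags": ["NLP", "Transformers", "T5", "Fine-tuning"]},
    {"text": "Article B", "tags": ["Computer Vision", "CNN", "cnn ", ""]},
    {"text": "Article C", "tags": ["MLOps", "Deployment", "Docker",
                                   "Monitoring", "CI/CD", "Kubernetes"]},
]
dataset = select_articles(articles)
print([a["text"] for a in dataset])
```

Here only the first record survives: the second has two distinct tags after cleaning and the third has six, so both fall outside the 4-5 range.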