distilbert-reuters21578: an open-source news classification model you can deploy for free to classify English news by topic

Distilbert Reuters21578

Developed by tarekziade

A DistilBERT-based multi-label news classification model for Reuters-21578. It is fine-tuned on the ModApte split of the dataset and is suited to topic classification of English news.

Text Classification

Transformers

English | Open-source license: Apache-2.0 | #multi-label news classification #precision-first #Reuters dataset

Downloads: 30

Released: 12/17/2023

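Model Overview

This model is a DistilBERT variant fine-tuned on the Reuters-21578 dataset. It is designed for multi-label text classification and can identify multiple relevant topics in a single news article.

Model Features

Lightweight and efficient

Built on the DistilBERT architecture, which substantially reduces model size while retaining relatively high performance

Multi-label classification

Can predict several relevant topic labels for a news article at once

Precision first

Designed to favor precision over recall, which suits applications that need high-precision output

Model Capabilities

English news classification

Multi-label prediction

Topic identification

Use Cases

News analysis

News topic tagging

Automatically assigns relevant topic tags to news articles

Achieves a micro-averaged F1 score of 0.86 on the Reuters-21578 test set

Content classification systems

Building automatic classification modules for news content management systems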

🚀 DistilBERT Fine-tuned Reuters21578 Multilabel Model

This model is distilbert-base-cased fine-tuned on the Reuters-21578 dataset for multi-label news classification.

🚀 Quick Start

This model is a fork of https://huggingface.co/lxyuan/distilbert-finetuned-reuters21578-multilabel, with an ONNX version generated under /onnx.

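✨ Key Features

Optimized for multi-label classification on the Reuters-21578 dataset.
The dataset is frequently used in take-home tests for job interviews, so working with it simulates a realistic challenge.
It lets you practice a range of skills, including preprocessing, feature extraction, and model evaluation.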

📦 Installation

This model requires the transformers library, which can be installed with the following command.

pip install transformers

💻 Usage Examples

Basic usage

from transformers import pipeline

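# return_all_scores=True makes the pipeline return a score for every label.
# Recent transformers releases deprecate it in favor of top_k=None.
pipe = pipeline("text-classification", model="lxyuan/distilbert-finetuned-reuters21578-multilabel", return_all_scores=True)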

# dataset["test"]["text"][2]
news_article = (
    "JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWARDS The Ministry of International Trade and "
    "Industry (MITI) will revise its long-term energy supply/demand "
    "outlook by August to meet a forecast downtrend in Japanese "
    "energy demand, ministry officials said. "
    "MITI is expected to lower the projection for primary energy "
    "supplies in the year 2000 to 550 mln kilolitres (kl) from 600 "
    "mln, they said. "
    "The decision follows the emergence of structural changes in "
    "Japanese industry following the rise in the value of the yen "
    "and a decline in domestic electric power demand. "
    "MITI is planning to work out a revised energy supply/demand "
    "outlook through deliberations of committee meetings of the "
    "Agency of Natural Resources and Energy, the officials said. "
    "They said MITI will also review the breakdown of energy "
    "supply sources, including oil, nuclear, coal and natural gas. "
    "Nuclear energy provided the bulk of Japan's electric power "
    "in the fiscal year ended March 31, supplying an estimated 27 "
    "pct on a kilowatt/hour basis, followed by oil (23 pct) and "
    "liquefied natural gas (21 pct), they noted. "
    "REUTER"
)

# dataset["test"]["topics"][2]
target_topics = ['crude', 'nat-gas']

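# Pad/truncate to 512 tokens and apply a per-label sigmoid (multi-label scores).
fn_kwargs = {"padding": "max_length", "truncation": True, "max_length": 512}
output = pipe(news_article, function_to_apply="sigmoid", **fn_kwargs)

# Keep every label whose score reaches the 0.5 threshold.
for item in output[0]:
    if item["score"] >= 0.5:
        print(item["label"], item["score"])

# Output:
# crude 0.7355073690414429
# nat-gas 0.8600426316261292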
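
The loop above keeps every label whose sigmoid score is at least 0.5. As a minimal sketch of that step without the pipeline, the snippet below applies a sigmoid to raw logits and filters by threshold. The helper names `sigmoid` and `select_labels` and the logit values are hypothetical, chosen only for illustration; they are not part of the transformers API.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1), in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def select_labels(logits: dict, threshold: float = 0.5) -> dict:
    """Return {label: score} for labels whose sigmoid score reaches the threshold."""
    scores = {label: sigmoid(logit) for label, logit in logits.items()}
    return {label: score for label, score in scores.items() if score >= threshold}


# Hypothetical logits, not real model output.
toy_logits = {"crude": 1.0, "nat-gas": 1.8, "earn": -4.0}
selected = select_labels(toy_logits)
print(sorted(selected))  # ['crude', 'nat-gas']
```

Because each label gets its own sigmoid, the scores do not sum to one, and any number of labels (including none) can pass the threshold.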

📚 Documentation

Overview and comparison table

Metric	Scikit-learn baseline	Transformer model
Micro-averaged F1	0.77	0.86
Macro-averaged F1	0.29	0.33
Weighted-average F1	0.70	0.84
Samples-average F1	0.75	0.80

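Precision and recall: Both models favor precision over recall. For a client-facing news classification model, precision takes priority because false positives are more harmful than false negatives and are harder to explain to clients.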
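
To illustrate the trade-off described above, here is a small sketch that computes precision and recall for a single label at two decision thresholds. The labels and scores are made-up toy data, and `precision_recall` is a hypothetical helper written for this example, not a function from the model's code.

```python
def precision_recall(y_true, y_score, threshold):
    """Precision and recall for one label at a given decision threshold."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for t, s in pairs if t == 1 and s >= threshold)
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
    fn = sum(1 for t, s in pairs if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy ground truth (1 = label applies) and toy sigmoid scores.
y_true = [1, 1, 1, 0, 0, 1]
y_score = [0.9, 0.8, 0.6, 0.55, 0.2, 0.4]

for threshold in (0.5, 0.7):
    p, r = precision_recall(y_true, y_score, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.7 removes the only false positive but also drops one true positive, so precision rises while recall falls.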

Handling class imbalance: Both models share the common problem of weak performance on minority classes. The transformer model does show a small improvement in macro-averaged F1 (0.33 vs. 0.29).
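
The gap between the micro and macro averages follows from how they are computed: micro-averaging pools the counts of all labels, while macro-averaging gives every label equal weight, so rare labels with an F1 of zero pull it down. The sketch below shows this on hypothetical per-label counts; the numbers and the `rare-label` name are invented for illustration and are not taken from the evaluation.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical (tp, fp, fn) counts per label: two frequent labels, one rare one.
counts = {
    "earn": (95, 3, 5),
    "acq": (60, 4, 6),
    "rare-label": (0, 0, 3),
}

# Macro: average the per-label F1 scores, each label weighted equally.
macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts over all labels, then compute a single F1.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(f"micro F1 = {micro_f1:.2f}, macro F1 = {macro_f1:.2f}")
```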

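Zero-support labels: Both models report some labels with a support of zero. A support of zero means that no test example carries the label, so its precision, recall, and F1 all appear as 0.00 in the report. This suggests either that the models are not well tuned to predict minority classes or that the dataset itself contains too few examples of these classes.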
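
One way to spot such labels before evaluation is to compare the label sets of the two splits. The sketch below is a minimal example on toy topic lists; `zero_support_labels` is a hypothetical helper, and the toy data only mimics the shape of the `topics` column.

```python
from collections import Counter
from itertools import chain


def zero_support_labels(train_topics, test_topics):
    """Return labels seen in training that no test example carries."""
    test_counts = Counter(chain.from_iterable(test_topics))
    train_labels = set(chain.from_iterable(train_topics))
    return sorted(label for label in train_labels if test_counts[label] == 0)


# Toy topic lists, one list of labels per article.
train_topics = [["crude", "nat-gas"], ["earn"], ["can", "money-fx"], ["earn"]]
test_topics = [["crude"], ["earn"], ["money-fx", "nat-gas"]]

print(zero_support_labels(train_topics, test_topics))  # ['can']
```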

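Overall performance: The transformer model beats the Scikit-learn baseline on weighted-average and samples-average F1, indicating better overall performance and better handling of label imbalance.

Conclusion: Both models show high precision, and the transformer model outperforms the Scikit-learn baseline on every metric considered, by 0.04 to 0.14 F1. It balances precision and recall better and handles minority classes slightly better.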

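Training and evaluation data

The following code removes samples that carry single-appearance labels (labels that occur only once in the training set) from both the training and test sets.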

# Find Single Appearance Labels
from itertools import chain
from collections import Counter
from datasets import load_dataset

def find_single_appearance_labels(y):
    """Find labels that appear only once in the dataset."""
    all_labels = list(chain.from_iterable(y))
    label_count = Counter(all_labels)
    single_appearance_labels = [label for label, count in label_count.items() if count == 1]
    return single_appearance_labels

# Remove Single Appearance Labels from Dataset
def remove_single_appearance_labels(dataset, single_appearance_labels):
    """Remove samples with single-appearance labels from both train and test sets."""
    for split in ['train', 'test']:
        dataset[split] = dataset[split].filter(lambda x: all(label not in single_appearance_labels for label in x['topics']))
    return dataset

dataset = load_dataset("reuters21578", "ModApte")

# Find and Remove Single Appearance Labels
y_train = [item['topics'] for item in dataset['train']]
single_appearance_labels = find_single_appearance_labels(y_train)
print(f"Single appearance labels: {single_appearance_labels}")
# Output:
# Single appearance labels: ['lin-oil', 'rye', 'red-bean', 'groundnut-oil', 'citruspulp', 'rape-meal', 'corn-oil', 'peseta', 'cotton-oil', 'ringgit', 'castorseed', 'castor-oil', 'lit', 'rupiah', 'skr', 'nkr', 'dkr', 'sun-meal', 'lin-meal', 'cruzado']

print("Removing samples with single-appearance labels...")
dataset = remove_single_appearance_labels(dataset, single_appearance_labels)

unique_labels = set(chain.from_iterable(dataset['train']["topics"]))
print(f"We have {len(unique_labels)} unique labels:\n{unique_labels}")
# Output:
# We have 95 unique labels:
# {'veg-oil', 'gold', 'platinum', 'ipi', 'acq', 'carcass', 'wool', 'coconut-oil', 'linseed', 'copper', 'soy-meal', 'jet', 'dlr', 'copra-cake', 'hog', 'rand', 'strategic-metal', 'can', 'tea', 'sorghum', 'livestock', 'barley', 'lumber', 'earn', 'wheat', 'trade', 'soy-oil', 'cocoa', 'inventories', 'income', 'rubber', 'tin', 'iron-steel', 'ship', 'rapeseed', 'wpi', 'sun-oil', 'pet-chem', 'palmkernel', 'nat-gas', 'gnp', 'l-cattle', 'propane', 'rice', 'lead', 'alum', 'instal-debt', 'saudriyal', 'cpu', 'jobs', 'meal-feed', 'oilseed', 'dmk', 'plywood', 'zinc', 'retail', 'dfl', 'cpi', 'crude', 'pork-belly', 'gas', 'money-fx', 'corn', 'tapioca', 'palladium', 'lei', 'cornglutenfeed', 'sunseed', 'potato', 'silver', 'sugar', 'grain', 'groundnut', 'naphtha', 'orange', 'soybean', 'coconut', 'stg', 'cotton', 'yen', 'rape-oil', 'palm-oil', 'oat', 'reserves', 'housing', 'interest', 'coffee', 'fuel', 'austdlr', 'money-supply', 'heat', 'fishmeal', 'bop', 'nickel', 'nzdlr'}
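
Once the label set is fixed, a multi-label classifier typically needs each article's topics as a fixed-length multi-hot vector. The source does not show this step, so the following is only a sketch of one common approach; `build_label_index` and `multi_hot` are hypothetical helpers, and the toy topic lists stand in for the `topics` column.

```python
from itertools import chain


def build_label_index(topic_lists):
    """Map each label to a stable integer id (sorted for reproducibility)."""
    labels = sorted(set(chain.from_iterable(topic_lists)))
    return {label: idx for idx, label in enumerate(labels)}


def multi_hot(topics, label2id):
    """Encode one article's topics as a multi-hot vector of floats."""
    vector = [0.0] * len(label2id)
    for topic in topics:
        vector[label2id[topic]] = 1.0
    return vector


# Toy stand-in for dataset['train']['topics']; an article may have no topic.
toy_topics = [["crude", "nat-gas"], ["earn"], []]
label2id = build_label_index(toy_topics)

print(label2id)                            # {'crude': 0, 'earn': 1, 'nat-gas': 2}
print(multi_hot(toy_topics[0], label2id))  # [1.0, 0.0, 1.0]
```

Floats are used rather than integers because binary cross-entropy losses generally expect floating-point targets.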

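Training procedure

Exploratory data analysis of the Reuters-21578 dataset: This notebook provides an exploratory data analysis (EDA) of the Reuters-21578 dataset.
Scikit-Learn baseline model for Reuters: This notebook builds a baseline text classification model for the Reuters-21578 dataset using Scikit-learn.
Transformer model for Reuters: This notebook covers advanced text classification on the Reuters-21578 dataset using a transformer model.
Multi-label stratified sampling and hyperparameter search for the Reuters dataset: This notebook covers multi-label stratified sampling and hyperparameter search using the Hugging Face Trainer API.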
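
The notebooks themselves are not reproduced here. As background, multi-label fine-tuning is commonly trained with binary cross-entropy over per-label logits, which is also the loss the transformers library applies when a sequence classification model is configured with problem_type="multi_label_classification". Whether these notebooks use exactly that setup is an assumption. The sketch below computes the loss in plain Python for one example; `bce_with_logits` is a hypothetical helper, not a library function.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Toy logits for three labels and a multi-hot target (assumed values).
logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]
print(f"loss = {bce_with_logits(logits, targets):.4f}")
```

Each label contributes an independent binary term, which is what allows several topics to be predicted for the same article.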

Evaluation results

Transformer model evaluation results

Classification report:

Label	Precision	Recall	F1-score	Support
acq	0.97	0.93	0.95	719
alum	1.00	0.70	0.82	23
austdlr	0.00	0.00	0.00	0
barley	1.00	0.50	0.67	12
bop	0.79	0.50	0.61	30
can	0.00	0.00	0.00	0
carcass	0.67	0.67	0.67	18
cocoa	1.00	1.00	1.00	18
coconut	0.00	0.00	0.00	2
coconut-oil	0.00	0.00	0.00	2
coffee	0.86	0.89	0.87	27
copper	1.00	0.78	0.88	18
copra-cake	0.00	0.00	0.00	1
corn	0.84	0.87	0.86	55
cornglutenfeed	0.00	0.00	0.00	0
cotton	0.92	0.67	0.77	18
cpi	0.86	0.43	0.57	28
cpu	0.00	0.00	0.00	1
crude	0.87	0.93	0.90	189
dfl	0.00	0.00	0.00	1
dlr	0.72	0.64	0.67	44
dmk	0.00	0.00	0.00	4
earn	0.98	0.99	0.98	1087
fishmeal	0.00	0.00	0.00	0
fuel	0.00	0.00	0.00	10
gas	0.80	0.71	0.75	17
gnp	0.79	0.66	0.72	35
gold	0.95	0.67	0.78	30
grain	0.94	0.92	0.93	146
groundnut	0.00	0.00	0.00	4
heat	0.00	0.00	0.00	5
hog	1.00	0.33	0.50	6
housing	0.00	0.00	0.00	4
income	0.00	0.00	0.00	7
instal-debt	0.00	0.00	0.00	1
interest	0.89	0.67	0.77	131
inventories	0.00	0.00	0.00	0
ipi	1.00	0.58	0.74	12
iron-steel	0.90	0.64	0.75	14
jet	0.00	0.00	0.00	1
jobs	0.92	0.57	0.71	21
l-cattle	0.00	0.00	0.00	2
lead	0.00	0.00	0.00	14
lei	0.00	0.00	0.00	3
linseed	0.00	0.00	0.00	0
livestock	0.63	0.79	0.70	24
lumber	0.00	0.00	0.00	6
meal-feed	0.00	0.00	0.00	17
money-fx	0.78	0.81	0.80	177
money-supply	0.80	0.71	0.75	34
naphtha	0.00	0.00	0.00	4
nat-gas	0.82	0.60	0.69	30
nickel	0.00	0.00	0.00	1
nzdlr	0.00	0.00	0.00	2
oat	0.00	...	...	...