bart-base-detox: an open-source text detoxification model that rewrites toxic text into neutral wording, free to use

Bart Base Detox

Developed by s-nlp

A text detoxification model based on the BART architecture that rewrites toxic text into neutral wording.

Machine Translation

Transformers

English #Text detoxification #Parallel-data training #Toxic content rewriting

Downloads: 2,039

Release date: 3/2/2022

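Model Overview

This model is based on the BART architecture, was trained on the ParaDetox parallel detoxification dataset, and is designed specifically for text detoxification. It rewrites text containing offensive or inappropriate language into neutral wording.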

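Model Features

Parallel-data training

Trained on the ParaDetox parallel dataset, which contains more than 10,000 toxic-neutral sentence pairs.

State-of-the-art performance

According to the ParaDetox paper, models trained on parallel data outperform state-of-the-art unsupervised models on the text detoxification task by a large margin.

Applicable across domains

Can be applied to text from a variety of settings, such as social media posts and forum comments.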
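
To make "parallel-data training" concrete, the sketch below shows one way a (toxic, neutral) sentence pair could be encoded for sequence-to-sequence fine-tuning. It is not the authors' training code: the helper names `build_training_example` and `build_training_set` and the `max_length=64` default are assumptions made for illustration. The only library behavior relied on is that a Hugging Face tokenizer called with `text_target` returns the encoded target under the `labels` key.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_training_example(
    toxic: str, neutral: str, tokenizer: Any, max_length: int = 64
) -> Dict[str, List[int]]:
    """Encode one (toxic, neutral) pair for seq2seq fine-tuning (hypothetical helper)."""
    # The toxic sentence is the encoder input; the neutral paraphrase is the
    # decoder target. `text_target` asks the tokenizer to encode the target
    # as well and return it under "labels".
    encoded = tokenizer(
        toxic, text_target=neutral, max_length=max_length, truncation=True
    )
    return {
        "input_ids": list(encoded["input_ids"]),
        "attention_mask": list(encoded["attention_mask"]),
        "labels": list(encoded["labels"]),
    }


def build_training_set(
    pairs: Iterable[Tuple[str, str]], tokenizer: Any, max_length: int = 64
) -> List[Dict[str, List[int]]]:
    """Encode every (toxic, neutral) pair in `pairs`."""
    return [build_training_example(t, n, tokenizer, max_length) for t, n in pairs]
```

With the real tokenizer this would be called as `build_training_set(pairs, AutoTokenizer.from_pretrained('facebook/bart-base'))`, where `pairs` holds sentence pairs taken from the parallel dataset.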

Model Capabilities

Text detoxification

Text rewriting

Neutral wording generation

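Use Cases

Content moderation

Social media comment detoxification

Automatically identify and rewrite offensive comments on social media

Rewrite toxic comments into neutral wording while preserving the original meaning

Online community management

Forum post detoxification

Automatically process inappropriate posts in forums

Help keep community discussions healthy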
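
The moderation use cases above can be sketched as a small pipeline that rewrites only the comments flagged as toxic and passes the rest through unchanged. This is an illustration under stated assumptions, not a documented workflow: the page does not say how toxic comments are detected, so `is_toxic` stands in for a separate detector of your choice, `rewrite` stands in for a function that wraps the detox model, and `ModeratedComment` and `moderate_comments` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ModeratedComment:
    original: str    # text as submitted
    published: str   # text to display after moderation
    rewritten: bool  # True if the rewriter changed the comment


def moderate_comments(
    comments: Iterable[str],
    is_toxic: Callable[[str], bool],  # assumed: external toxicity detector
    rewrite: Callable[[str], str],    # assumed: wrapper around the detox model
) -> List[ModeratedComment]:
    """Rewrite flagged comments and leave all other comments untouched."""
    results = []
    for comment in comments:
        if is_toxic(comment):
            neutral = rewrite(comment)
            results.append(ModeratedComment(comment, neutral, neutral != comment))
        else:
            results.append(ModeratedComment(comment, comment, False))
    return results
```

Keeping the original text next to the published text makes it possible to audit rewrites later, which matters because the rewriter is not guaranteed to preserve meaning in every case.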

🚀 s-nlp/bart-base-detox

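This model is specialized for detoxification, that is, rewriting toxic expressions into harmless ones. It was trained on the parallel dataset ParaDetox and achieves state-of-the-art (SOTA) results on the detoxification task.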

🚀 Quick Start

Follow the steps below to use this model.

Basic usage

from transformers import BartForConditionalGeneration, AutoTokenizer

# The tokenizer is loaded from the base BART checkpoint; the weights come from the detox model.
base_model_name = 'facebook/bart-base'
model_name = 's-nlp/bart-base-detox'
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Encode one toxic sentence as a batch of size 1 (PyTorch tensors).
input_ids = tokenizer.encode('This is completely idiotic!', return_tensors='pt')
# Generate a single rewrite of at most 50 tokens.
output_ids = model.generate(input_ids, max_length=50, num_return_sequences=1)
# Decode the generated ids, dropping special tokens.
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
# This is unwise!
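
For more than one sentence at a time, the same calls can be wrapped in a batch helper. The sketch below is one possible arrangement, not part of the model card: `load_detox_model` and `detoxify_batch` are hypothetical names, and `num_beams=4` is an assumed setting rather than a value recommended by the authors. The `transformers` import is placed inside the loader so that the helper itself can be exercised without downloading anything.

```python
from typing import Any, List, Sequence, Tuple


def load_detox_model(
    model_name: str = "s-nlp/bart-base-detox",
    tokenizer_name: str = "facebook/bart-base",
) -> Tuple[Any, Any]:
    """Load the tokenizer and the detox model (downloads on first use)."""
    # Imported lazily so that defining these helpers does not require transformers.
    from transformers import AutoTokenizer, BartForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def detoxify_batch(
    texts: Sequence[str],
    tokenizer: Any,
    model: Any,
    max_length: int = 50,
    num_beams: int = 4,  # assumed value, not taken from the model card
) -> List[str]:
    """Rewrite each input sentence and return one output string per input."""
    if not texts:
        return []
    # Pad to the longest sentence so the batch forms a rectangular tensor.
    batch = tokenizer(
        list(texts),
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    output_ids = model.generate(**batch, max_length=max_length, num_beams=num_beams)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Typical use would be `tokenizer, model = load_detox_model()` followed by `detoxify_batch(['This is completely idiotic!'], tokenizer, model)`. Beam search can produce a different rewrite from the default decoding used in the basic example.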

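✨ Key Features

This model was introduced in the paper "ParaDetox: Detoxification with Parallel Data".
A BART (base) model was trained on the parallel detoxification dataset ParaDetox and achieves SOTA results on the detoxification task.
Further details, code, and data are published with the paper (see the citation below).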

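📦 Installation

This model requires the transformers library. The basic usage example also requests PyTorch tensors (return_tensors='pt'), so PyTorch must be installed as well. Install transformers with the following command: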

pip install transformers
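
After installing, a quick check like the one below can confirm which versions are present. It is a convenience sketch only: `installed_versions` is a hypothetical helper, and listing `torch` assumes PyTorch is the backend, as in the basic usage example.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_versions(
    packages: Sequence[str] = ("transformers", "torch"),
) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```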

📚 Documentation

Model information

Attribute	Details
Model type	BART (base)
Training data	s-nlp/paradetox
Base model	facebook/bart-base
License	OpenRAIL++

Citation

If you use this model, please cite the following paper.

@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara  and
      Dementieva, Daryna  and
      Ustyantsev, Sergey  and
      Moskovskiy, Daniil  and
      Dale, David  and
      Krotova, Irina  and
      Semenov, Nikita  and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.",
}