bart-base-detox: Open-Source Text Detoxification Model - Rewrite Toxic Text as Neutral Text for Free

Bart Base Detox

Developed by s-nlp

A BART-based text detoxification model that rewrites toxic text as neutral text.

Machine Translation

Transformers

English #TextDetoxification #ParallelDataTraining #ToxicContentRewriting

Downloads: 2,039

Released: 3/2/2022

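Model Overview

The model uses the BART architecture and was trained on the ParaDetox parallel detoxification dataset. It is designed specifically for text detoxification: it rewrites text containing offensive or inappropriate language as neutral text.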
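To make "parallel data" concrete, the sketch below shows one way toxic-neutral rows could be turned into source/target pairs for a sequence-to-sequence model. The `DetoxPair` class, the `to_pairs` helper, and the sample rows are hypothetical and for illustration only; the column names `en_toxic_comment` and `en_neutral_comment` are an assumption about the dataset schema and should be checked against the actual data.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class DetoxPair:
    """One parallel example (hypothetical container, not part of any library)."""
    toxic: str    # source side, fed to the encoder
    neutral: str  # target side, which the decoder learns to produce


def to_pairs(rows: Iterable[Dict[str, str]],
             toxic_key: str = "en_toxic_comment",      # assumed column name
             neutral_key: str = "en_neutral_comment",  # assumed column name
             ) -> List[DetoxPair]:
    """Convert row dicts to pairs, dropping empty or unchanged rows."""
    pairs = []
    for row in rows:
        toxic = row[toxic_key].strip()
        neutral = row[neutral_key].strip()
        if toxic and neutral and toxic != neutral:
            pairs.append(DetoxPair(toxic, neutral))
    return pairs


# Illustrative rows only; they are not taken from ParaDetox.
rows = [
    {"en_toxic_comment": "This is completely idiotic!",
     "en_neutral_comment": "This is unwise!"},
    {"en_toxic_comment": "   ", "en_neutral_comment": "Nothing here."},
]
pairs = to_pairs(rows)
print(pairs[0].toxic, "->", pairs[0].neutral)
```

Each pair maps directly onto a supervised training example: the toxic sentence is the input and the neutral paraphrase is the label, which is what distinguishes this setup from unsupervised style transfer.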

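Model Features

Parallel data training

Trained on the ParaDetox parallel dataset, which contains more than 10,000 toxic-neutral sentence pairs

State-of-the-art performance

Outperforms unsupervised models on the text detoxification task

Applicable across domains

Can be used to detoxify text in settings such as social media and forum comments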

Model Capabilities

Text detoxification

Text rewriting

Neutral text generation

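Use Cases

Content moderation

Social media comment detoxification

Automatically detects and rewrites offensive comments on social media

Converts toxic comments into neutral wording while preserving the original meaning

Online community management

Forum post detoxification

Automatically handles inappropriate posts in forums

Helps keep community discussion civil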
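The moderation use cases above can be sketched as a small pipeline: flag a comment, rewrite it if flagged, and publish the result. Everything here is an assumption made for illustration: `FLAGGED_WORDS` is a placeholder word list standing in for a real toxicity classifier, and `rewrite` is any callable that maps a string to a string (for example a wrapper around the detoxification model). The stub rewriter exists only so the sketch runs without downloading weights.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical placeholder; a real system would use a trained toxicity classifier.
FLAGGED_WORDS = {"idiotic", "stupid"}


def looks_toxic(text: str) -> bool:
    """Very rough check: does any word match the placeholder list?"""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return not words.isdisjoint(FLAGGED_WORDS)


def moderate(comments: Iterable[str],
             rewrite: Callable[[str], str]) -> List[Dict[str, object]]:
    """Rewrite only the flagged comments and keep the original for auditing."""
    results = []
    for comment in comments:
        flagged = looks_toxic(comment)
        results.append({
            "original": comment,
            "flagged": flagged,
            "published": rewrite(comment) if flagged else comment,
        })
    return results


def stub_rewrite(text: str) -> str:
    """Stand-in for the model call so the example is self-contained."""
    return "[rewritten] " + text


report = moderate(["Nice photo!", "This is completely idiotic!"], stub_rewrite)
for item in report:
    print(item["flagged"], "|", item["published"])
```

Keeping the original text next to the published version lets moderators review what the model changed, which matters because a rewrite is not guaranteed to preserve meaning.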

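🚀 Detoxification Model (bart-base-detox)

This is a model for the text detoxification task. It is a BART (base) model trained on the parallel detoxification dataset ParaDetox, and it achieved state-of-the-art results on the detoxification task.

🚀 Quick Start

The model was introduced in the paper "ParaDetox: Detoxification with Parallel Data". It is based on BART (base) and trained on the parallel detoxification dataset ParaDetox; the paper reports state-of-the-art results on the detoxification task. Further details, code, and data are provided by the authors alongside the paper.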

💻 Usage Examples

Basic usage

from transformers import BartForConditionalGeneration, AutoTokenizer

# The tokenizer comes from the base model; the fine-tuned weights come from s-nlp.
base_model_name = 'facebook/bart-base'
model_name = 's-nlp/bart-base-detox'
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Encode one toxic sentence, generate a single rewrite, and decode it.
input_ids = tokenizer.encode('This is completely idiotic!', return_tensors='pt')
output_ids = model.generate(input_ids, max_length=50, num_return_sequences=1)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
# This is unwise!
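For more than one sentence at a time, a batched variant with beam search may be more convenient. The following is a minimal sketch, not the authors' code: the helper names `load_detox_model` and `detoxify_batch` are hypothetical, and the defaults `num_beams=5` and `max_length=50` are assumptions rather than recommended settings. The `transformers` import sits inside the loader so the batching helper can be read and tested on its own.

```python
from typing import List


def load_detox_model(model_name: str = "s-nlp/bart-base-detox",
                     tokenizer_name: str = "facebook/bart-base"):
    """Load tokenizer and weights (downloads them on first use)."""
    # Imported lazily so detoxify_batch can be used with any compatible objects.
    from transformers import AutoTokenizer, BartForConditionalGeneration
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def detoxify_batch(texts: List[str], tokenizer, model,
                   num_beams: int = 5, max_length: int = 50) -> List[str]:
    """Rewrite a batch of sentences; returns one output string per input."""
    # Padding aligns sentences of different lengths; the attention mask
    # tells the model which positions are padding.
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    output_ids = model.generate(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
        num_beams=num_beams,
        max_length=max_length,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


# Usage (requires the transformers and torch packages):
# tokenizer, model = load_detox_model()
# print(detoxify_batch(["This is completely idiotic!"], tokenizer, model))
```

Passing the attention mask explicitly matters once inputs are padded; without it the model would attend to padding tokens, which can change the generated text.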

📚 Detailed Documentation

Model Information

Property	Details
Model type	BART (base)
Training data	s-nlp/paradetox
Base model	facebook/bart-base
License	OpenRAIL++

Citation

@inproceedings{logacheva-etal-2022-paradetox,
    title = "{P}ara{D}etox: Detoxification with Parallel Data",
    author = "Logacheva, Varvara  and
      Dementieva, Daryna  and
      Ustyantsev, Sergey  and
      Moskovskiy, Daniil  and
      Dale, David  and
      Krotova, Irina  and
      Semenov, Nikita  and
      Panchenko, Alexander",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.469",
    pages = "6804--6818",
    abstract = "We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.",
}