deberta-v3-base-zeroshot-v2.0: open-source text classification model for zero-shot use, no training data required

Deberta V3 Base Zeroshot V2.0

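Developed by MoritzLaurer

A zero-shot classification model based on the DeBERTa-v3-base architecture, designed for text classification tasks where no training data is available

Text classification

Transformers

Language: English | License: MIT | #zero-shot-classification #general-purpose-multi-task #commercially-friendly-data

Downloads: 7,845

Released: 3/28/2024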

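Model overview

This model is part of the zeroshot-v2.0 series, which is trained on synthetic data and NLI datasets. Variants with "-c" in the name use only commercially friendly data. The models run zero-shot classification efficiently on both GPUs and CPUs.

Model features

Commercially friendly training data

Synthetic data generated with Mixtral-8x7B-Instruct plus commercially friendly NLI datasets; the "-c" variants are trained on this data only

Zero-shot classification

Performs text classification without any training data

Single-label and multi-label support

Supports both single-label and multi-label classification modes

Strong performance

Outperforms the facebook/bart-large-mnli baseline across 28 text classification tasks

Model capabilities

Text classification

Zero-shot inference

Multi-class prediction

Natural language understanding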

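Use cases

Sentiment analysis

Product review classification

Automatically classifies product reviews on e-commerce platforms as positive or negative

Reaches an F1 score of 0.937 on the Amazon polarity dataset

Movie review analysis

Identifies the sentiment of IMDB movie reviews

Reaches an F1 score of 0.893 on the IMDB dataset

Content moderation

Toxic content detection

Identifies toxic content in text, such as hate speech and insults

Reaches an F1 score of 0.759 on the Wikipedia toxicity (insult) dataset

Bias detection

Detects gender-biased content in text

Reaches an F1 score of 0.741 on the Bias Frames (gender) dataset

Financial analysis

Financial news classification

Classifies the sentiment of financial news (positive/neutral/negative)

Reaches an F1 score of 0.714 on the Financial PhraseBank dataset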

🚀 deberta-v3-base-zeroshot-v2.0

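The deberta-v3-base-zeroshot-v2.0 model focuses on zero-shot classification: it classifies text efficiently without training data and runs on both GPUs and CPUs. Some models in the series are trained on fully commercially friendly data, which suits users with strict license requirements.

🚀 Quick start

The model can be used for zero-shot classification through the Hugging Face pipeline. It needs no training data and runs on both GPUs and CPUs. A simple usage example: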

#!pip install transformers[sentencepiece]
from transformers import pipeline
text = "Angela Merkel is a politician in Germany and leader of the CDU"
hypothesis_template = "This text is about {}"
classes_verbalized = ["politics", "economy", "entertainment", "environment"]
zeroshot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0")  # change the model identifier here, e.g. to MoritzLaurer/deberta-v3-base-zeroshot-v2.0
output = zeroshot_classifier(text, classes_verbalized, hypothesis_template=hypothesis_template, multi_label=False)
print(output)
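
The multi_label argument in the call above changes how scores are normalized. The sketch below illustrates the idea in plain Python, assuming the pipeline behaves roughly as follows: with multi_label=False the entailment logits of all labels are put through one softmax, so the scores sum to 1; with multi_label=True each label is scored independently by a softmax over its own (not_entailment, entailment) pair. The logits are made-up numbers for illustration, not model output, and the helper names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (not_entailment, entailment) logits per candidate label.
logits = {
    "politics": (-2.0, 3.0),
    "economy": (1.0, -0.5),
    "environment": (2.5, -2.0),
}


def single_label_scores(label_logits):
    """One softmax across the entailment logits of all labels (scores sum to 1)."""
    labels = list(label_logits)
    probs = softmax([label_logits[label][1] for label in labels])
    return dict(zip(labels, probs))


def multi_label_scores(label_logits):
    """Independent softmax per label over (not_entailment, entailment)."""
    return {label: softmax(list(pair))[1] for label, pair in label_logits.items()}


single = single_label_scores(logits)
multi = multi_label_scores(logits)
print(single)
print(multi)
```

In multi-label mode several labels (or none) can score above 0.5 for the same text, which is usually what you want when categories are not mutually exclusive.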

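✨ Key features

Zero-shot classification: completes classification tasks without training data.
Runs on GPU and CPU: supports both GPU and CPU inference.
Commercially friendly training data: some models are trained on fully commercially friendly data to meet strict license requirements.
Universal classification task: any classification task can be recast as deciding whether a hypothesis is "true".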

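📚 Documentation

Overview of the zeroshot-v2.0 model series

The models in this series are designed for efficient zero-shot classification with the Hugging Face pipeline. They can classify without training data and run on both GPUs and CPUs. An overview of the latest zero-shot classifiers is available in the zero-shot classifier collection.

The main update in the zeroshot-v2.0 series is that several models are trained on fully commercially friendly data for users with strict license requirements. These models can perform one universal classification task: given a text, decide whether a hypothesis is "true" or "not true" (entailment vs. not_entailment). This task format is based on the natural language inference (NLI) task, and the Hugging Face pipeline can reformulate any classification task into it.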
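
To make the reformulation concrete, here is a minimal sketch of how a classification task can be turned into NLI pairs: each candidate label is inserted into a hypothesis template, and the model then judges every (text, hypothesis) pair as entailment or not_entailment. The function name build_nli_pairs is hypothetical and only illustrates the idea; the pipeline performs this step internally.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template="This text is about {}"):
    """Return one (premise, hypothesis) pair per candidate label."""
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs(
    "Angela Merkel is a politician in Germany and leader of the CDU",
    ["politics", "economy", "entertainment", "environment"],
)
for premise, hypothesis in pairs:
    print(hypothesis)
```

The label whose hypothesis receives the highest entailment score is taken as the predicted class.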

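Training data

Models with "-c" in the name are trained on two types of fully commercially friendly data:

Synthetic data generated with Mixtral-8x7B-Instruct-v0.1. A list of more than 500 diverse text classification tasks for 25 professions was first created in conversations with Mistral-large and then manually curated. This seed data was used to generate several hundred thousand texts for these tasks with Mixtral-8x7B-Instruct-v0.1. The final dataset is available in the mixtral_written_text_for_tasks_v4 subset of the synthetic_zeroshot_mixtral_v0.1 dataset. Data curation went through multiple iterations and will be improved further.
Two commercially friendly NLI datasets (MNLI, FEVER-NLI). These datasets were added to improve generalization.

Models without "-c" in the name also include a broader mix of training data with a broader mix of licenses, such as ANLI, WANLI, LingNLI, and all datasets in this list where used_in_v1.1==True.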

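When to use which model

deberta-v3-zeroshot vs. roberta-zeroshot: deberta-v3 performs clearly better than roberta, but it is somewhat slower. roberta is directly compatible with Hugging Face's production inference TEI containers and flash attention, and these containers are a good choice for production use cases. In short: for accuracy, use a deberta-v3 model; if production inference speed is a concern, consider a roberta model (e.g. in a TEI container and HF Inference Endpoints).
Commercial use cases: models with "-c" in the name are guaranteed to be trained only on commercially friendly data. Models without "-c" are trained on more data and perform better, but that data includes datasets with non-commercial licenses. Legal opinions diverge on whether such training data affects the license of the trained model. Users with strict legal requirements are advised to use the models with "-c" in the name.
Multilingual/non-English use cases: use bge-m3-zeroshot-v2.0 or bge-m3-zeroshot-v2.0-c. Note that multilingual models perform worse than English-only models. You can therefore also machine-translate your texts into English first, with a library such as EasyNMT, and then apply any English-only model to the translated data. Machine translation also makes validation easier if your team does not speak all the languages in the data.
Context window: the bge-m3 models can process up to 8192 tokens; the other models can process up to 512 tokens. Note that longer text inputs make the model slower and decrease performance, so if you only work with texts of up to 400 words (about one page), a deberta model is recommended for better performance.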
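
The guidance above can be condensed into a small decision helper. This is only a sketch of the rules of thumb in this section, not an official selection tool: the function name, its parameters, and the returned dictionary are hypothetical, and the 400-word threshold is taken from the context-window note. It returns a model family and whether to pick the "-c" variant rather than a concrete model identifier.

```python
def suggest_model_family(multilingual=False, commercial_only=False,
                         max_words=400, prefer_speed=False):
    """Map the rules of thumb from this section to a model family."""
    if multilingual or max_words > 400:
        # bge-m3 handles non-English text and up to 8192 tokens.
        family = "bge-m3"
    elif prefer_speed:
        # roberta works with TEI containers and flash attention.
        family = "roberta"
    else:
        # deberta-v3 is the most accurate option for short English text.
        family = "deberta-v3"
    # Strict license requirements call for a "-c" model.
    return {"family": family, "use_c_variant": commercial_only}


print(suggest_model_family())
print(suggest_model_family(multilingual=True, commercial_only=True))
print(suggest_model_family(prefer_speed=True))
```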

The latest model updates are listed in the zero-shot classifier collection.

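Reproducibility

Reproduction code is available in this directory: https://github.com/MoritzLaurer/zeroshot-classifier/tree/main

Limitations and bias

The model can only do text classification tasks. Biases can come from the underlying foundation model, the human NLI training data, and the synthetic data generated by Mixtral.

Flexible usage and "prompting"

You can formulate your own hypotheses by changing the hypothesis_template of the zero-shot pipeline. Similar to "prompt engineering" for large language models, you can test different formulations of your hypothesis_template and verbalized classes to improve performance.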

from transformers import pipeline
text = "Angela Merkel is a politician in Germany and leader of the CDU"
# Formulation 1
hypothesis_template = "This text is about {}"
classes_verbalized = ["politics", "economy", "entertainment", "environment"]
# Formulation 2, depending on your use case (overrides formulation 1; comment it out to test formulation 1)
hypothesis_template = "The topic of this text is {}"
classes_verbalized = ["political activities", "economic policy", "entertainment or music", "environmental protection"]
# Test the different formulations
zeroshot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0")  # change the model identifier here
output = zeroshot_classifier(text, classes_verbalized, hypothesis_template=hypothesis_template, multi_label=False)
print(output)
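
To compare several formulations side by side instead of overwriting one with the other, a small loop is enough. The sketch below assumes the classifier is any callable with the same call signature as the pipeline in the snippet above and that it returns a dict whose "labels" and "scores" lists are sorted from best to worst. The helper compare_formulations and the stand-in fake_classifier are hypothetical; the stand-in only mimics the output shape with made-up scores so the example runs without downloading a model.

```python
def compare_formulations(classifier, text, formulations):
    """Run each (template, labels) pair and collect the top label and score."""
    results = []
    for template, labels in formulations:
        output = classifier(text, labels, hypothesis_template=template, multi_label=False)
        results.append((template, output["labels"][0], output["scores"][0]))
    return results


def fake_classifier(text, labels, hypothesis_template, multi_label=False):
    """Stand-in for the real pipeline: keeps label order, scores are made up."""
    raw = [1.0 / (i + 1) for i in range(len(labels))]
    total = sum(raw)
    return {"sequence": text, "labels": list(labels), "scores": [r / total for r in raw]}


formulations = [
    ("This text is about {}", ["politics", "economy", "entertainment", "environment"]),
    ("The topic of this text is {}",
     ["political activities", "economic policy", "entertainment or music", "environmental protection"]),
]

text = "Angela Merkel is a politician in Germany and leader of the CDU"
for template, top_label, top_score in compare_formulations(fake_classifier, text, formulations):
    print(f"{template!r}: {top_label} ({top_score:.3f})")
```

With a real model, pass zeroshot_classifier in place of fake_classifier and judge the formulations on a small labeled sample rather than on a single text.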

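🔧 Technical details

Evaluation metrics

The models were evaluated on 28 different text classification tasks with the f1_macro metric. The main reference point is facebook/bart-large-mnli, which was, at the time of writing (April 3, 2024), the most used commercially friendly zero-shot classifier.

[Image: results_aggreg_v2.0]

These numbers indicate zero-shot performance, as no data from these datasets was added to the training mix. Note that models without "-c" in the name were evaluated twice: once without any data from these 28 datasets, to test pure zero-shot performance (the first number in the respective column), and a second time including up to 500 training data points per class from each of the 28 datasets (the second number in brackets in the column, "fewshot"). No model was trained on test data.

Details on the individual datasets are available here: https://github.com/MoritzLaurer/zeroshot-classifier/blob/main/v1_human_data/datasets_overview.csv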
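
For readers unfamiliar with f1_macro: it computes an F1 score per class and averages the scores with equal weight, so rare classes count as much as frequent ones. The sketch below is a plain-Python illustration of the metric on made-up labels, not the evaluation code used for the results above.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up gold labels and predictions for illustration.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
print(round(f1_macro(y_true, y_pred), 4))
```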

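📄 License

The foundation model is released under the MIT license. The licenses of the training data vary depending on the model, see above.

📖 Citation

This model is an extension of the research described in this paper.

If you use this model academically, please cite: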

@misc{laurer_building_2023,
	title = {Building {Efficient} {Universal} {Classifiers} with {Natural} {Language} {Inference}},
	url = {http://arxiv.org/abs/2312.17543},
	doi = {10.48550/arXiv.2312.17543},
	abstract = {Generative Large Language Models (LLMs) have become the mainstream choice for fewshot and zeroshot learning thanks to the universality of text generation. Many users, however, do not need the broad capabilities of generative LLMs when they only want to automate a classification task. Smaller BERT-like models can also learn universal tasks, which allow them to do any text classification task without requiring fine-tuning (zeroshot classification) or to learn new tasks with only a few examples (fewshot), while being significantly more efficient than generative LLMs. This paper (1) explains how Natural Language Inference (NLI) can be used as a universal classification task that follows similar principles as instruction fine-tuning of generative LLMs, (2) provides a step-by-step guide with reusable Jupyter notebooks for building a universal classifier, and (3) shares the resulting universal classifier that is trained on 33 datasets with 389 diverse classes. Parts of the code we share has been used to train our older zeroshot classifiers that have been downloaded more than 55 million times via the Hugging Face Hub as of December 2023. Our new classifier improves zeroshot performance by 9.4\%.},
	urldate = {2024-01-05},
	publisher = {arXiv},
	author = {Laurer, Moritz and van Atteveldt, Wouter and Casas, Andreu and Welbers, Kasper},
	month = dec,
	year = {2023},
	note = {arXiv:2312.17543 [cs]},
	keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language},
}

Ideas for cooperation or questions?

If you have questions or ideas for cooperation, contact me at moritz{at}huggingface{dot}co or on LinkedIn.

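📋 Model information table

Property	Details
Model type	deberta-v3-base model for zero-shot classification
Training data	Models with "`-c`" in the name are trained on two types of fully commercially friendly data: synthetic data generated with Mixtral-8x7B-Instruct-v0.1, and two commercially friendly NLI datasets (MNLI, FEVER-NLI). Models without "`-c`" in the name also include a broader mix of training data with a broader mix of licenses.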