bge-m3-zeroshot-v2.0-c: an open-source multilingual zero-shot text classification model

Bge M3 Zeroshot V2.0 C

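Developed by MoritzLaurer

A multilingual zero-shot text classification model trained on top of BAAI/bge-m3-retromae and designed for commercially friendly use cases

Text classification

Transformers

License: MIT #zero-shot-classification #multilingual #commercially-friendly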

Downloads: 67

Released: 4/1/2024

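Model overview

The model uses the natural language inference (NLI) task format to perform zero-shot classification without any training data, and is suited to multilingual text classification tasks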

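Model features

Commercially friendly training data

Trained only on fully commercially friendly synthetic data and public NLI datasets

Multilingual support

Supports text classification in multiple languages

Long-text processing

Supports a context window of 8192 tokens, which suits longer texts

Zero-shot learning

Performs classification without any training data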

Model capabilities

Multilingual text classification

Zero-shot learning

Natural language inference

Long-text processing

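Use cases

Content moderation

Harmful content detection

Identifies toxic, obscene, threatening, and similar content in text

Scores 0.736 f1_macro on the wikitoxic_toxicaggreg dataset

Sentiment analysis

Product review classification

Classifies the sentiment polarity of user reviews from platforms such as Yelp

Scores 0.973 f1_macro on the yelpreviews dataset

Topic classification

News classification

Assigns news articles to topic categories

Scores 0.687 f1_macro on the agnews dataset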

🚀 bge-m3-zeroshot-v2.0-c

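bge-m3-zeroshot-v2.0-c is a model designed for zero-shot classification. Used through the Hugging Face pipeline, it handles text classification tasks without any training data and runs on both GPUs and CPUs.

🚀 Quick start

The following snippet shows how to get started. Replace the model identifier with the model you want to use: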

#!pip install transformers[sentencepiece]
from transformers import pipeline
text = "Angela Merkel is a politician in Germany and leader of the CDU"
hypothesis_template = "This text is about {}"
classes_verbalized = ["politics", "economy", "entertainment", "environment"]
zeroshot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0")  # change the model identifier here
output = zeroshot_classifier(text, classes_verbalized, hypothesis_template=hypothesis_template, multi_label=False)
print(output)
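
The snippet above points at an English-only deberta model. As a minimal sketch of the multilingual case, the helper below builds the same pipeline for this model and classifies a German sentence. The Hub identifier MoritzLaurer/bge-m3-zeroshot-v2.0-c is assumed from the naming pattern of the identifier above. The German text, labels, and template, as well as the helper names build_classifier and demo, are illustrative choices and not part of the original card.

```python
# Assumed Hub id, following the "MoritzLaurer/<model name>" pattern used above.
MODEL_ID = "MoritzLaurer/bge-m3-zeroshot-v2.0-c"


def build_classifier(model_id=MODEL_ID, device=-1):
    """Create a zero-shot pipeline; device=-1 selects the CPU, 0 the first GPU."""
    # Imported lazily so that loading this module does not require transformers.
    from transformers import pipeline

    return pipeline("zero-shot-classification", model=model_id, device=device)


def demo():
    """Classify an illustrative German sentence (downloads the model on first use)."""
    classifier = build_classifier()
    text = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
    hypothesis_template = "In diesem Text geht es um {}"
    classes_verbalized = ["Politik", "Wirtschaft", "Unterhaltung", "Umwelt"]
    output = classifier(
        text,
        classes_verbalized,
        hypothesis_template=hypothesis_template,
        multi_label=False,
    )
    # The pipeline returns a dict with "sequence", "labels" and "scores",
    # ordered from the most likely to the least likely label.
    return output["labels"][0], output["scores"][0]
```

Calling demo() returns the top label and its score. Writing the hypothesis template in the language of the input text is one option worth testing against an English template.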

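✨ Key features

Zero-shot classification: performs classification tasks without any training data.
Multi-platform support: runs on both GPUs and CPUs.
Commercially friendly: some models in the series are trained only on fully commercially friendly data, which suits users with strict license requirements.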

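📚 Documentation

The zeroshot-v2.0 model series

The models in this series are designed for efficient zero-shot classification with the Hugging Face pipeline. They classify without training data and run on both GPUs and CPUs. An overview of the latest zero-shot classifiers is available in the zeroshot classifier collection.

The main update in the zeroshot-v2.0 series is that several models are trained on fully commercially friendly data, for users with strict license requirements.

These models perform one universal classification task: given a text, determine whether a hypothesis is "true" or "not true" (entailment vs. not_entailment). The task format is based on natural language inference (NLI). Because the task is this general, any classification task can be reformulated into it with the Hugging Face pipeline.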
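
The reformulation into NLI can be sketched in plain Python: each candidate class becomes one hypothesis, and the entailment scores of those hypotheses are compared. The logits below are made-up placeholders rather than model output, and normalising the entailment logits with a softmax is an assumption about how single-label scoring works, not a description of the pipeline's exact internals.

```python
import math


def build_hypotheses(labels, hypothesis_template="This text is about {}"):
    """Turn each candidate label into one NLI hypothesis."""
    return [hypothesis_template.format(label) for label in labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits):
    """Single-label case: normalise entailment logits across labels and sort."""
    probs = softmax(entailment_logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["politics", "economy", "entertainment", "environment"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per hypothesis. A real run would obtain
# them by scoring each (text, hypothesis) pair with the NLI model.
placeholder_logits = [4.2, 0.3, -1.5, -0.8]
ranking = rank_labels(labels, placeholder_logits)

print(hypotheses[0])
print(ranking[0][0])
```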

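Training data

Models with "-c" in the name are trained on two types of fully commercially friendly data:

Synthetic data: generated with Mixtral-8x7B-Instruct-v0.1. First, a list of more than 500 diverse text classification tasks for 25 professions was created in conversations with Mistral-large and curated manually. This list then served as seed data for generating several hundred thousand texts for these tasks with Mixtral-8x7B-Instruct-v0.1. The final dataset is available in the mixtral_written_text_for_tasks_v4 subset of the synthetic_zeroshot_mixtral_v0.1 dataset. Data curation went through several iterations and will be improved further.
Commercially friendly NLI datasets: MNLI and FEVER-NLI. These were added to improve generalization.

Models without "-c" in the name additionally use a broader training mix, including ANLI, WANLI, LingNLI, and all datasets in this list where used_in_v1.1==True.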

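When to use which model

deberta-v3-zeroshot vs. roberta-zeroshot: deberta-v3 performs clearly better than roberta but is somewhat slower. roberta is directly compatible with Hugging Face's production inference TEI containers and with flash attention, both of which target production use cases. In short: for accuracy, use a deberta-v3 model; if production inference speed is the concern, consider a roberta model (for example in a TEI container or an HF Inference Endpoint).
Commercial use cases: models with "-c" in the name are guaranteed to be trained only on commercially friendly data. Models without "-c" are trained on more data and perform better, but that data includes datasets with non-commercial licenses. Legal opinions differ on whether such training data affects the license of the trained model. Users with strict legal requirements should use the models with "-c" in the name.
Multilingual/non-English use cases: use bge-m3-zeroshot-v2.0 or bge-m3-zeroshot-v2.0-c. Note that multilingual models perform worse than English-only models. As an alternative, you can machine-translate your texts to English with a library such as EasyNMT and then apply any English-only model to the translated data. Machine translation also makes validation easier when the team does not speak every language in the data.
Context window: the bge-m3 models can process up to 8192 tokens; the other models can process up to 512 tokens. Note that longer inputs make the model slower and reduce performance, so if your texts are at most about 400 words or one page, a deberta model will give better performance.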
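
The guidance above can be condensed into a small decision helper. This is only a sketch: the function suggest_model and its flags are hypothetical, the rule order is one possible reading of the recommendations, and the large variants are picked arbitrarily from the model names in the performance table. Only "-c" roberta variants appear in that table, so the speed branch always returns one.

```python
def suggest_model(multilingual=False, long_input=False,
                  commercial_only=False, prioritize_speed=False):
    """Suggest a zeroshot-v2.0 model name from coarse requirements.

    long_input means texts beyond the 512-token limit of the non-bge-m3 models.
    """
    # Non-English text or inputs over 512 tokens call for the bge-m3 models.
    if multilingual or long_input:
        name = "bge-m3-zeroshot-v2.0"
        return name + "-c" if commercial_only else name
    # roberta is the option aimed at production inference speed.
    if prioritize_speed:
        return "roberta-large-zeroshot-v2.0-c"
    # Otherwise favour accuracy with deberta-v3.
    name = "deberta-v3-large-zeroshot-v2.0"
    return name + "-c" if commercial_only else name


print(suggest_model(multilingual=True, commercial_only=True))
print(suggest_model(prioritize_speed=True))
print(suggest_model())
```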

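Flexible usage and "prompting"

You can formulate your own hypotheses by changing the hypothesis_template of the zero-shot pipeline. As with "prompt engineering" for large language models, you can test different formulations of hypothesis_template and different verbalized classes to improve performance.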

from transformers import pipeline
text = "Angela Merkel is a politician in Germany and leader of the CDU"
# formulation 1
hypothesis_template = "This text is about {}"
classes_verbalized = ["politics", "economy", "entertainment", "environment"]
# formulation 2, depending on your use case (these assignments replace formulation 1)
hypothesis_template = "The topic of this text is {}"
classes_verbalized = ["political activities", "economic policy", "entertainment or music", "environmental protection"]
# test different formulations
zeroshot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0")  # change the model identifier here
output = zeroshot_classifier(text, classes_verbalized, hypothesis_template=hypothesis_template, multi_label=False)
print(output)
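
To compare several formulations side by side instead of overwriting one with the other, a small loop is enough. The sketch below assumes the classifier is called like the zero-shot pipeline above and returns a dict with "labels" and "scores" sorted from best to worst. The names compare_formulations and stub_classifier are hypothetical, and the stub returns made-up uniform scores so the example runs without downloading a model.

```python
def compare_formulations(classifier, text, formulations):
    """Run every (hypothesis_template, classes_verbalized) pair and keep the top label."""
    results = []
    for hypothesis_template, classes_verbalized in formulations:
        output = classifier(
            text,
            classes_verbalized,
            hypothesis_template=hypothesis_template,
            multi_label=False,
        )
        results.append({
            "template": hypothesis_template,
            "top_label": output["labels"][0],
            "top_score": output["scores"][0],
        })
    return results


def stub_classifier(text, labels, hypothesis_template, multi_label=False):
    """Stand-in that mimics the pipeline's output shape; it does NOT run a model."""
    score = 1.0 / len(labels)
    return {"sequence": text, "labels": list(labels), "scores": [score] * len(labels)}


text = "Angela Merkel is a politician in Germany and leader of the CDU"
formulations = [
    ("This text is about {}",
     ["politics", "economy", "entertainment", "environment"]),
    ("The topic of this text is {}",
     ["political activities", "economic policy", "entertainment or music", "environmental protection"]),
]
results = compare_formulations(stub_classifier, text, formulations)
for row in results:
    print(row["template"], row["top_label"], row["top_score"])
```

Passing the real zeroshot_classifier in place of stub_classifier gives the actual scores. Judging which formulation is better still requires labelled validation texts.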

🔧 Technical details

The models were evaluated with the f1_macro metric on 28 different text classification tasks. The main reference point is facebook/bart-large-mnli.
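
For readers unfamiliar with the metric, f1_macro is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal pure-Python sketch of that definition. It is not the evaluation script used for these models, which the source does not show, and the labels in the example are made up.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up gold labels and predictions for a two-class task.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
score = f1_macro(y_true, y_pred)
print(round(score, 3))
```

In this example the "pos" class has F1 = 2/3 and the "neg" class has F1 = 0.8, so the macro average is their mean.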

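Property	Details
Model type	Zero-shot classification model
Training data	Models with "-c" in the name are trained on synthetic data (generated with Mixtral-8x7B-Instruct-v0.1) and two commercially friendly NLI datasets (MNLI, FEVER-NLI); models without "-c" in the name additionally use a broader training mix, including ANLI, WANLI, LingNLI, and all datasets in this list where used_in_v1.1==True.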

Model performance metrics

	facebook/bart-large-mnli	roberta-base-zeroshot-v2.0-c	roberta-large-zeroshot-v2.0-c	deberta-v3-base-zeroshot-v2.0-c	deberta-v3-base-zeroshot-v2.0 (fewshot)	deberta-v3-large-zeroshot-v2.0-c	deberta-v3-large-zeroshot-v2.0 (fewshot)	bge-m3-zeroshot-v2.0-c	bge-m3-zeroshot-v2.0 (fewshot)
all datasets mean	0.497	0.587	0.622	0.619	0.643 (0.834)	0.676	0.673 (0.846)	0.59	(0.803)
amazonpolarity (2)	0.937	0.924	0.951	0.937	0.943 (0.961)	0.952	0.956 (0.968)	0.942	(0.951)
imdb (2)	0.892	0.871	0.904	0.893	0.899 (0.936)	0.923	0.918 (0.958)	0.873	(0.917)
appreviews (2)	0.934	0.913	0.937	0.938	0.945 (0.948)	0.943	0.949 (0.962)	0.932	(0.954)
yelpreviews (2)	0.948	0.953	0.977	0.979	0.975 (0.989)	0.988	0.985 (0.994)	0.973	(0.978)
rottentomatoes (2)	0.83	0.802	0.841	0.84	0.86 (0.902)	0.869	0.868 (0.908)	0.813	(0.866)
emotiondair (6)	0.455	0.482	0.486	0.459	0.495 (0.748)	0.499	0.484 (0.688)	0.453	(0.697)
emocontext (4)	0.497	0.555	0.63	0.59	0.592 (0.799)	0.699	0.676 (0.81)	0.61	(0.798)
empathetic (32)	0.371	0.374	0.404	0.378	0.405 (0.53)	0.447	0.478 (0.555)	0.387	(0.455)
financialphrasebank (3)	0.465	0.562	0.455	0.714	0.669 (0.906)	0.691	0.582 (0.913)	0.504	(0.895)
banking77 (72)	0.312	0.124	0.29	0.421	0.446 (0.751)	0.513	0.567 (0.766)	0.387	(0.715)
massive (59)	0.43	0.428	0.543	0.512	0.52 (0.755)	0.526	0.518 (0.789)	0.414	(0.692)
wikitoxic_toxicaggreg (2)	0.547	0.751	0.766	0.751	0.769 (0.904)	0.741	0.787 (0.911)	0.736	(0.9)
wikitoxic_obscene (2)	0.713	0.817	0.854	0.853	0.869 (0.922)	0.883	0.893 (0.933)	0.783	(0.914)
wikitoxic_threat (2)	0.295	0.71	0.817	0.813	0.87 (0.946)	0.827	0.879 (0.952)	0.68	(0.947)
wikitoxic_insult (2)	0.372	0.724	0.798	0.759	0.811 (0.912)	0.77	0.779 (0.924)	0.783	(0.915)
wikitoxic_identityhate (2)	0.473	0.774	0.798	0.774	0.765 (0.938)	0.797	0.806 (0.948)	0.761	(0.931)
hateoffensive (3)	0.161	0.352	0.29	0.315	0.371 (0.862)	0.47	0.461 (0.847)	0.291	(0.823)
hatexplain (3)	0.239	0.396	0.314	0.376	0.369 (0.765)	0.378	0.389 (0.764)	0.29	(0.729)
biasframes_offensive (2)	0.336	0.571	0.583	0.544	0.601 (0.867)	0.644	0.656 (0.883)	0.541	(0.855)
biasframes_sex (2)	0.263	0.617	0.835	0.741	0.809 (0.922)	0.846	0.815 (0.946)	0.748	(0.905)
biasframes_intent (2)	0.616	0.531	0.635	0.554	0.61 (0.881)	0.696	0.687 (0.891)	0.467	(0.868)
agnews (4)	0.703	0.758	0.745	0.68	0.742 (0.898)	0.819	0.771 (0.898)	0.687	(0.892)
yahootopics (10)	0.299	0.543	0.62	0.578	0.564 (0.722)	0.621	0.613 (0.738)	0.587	(0.711)
trueteacher (2)	0.491	0.469	0.402	0.431	0.479 (0.82)	0.459	0.538 (0.846)	0.471	(0.518)
spam (2)	0.505	0.528	0.504	0.507	0.464 (0.973)	0.74	0.597 (0.983)	0.441	(0.978)
wellformedquery (2)	0.407	0.333	0.333	0.335	0.491 (0.769)	0.334	0.429 (0.815)	0.361	(0.718)
manifesto (56)	0.084	0.102	0.182	0.17	0.187 (0.376)	0.258	0.256 (0.408)	0.147	(0.331)
capsotu (21)	0.34	0.479	0.523	0.502	0.477 (0.664)	0.603	0.502 (0.686)	0.472	(0.644)

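These numbers indicate zero-shot performance, because no data from these datasets was included in the training mix. Note that models without "-c" in the name were evaluated twice: one run without any data from these 28 datasets, to test pure zero-shot performance (the first number in the respective column), and a final run that included up to 500 training data points per class from each dataset (the second number, in brackets, "fewshot"). For bge-m3-zeroshot-v2.0, the table lists only the fewshot number. No model was trained on test data.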
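
When working with the table programmatically, each cell has to be split into its zero-shot and fewshot parts. The sketch below handles the three cell shapes that occur in the table: a plain number, a number followed by a bracketed number, and a bracketed number alone. The function name parse_cell and the regular expression are illustrative assumptions; the sample cells are copied from the "all datasets mean" row.

```python
import re

# Optional zero-shot number, then an optional fewshot number in brackets.
CELL_PATTERN = re.compile(
    r"^\s*(?P<zero>\d*\.?\d+)?\s*(?:\((?P<few>\d*\.?\d+)\))?\s*$"
)


def parse_cell(cell):
    """Return (zeroshot, fewshot) as floats; a missing part is None."""
    match = CELL_PATTERN.match(cell)
    if match is None or (match.group("zero") is None and match.group("few") is None):
        raise ValueError(f"unrecognised table cell: {cell!r}")
    zero = float(match.group("zero")) if match.group("zero") else None
    few = float(match.group("few")) if match.group("few") else None
    return zero, few


print(parse_cell("0.643 (0.834)"))  # deberta-v3-base-zeroshot-v2.0 (fewshot)
print(parse_cell("0.59"))           # bge-m3-zeroshot-v2.0-c
print(parse_cell("(0.803)"))        # bge-m3-zeroshot-v2.0 (fewshot)
```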

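Details on the different datasets are available here: https://github.com/MoritzLaurer/zeroshot-classifier/blob/main/v1_human_data/datasets_overview.csv

📄 License

The base model is released under the MIT license. The licenses of the training data vary by model; see above.

Citation

This model is an extension of the research described in the paper cited below.

If you use this model academically, please cite: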

@misc{laurer_building_2023,
	title = {Building {Efficient} {Universal} {Classifiers} with {Natural} {Language} {Inference}},
	url = {http://arxiv.org/abs/2312.17543},
	doi = {10.48550/arXiv.2312.17543},
	abstract = {Generative Large Language Models (LLMs) have become the mainstream choice for fewshot and zeroshot learning thanks to the universality of text generation. Many users, however, do not need the broad capabilities of generative LLMs when they only want to automate a classification task. Smaller BERT-like models can also learn universal tasks, which allow them to do any text classification task without requiring fine-tuning (zeroshot classification) or to learn new tasks with only a few examples (fewshot), while being significantly more efficient than generative LLMs. This paper (1) explains how Natural Language Inference (NLI) can be used as a universal classification task that follows similar principles as instruction fine-tuning of generative LLMs, (2) provides a step-by-step guide with reusable Jupyter notebooks for building a universal classifier, and (3) shares the resulting universal classifier that is trained on 33 datasets with 389 diverse classes. Parts of the code we share has been used to train our older zeroshot classifiers that have been downloaded more than 55 million times via the Hugging Face Hub as of December 2023. Our new classifier improves zeroshot performance by 9.4\%.},
	urldate = {2024-01-05},
	publisher = {arXiv},
	author = {Laurer, Moritz and van Atteveldt, Wouter and Casas, Andreu and Welbers, Kasper},
	month = dec,
	year = {2023},
	note = {arXiv:2312.17543 [cs]},
	keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language},
}