bge-m3-zeroshot-v2.0-c: An Open-Source Multilingual Zero-Shot Text Classification Model

Bge M3 Zeroshot V2.0 C

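Developed by MoritzLaurer

A multilingual zero-shot text classification model trained on top of BAAI/bge-m3-retromae and designed for commercially friendly use

Text Classification

Transformers

Open-source license: MIT #zero-shot-classification #multilingual #commercially-friendly

Downloads: 67

Released: 4/1/2024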

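Model Overview

The model uses the natural language inference (NLI) task format to perform zero-shot classification, so it can classify multilingual text without any task-specific training data.

Model Features

Commercially friendly training data

Trained only on fully commercially friendly synthetic data and public NLI datasets

Multilingual support

Supports text classification in multiple languages

Long-text processing

Supports a context window of 8192 tokens, which suits longer texts

Zero-shot learning

Performs classification without any training data

Model Capabilities

Multilingual text classification

Zero-shot learning

Natural language inference

Long-text processing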

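Use Cases

Content moderation

Harmful content detection

Identifies toxic, obscene, threatening, and similar content in text

Reaches an F1 score of 0.736 on the Wikipedia toxicity dataset (wikitoxic_toxicaggreg)

Sentiment analysis

Product review classification

Classifies the sentiment polarity of user reviews from platforms such as Yelp

Reaches an F1 score of 0.973 on the Yelp reviews dataset (yelpreviews)

Topic classification

News classification

Assigns news articles to topic categories

Reaches an F1 score of 0.687 on the AG News dataset (agnews)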

🚀 bge-m3-zeroshot-v2.0-c

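bge-m3-zeroshot-v2.0-c is a model designed for zero-shot classification. Used through the Hugging Face pipeline, it classifies text without any training data and runs on both GPUs and CPUs.

🚀 Quick Start

The snippet below shows the basic workflow. It loads deberta-v3-large-zeroshot-v2.0 from the same model series; change the model identifier to the model you want to run: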

#!pip install transformers[sentencepiece]
from transformers import pipeline
text = "Angela Merkel is a politician in Germany and leader of the CDU"
hypothesis_template = "This text is about {}"
classes_verbalized = ["politics", "economy", "entertainment", "environment"]
zeroshot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0")  # change the model identifier here
output = zeroshot_classifier(text, classes_verbalized, hypothesis_template=hypothesis_template, multi_label=False)
print(output)
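
Since the snippet above loads a deberta model, the following minimal sketch shows how the same call might look for the model this page describes. The identifier `MoritzLaurer/bge-m3-zeroshot-v2.0-c` is an assumption built from the author name and the model name, and `classify` is a hypothetical helper, not part of the Transformers API.

```python
from typing import Any, Dict, List

# Assumed Hub identifier for this model (author namespace + model name).
MODEL_ID = "MoritzLaurer/bge-m3-zeroshot-v2.0-c"


def classify(
    text: str,
    classes: List[str],
    hypothesis_template: str = "This text is about {}",
) -> Dict[str, Any]:
    """Classify `text` into one of `classes` with the zero-shot pipeline."""
    # Imported lazily so that defining this helper does not load Transformers.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification", model=MODEL_ID)
    return classifier(
        text,
        classes,
        hypothesis_template=hypothesis_template,
        multi_label=False,
    )
```

Calling `classify("Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU", ["politics", "economy", "entertainment", "environment"])` downloads the model on first use and returns a dictionary whose labels are sorted by score. For repeated calls, it is cheaper to build the pipeline once and reuse it.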

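✨ Key Features

Zero-shot classification: performs classification without any training data.
Hardware support: runs on both GPUs and CPUs.
Commercially friendly: some models in the series are trained only on fully commercially friendly data, which suits users with strict license requirements.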

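📚 Documentation

The zeroshot-v2.0 model series

The models in this series are designed for efficient zero-shot classification with the Hugging Face pipeline. They classify text without any training data and run on both GPUs and CPUs. An overview of the latest zero-shot classifiers is available in the Zeroshot Classifier Collection.

The main update in the zeroshot-v2.0 series is that several models are trained only on fully commercially friendly data, for users with strict license requirements.

These models perform one universal classification task: given a text, determine whether a hypothesis is "true" or "not true" (entailment vs. not_entailment). This task format is based on natural language inference (NLI). The task is universal enough that the Hugging Face pipeline can reformulate any classification task into it.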
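
To make the reformulation concrete, here is a minimal sketch of how a classification problem might be rewritten as NLI pairs. The helper name `build_nli_pairs` is hypothetical and is not part of the Transformers API; the zero-shot pipeline performs an equivalent step internally by formatting the hypothesis template once per candidate label.

```python
from typing import List, Tuple


def build_nli_pairs(
    text: str, labels: List[str], hypothesis_template: str
) -> List[Tuple[str, str]]:
    """Pair the input text (premise) with one hypothesis per candidate label."""
    return [(text, hypothesis_template.format(label)) for label in labels]


pairs = build_nli_pairs(
    "Angela Merkel is a politician in Germany and leader of the CDU",
    ["politics", "economy", "entertainment", "environment"],
    "This text is about {}",
)
for premise, hypothesis in pairs:
    print(hypothesis)
```

The model then scores each pair for entailment, and the label whose hypothesis is most strongly entailed becomes the prediction.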

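Training data

Models with "-c" in the name are trained on two types of fully commercially friendly data:

Synthetic data generated with Mixtral-8x7B-Instruct-v0.1. First, a list of more than 500 diverse text classification tasks for 25 professions was created in conversations with Mistral-large and then manually curated. This list then served as seed data to generate several hundred thousand texts for these tasks with Mixtral-8x7B-Instruct-v0.1. The final dataset is available in the mixtral_written_text_for_tasks_v4 subset of the synthetic_zeroshot_mixtral_v0.1 dataset. Data curation went through multiple iterations and will be improved further.
Two commercially friendly NLI datasets: MNLI and FEVER-NLI. These datasets were added to improve generalization.
Models without "-c" in the name were additionally trained on a broader mix of data, including ANLI, WANLI, LingNLI, and all datasets in the project's dataset list where used_in_v1.1==True.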

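When to use which model

deberta-v3-zeroshot vs. roberta-zeroshot: deberta-v3 performs clearly better than roberta, but it is somewhat slower. roberta is directly compatible with Hugging Face's production inference TEI containers and with flash attention; these containers are a good choice for production use cases. In short: for accuracy, use a deberta-v3 model; if production inference speed is a concern, consider a roberta model (e.g. in a TEI container and HF Inference Endpoints).
Commercial use cases: models with "-c" in the name are guaranteed to be trained only on commercially friendly data. Models without "-c" are trained on more data and perform better, but that data includes datasets with non-commercial licenses. Legal opinions diverge on whether such training data affects the license of the trained model. For users with strict legal requirements, the models with "-c" in the name are recommended.
Multilingual/non-English use cases: use bge-m3-zeroshot-v2.0 or bge-m3-zeroshot-v2.0-c. Note that multilingual models perform worse than English-only models. You can therefore also machine-translate your texts to English with a library such as EasyNMT and then apply any English-only model to the translated data. Machine translation also makes validation easier when your team does not speak all the languages in the data.
Context window: the bge-m3 models can process up to 8192 tokens; the other models can process up to 512 tokens. Note that longer inputs make the model slower and decrease performance, so if you only work with texts of up to 400 words / 1 page, use a deberta model for better performance.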
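
The guidance above can be condensed into a small decision helper. This is only an illustrative sketch: `suggest_model` and its parameters are hypothetical, the returned names are taken from the performance table on this page, and the "large" variants are picked arbitrarily. Because the table lists roberta only in "-c" variants, the speed-oriented branch always returns a "-c" model.

```python
def suggest_model(
    multilingual: bool,
    commercial_only: bool,
    long_input: bool,
    prefer_speed: bool = False,
) -> str:
    """Return a model name following the guidance above (illustrative only)."""
    if multilingual or long_input:
        # The guidance points to bge-m3 for non-English text and for
        # inputs beyond the 512-token limit of the other models.
        return "bge-m3-zeroshot-v2.0-c" if commercial_only else "bge-m3-zeroshot-v2.0"
    if prefer_speed:
        # roberta is compatible with TEI containers and flash attention.
        return "roberta-large-zeroshot-v2.0-c"
    # Default to deberta-v3 when accuracy matters most.
    if commercial_only:
        return "deberta-v3-large-zeroshot-v2.0-c"
    return "deberta-v3-large-zeroshot-v2.0"


print(suggest_model(multilingual=True, commercial_only=True, long_input=False))
```

Treat the result as a starting point; the table further down shows that the ranking of models varies by dataset.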

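Flexible usage and "prompting"

You can formulate your own hypotheses by changing the hypothesis_template of the zero-shot pipeline. Similar to "prompt engineering" for LLMs, you can test different formulations of your hypothesis_template and of the verbalized classes to improve performance.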

from transformers import pipeline
text = "Angela Merkel is a politician in Germany and leader of the CDU"
# formulation 1
hypothesis_template = "This text is about {}"
classes_verbalized = ["politics", "economy", "entertainment", "environment"]
# formulation 2, depending on your use case (these assignments override formulation 1)
hypothesis_template = "The topic of this text is {}"
classes_verbalized = ["political activities", "economic policy", "entertainment or music", "environmental protection"]
# test different formulations
zeroshot_classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0")  # change the model identifier here
output = zeroshot_classifier(text, classes_verbalized, hypothesis_template=hypothesis_template, multi_label=False)
print(output)
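
To compare several formulations side by side rather than overwriting one with the next, a loop along the following lines may help. This is a sketch under stated assumptions: `compare_formulations` is a hypothetical helper, and `fake_classifier` is a stand-in that returns uniform scores so the example runs without downloading a model. The only pipeline behavior relied on is that the output dictionary holds `labels` and `scores` sorted by descending score.

```python
from typing import Any, Callable, Dict, List, Tuple

Formulation = Tuple[str, List[str]]  # (hypothesis_template, verbalized classes)


def compare_formulations(
    classifier: Callable[..., Dict[str, Any]],
    text: str,
    formulations: List[Formulation],
) -> List[Tuple[str, str, float]]:
    """Run each formulation and collect (template, top label, top score)."""
    results = []
    for template, classes in formulations:
        output = classifier(
            text, classes, hypothesis_template=template, multi_label=False
        )
        # Labels and scores are expected in descending score order.
        results.append((template, output["labels"][0], output["scores"][0]))
    return results


def fake_classifier(text, classes, hypothesis_template, multi_label=False):
    """Stand-in for a real pipeline: uniform scores, labels in input order."""
    score = 1.0 / len(classes)
    return {"sequence": text, "labels": list(classes), "scores": [score] * len(classes)}


text = "Angela Merkel is a politician in Germany and leader of the CDU"
formulations = [
    ("This text is about {}", ["politics", "economy", "entertainment", "environment"]),
    (
        "The topic of this text is {}",
        ["political activities", "economic policy", "entertainment or music", "environmental protection"],
    ),
]
results = compare_formulations(fake_classifier, text, formulations)
for template, label, score in results:
    print(f"{template!r}: {label} ({score:.3f})")
```

In practice, pass the real `zeroshot_classifier` in place of `fake_classifier`, and judge the formulations on a handful of labeled examples rather than on a single text.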

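🔧 Technical Details

The models were evaluated on 28 different text classification tasks with the f1_macro metric. The main reference point is facebook/bart-large-mnli.

Property	Details
Model type	Zero-shot classification model
Training data	Models with `-c` in the name are trained on synthetic data (generated with Mixtral-8x7B-Instruct-v0.1) and two commercially friendly NLI datasets (MNLI, FEVER-NLI); models without `-c` in the name were additionally trained on a broader mix of data, including ANLI, WANLI, LingNLI, and all datasets in the project's dataset list where `used_in_v1.1==True`.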
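
For readers unfamiliar with the metric, the sketch below computes a macro-averaged F1 score in plain Python. It assumes that f1_macro means the standard unweighted mean of per-class F1 scores (the definition used by scikit-learn's `f1_score(average="macro")`); the source does not state which implementation was used for the evaluation.

```python
from typing import Hashable, Sequence


def f1_macro(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred), key=str)
    f1_scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
score = f1_macro(y_true, y_pred)
print(round(score, 4))
```

Because every class counts equally regardless of its frequency, a model cannot reach a high macro F1 by predicting only the majority class.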

Model performance metrics

	facebook/bart-large-mnli	roberta-base-zeroshot-v2.0-c	roberta-large-zeroshot-v2.0-c	deberta-v3-base-zeroshot-v2.0-c	deberta-v3-base-zeroshot-v2.0 (fewshot)	deberta-v3-large-zeroshot-v2.0-c	deberta-v3-large-zeroshot-v2.0 (fewshot)	bge-m3-zeroshot-v2.0-c	bge-m3-zeroshot-v2.0 (fewshot)
all datasets mean	0.497	0.587	0.622	0.619	0.643 (0.834)	0.676	0.673 (0.846)	0.59	(0.803)
amazonpolarity (2)	0.937	0.924	0.951	0.937	0.943 (0.961)	0.952	0.956 (0.968)	0.942	(0.951)
imdb (2)	0.892	0.871	0.904	0.893	0.899 (0.936)	0.923	0.918 (0.958)	0.873	(0.917)
appreviews (2)	0.934	0.913	0.937	0.938	0.945 (0.948)	0.943	0.949 (0.962)	0.932	(0.954)
yelpreviews (2)	0.948	0.953	0.977	0.979	0.975 (0.989)	0.988	0.985 (0.994)	0.973	(0.978)
rottentomatoes (2)	0.83	0.802	0.841	0.84	0.86 (0.902)	0.869	0.868 (0.908)	0.813	(0.866)
emotiondair (6)	0.455	0.482	0.486	0.459	0.495 (0.748)	0.499	0.484 (0.688)	0.453	(0.697)
emocontext (4)	0.497	0.555	0.63	0.59	0.592 (0.799)	0.699	0.676 (0.81)	0.61	(0.798)
empathetic (32)	0.371	0.374	0.404	0.378	0.405 (0.53)	0.447	0.478 (0.555)	0.387	(0.455)
financialphrasebank (3)	0.465	0.562	0.455	0.714	0.669 (0.906)	0.691	0.582 (0.913)	0.504	(0.895)
banking77 (72)	0.312	0.124	0.29	0.421	0.446 (0.751)	0.513	0.567 (0.766)	0.387	(0.715)
massive (59)	0.43	0.428	0.543	0.512	0.52 (0.755)	0.526	0.518 (0.789)	0.414	(0.692)
wikitoxic_toxicaggreg (2)	0.547	0.751	0.766	0.751	0.769 (0.904)	0.741	0.787 (0.911)	0.736	(0.9)
wikitoxic_obscene (2)	0.713	0.817	0.854	0.853	0.869 (0.922)	0.883	0.893 (0.933)	0.783	(0.914)
wikitoxic_threat (2)	0.295	0.71	0.817	0.813	0.87 (0.946)	0.827	0.879 (0.952)	0.68	(0.947)
wikitoxic_insult (2)	0.372	0.724	0.798	0.759	0.811 (0.912)	0.77	0.779 (0.924)	0.783	(0.915)
wikitoxic_identityhate (2)	0.473	0.774	0.798	0.774	0.765 (0.938)	0.797	0.806 (0.948)	0.761	(0.931)
hateoffensive (3)	0.161	0.352	0.29	0.315	0.371 (0.862)	0.47	0.461 (0.847)	0.291	(0.823)
hatexplain (3)	0.239	0.396	0.314	0.376	0.369 (0.765)	0.378	0.389 (0.764)	0.29	(0.729)
biasframes_offensive (2)	0.336	0.571	0.583	0.544	0.601 (0.867)	0.644	0.656 (0.883)	0.541	(0.855)
biasframes_sex (2)	0.263	0.617	0.835	0.741	0.809 (0.922)	0.846	0.815 (0.946)	0.748	(0.905)
biasframes_intent (2)	0.616	0.531	0.635	0.554	0.61 (0.881)	0.696	0.687 (0.891)	0.467	(0.868)
agnews (4)	0.703	0.758	0.745	0.68	0.742 (0.898)	0.819	0.771 (0.898)	0.687	(0.892)
yahootopics (10)	0.299	0.543	0.62	0.578	0.564 (0.722)	0.621	0.613 (0.738)	0.587	(0.711)
trueteacher (2)	0.491	0.469	0.402	0.431	0.479 (0.82)	0.459	0.538 (0.846)	0.471	(0.518)
spam (2)	0.505	0.528	0.504	0.507	0.464 (0.973)	0.74	0.597 (0.983)	0.441	(0.978)
wellformedquery (2)	0.407	0.333	0.333	0.335	0.491 (0.769)	0.334	0.429 (0.815)	0.361	(0.718)
manifesto (56)	0.084	0.102	0.182	0.17	0.187 (0.376)	0.258	0.256 (0.408)	0.147	(0.331)
capsotu (21)	0.34	0.479	0.523	0.502	0.477 (0.664)	0.603	0.502 (0.686)	0.472	(0.644)

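These numbers indicate zero-shot performance, as no data from these datasets was included in the training mix. Note that models without "-c" in the name were evaluated twice: once without any data from the 28 datasets, to test pure zero-shot performance (the first number in the respective column), and once including up to 500 training data points per class from each of the 28 datasets (the second number, in parentheses, "fewshot"). No model was trained on test data.

Details on the different datasets are available here: https://github.com/MoritzLaurer/zeroshot-classifier/blob/main/v1_human_data/datasets_overview.csv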
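
When working with the table programmatically, each cell has to be split into its zero-shot and few-shot parts. The following is a minimal sketch of such a parser; `parse_cell` is a hypothetical helper, and the cell formats it accepts ("0.59", "0.673 (0.846)", "(0.803)") are the ones that appear in the table above.

```python
import re
from typing import Optional, Tuple

# Optional plain number, optionally followed by a number in parentheses.
_CELL = re.compile(r"^\s*(?P<zero>\d*\.?\d+)?\s*(?:\((?P<few>\d*\.?\d+)\))?\s*$")


def parse_cell(cell: str) -> Tuple[Optional[float], Optional[float]]:
    """Split a table cell into (zero-shot score, few-shot score)."""
    match = _CELL.match(cell)
    if match is None or not (match.group("zero") or match.group("few")):
        raise ValueError(f"unrecognized cell: {cell!r}")
    zero = float(match.group("zero")) if match.group("zero") else None
    few = float(match.group("few")) if match.group("few") else None
    return zero, few


# "all datasets mean" for deberta-v3-large-zeroshot-v2.0
zero, few = parse_cell("0.673 (0.846)")
print(f"few-shot gain: {few - zero:.3f}")
```

A missing part is returned as `None`, which keeps the zero-shot-only columns and the few-shot-only last column distinguishable.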

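📄 License

The base model is released under the MIT license. The licenses of the training data vary by model; see above.

Citation

This model is an extension of the research described in the paper cited below.

If you use this model academically, please cite: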

@misc{laurer_building_2023,
	title = {Building {Efficient} {Universal} {Classifiers} with {Natural} {Language} {Inference}},
	url = {http://arxiv.org/abs/2312.17543},
	doi = {10.48550/arXiv.2312.17543},
	abstract = {Generative Large Language Models (LLMs) have become the mainstream choice for fewshot and zeroshot learning thanks to the universality of text generation. Many users, however, do not need the broad capabilities of generative LLMs when they only want to automate a classification task. Smaller BERT-like models can also learn universal tasks, which allow them to do any text classification task without requiring fine-tuning (zeroshot classification) or to learn new tasks with only a few examples (fewshot), while being significantly more efficient than generative LLMs. This paper (1) explains how Natural Language Inference (NLI) can be used as a universal classification task that follows similar principles as instruction fine-tuning of generative LLMs, (2) provides a step-by-step guide with reusable Jupyter notebooks for building a universal classifier, and (3) shares the resulting universal classifier that is trained on 33 datasets with 389 diverse classes. Parts of the code we share has been used to train our older zeroshot classifiers that have been downloaded more than 55 million times via the Hugging Face Hub as of December 2023. Our new classifier improves zeroshot performance by 9.4\%.},
	urldate = {2024-01-05},
	publisher = {arXiv},
	author = {Laurer, Moritz and van Atteveldt, Wouter and Casas, Andreu and Welbers, Kasper},
	month = dec,
	year = {2023},
	note = {arXiv:2312.17543 [cs]},
	keywords = {Computer Science - Artificial Intelligence, Computer Science - Computation and Language},
}