mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 open-source model - zero-shot classification and inference in 100 languages

Developed by MoritzLaurer

A multilingual natural language inference (NLI) model based on mDeBERTa-v3-base that supports zero-shot classification in 100 languages

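Category: Large language model

Library: Transformers

Multilingual · License: MIT · #multilingual-zero-shot-classification #natural-language-inference #cross-lingual-transfer

Downloads: 186.62k

Released: 8/22/2022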

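Model overview

The model was fine-tuned on 2.7 million NLI text pairs in 27 languages and is suited to multilingual text classification and natural language inference.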

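Model features

Multilingual support

Handles text classification and natural language inference in the 100 languages covered by pretraining; NLI fine-tuning covered 27 of them

Zero-shot classification

Classifies text into new categories without further fine-tuning

Large-scale training data

Trained on more than 2.7 million multilingual NLI text pairs

Cross-lingual transfer

Also performs well on languages that were not seen during NLI training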

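Model capabilities

Multilingual text classification

Natural language inference

Zero-shot classification

Cross-lingual text understanding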

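Use cases

Text classification

News classification

Automatically sort news into categories such as politics, economy and entertainment

Average accuracy of about 80% on the XNLI test set (an NLI benchmark, not a news classification benchmark)

Content moderation

Identify sensitive topics in multilingual content

Natural language understanding

Semantic relation judgment

Decide whether two sentences stand in an entailment, neutral or contradiction relation

Accuracy of about 86% on the MultiNLI test sets (0.857 matched, 0.856 mismatched)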

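🚀 The mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 model

mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 is a multilingual model that can perform natural language inference (NLI) in 100 languages and is therefore also suitable for multilingual zero-shot classification. It was obtained by fine-tuning the pretrained mDeBERTa-v3-base model on multilingual NLI datasets.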

🚀 Quick start

Simple zero-shot classification pipeline

from transformers import pipeline

# Load the zero-shot pipeline with the model described on this page
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7")
sequence_to_classify = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
candidate_labels = ["politics", "economy", "entertainment", "environment"]
# multi_label=False normalises the scores so that they sum to 1 across labels
output = classifier(sequence_to_classify, candidate_labels, multi_label=False)
print(output)
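
For intuition, zero-shot classification with an NLI model can be thought of as one NLI pass per candidate label: each label is inserted into a hypothesis template, and the entailment scores are then normalised across labels. The sketch below illustrates only that bookkeeping in plain Python. The template string is assumed to match the default of the Transformers pipeline, and `fake_entailment_logit` with its numbers is a made-up stand-in for a real model call, so the scores are illustrative only.

```python
import math

HYPOTHESIS_TEMPLATE = "This example is {}."  # assumed default template


def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Turn each candidate label into an NLI hypothesis."""
    return [template.format(label) for label in labels]


def fake_entailment_logit(premise, hypothesis):
    """Hypothetical stand-in for the model: returns a made-up entailment logit."""
    toy_logits = {"politics": 4.0, "economy": 0.5,
                  "entertainment": -2.0, "environment": -1.0}
    for label, logit in toy_logits.items():
        if label in hypothesis:
            return logit
    return 0.0


def zero_shot_scores(premise, labels, score_fn=fake_entailment_logit):
    """Softmax over the per-label entailment logits (single-label setting)."""
    logits = [score_fn(premise, h) for h in build_hypotheses(labels)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


premise = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
labels = ["politics", "economy", "entertainment", "environment"]
scores = zero_shot_scores(premise, labels)
print(max(scores, key=scores.get))
```

Because the hypothesis can be written in a different language from the text, English labels can be used with the German input here.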

NLI use case

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The model must live on the same device as its inputs
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
model.eval()

premise = "Angela Merkel ist eine Politikerin in Deutschland und Vorsitzende der CDU"
hypothesis = "Emmanuel Macron is the President of France"

# Encode the premise-hypothesis pair and move all tensors to the device
inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt").to(device)
with torch.no_grad():  # inference only, no gradients needed
    output = model(**inputs)
# Convert the logits into percentages per NLI class
prediction = torch.softmax(output["logits"][0], -1).tolist()
label_names = ["entailment", "neutral", "contradiction"]
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(prediction, label_names)}
print(prediction)

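✨ Key features

Multilingual support: the model can perform natural language inference (NLI) in 100 languages and is suitable for multilingual zero-shot classification.
High performance: as of December 2021, mDeBERTa-v3-base was the best-performing multilingual base-sized Transformer model introduced by Microsoft.
Rich training data: the model was fine-tuned on the XNLI dataset and the multilingual-nli-26lang-2mil7 dataset, which together contain more than 2.7 million hypothesis-premise pairs in 27 languages.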

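📚 Detailed documentation

Model description

This multilingual model can perform natural language inference (NLI) in 100 languages and is therefore also suitable for multilingual zero-shot classification. The underlying mDeBERTa-v3-base model was pretrained by Microsoft on the CC100 multilingual dataset, which covers 100 languages. The model was then fine-tuned on the XNLI dataset and the multilingual-nli-26lang-2mil7 dataset. Together, the two datasets contain more than 2.7 million hypothesis-premise pairs in 27 languages spoken by more than 4 billion people.

As of December 2021, mDeBERTa-v3-base was the best-performing multilingual base-sized Transformer model introduced by Microsoft in the DeBERTa-V3 paper.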

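Training data

The model was trained on the multilingual-nli-26lang-2mil7 dataset and the XNLI validation dataset.

The multilingual-nli-26lang-2mil7 dataset contains 2.73 million NLI hypothesis-premise pairs in 26 languages spoken by more than 4 billion people, with 105,000 text pairs per language. It is based on the English datasets MultiNLI, Fever-NLI, ANLI, LingNLI and WANLI and was created using the latest open-source machine translation models. The languages in the dataset are: ['ar', 'bn', 'de', 'es', 'fa', 'fr', 'he', 'hi', 'id', 'it', 'ja', 'ko', 'mr', 'nl', 'pl', 'ps', 'pt', 'ru', 'sv', 'sw', 'ta', 'tr', 'uk', 'ur', 'vi', 'zh'] (see ISO language codes). For more details, see the datasheet. In addition, 105,000 text pairs were added for English, sampled in the same way as for the other languages, bringing the total to 27 languages.

Moreover, for each language a random set of 10% of the hypothesis-premise pairs was added in which an English hypothesis is paired with the premise in the other language (and likewise an English premise with the other-language hypothesis). This mixing of languages within a text pair should let users formulate a hypothesis in English for a target text in another language.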
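
The source does not say how the mixed-language pairs were sampled. As a rough sketch under that caveat, one could start from parallel pairs (the same premise and hypothesis in English and in a target language), draw about 10% of them at random, and swap one side into English. The function name `make_mixed_pairs`, the dictionary keys, and the placeholder sentences are all hypothetical.

```python
import random


def make_mixed_pairs(parallel_pairs, fraction=0.10, seed=0):
    """Sample a fraction of parallel pairs and mix English with the target language.

    Each item is a dict with the keys 'premise_en', 'hypothesis_en',
    'premise_xx', 'hypothesis_xx' and 'label' (a hypothetical schema).
    """
    rng = random.Random(seed)
    n_mixed = round(len(parallel_pairs) * fraction)
    mixed = []
    for pair in rng.sample(parallel_pairs, n_mixed):
        if rng.random() < 0.5:
            # Target-language premise with an English hypothesis
            premise, hypothesis = pair["premise_xx"], pair["hypothesis_en"]
        else:
            # English premise with a target-language hypothesis
            premise, hypothesis = pair["premise_en"], pair["hypothesis_xx"]
        mixed.append({"premise": premise, "hypothesis": hypothesis,
                      "label": pair["label"]})
    return mixed


# Toy data: 20 placeholder pairs standing in for real translated text.
toy_pairs = [
    {
        "premise_en": f"english premise {i}",
        "hypothesis_en": f"english hypothesis {i}",
        "premise_xx": f"deutsche Praemisse {i}",
        "hypothesis_xx": f"deutsche Hypothese {i}",
        "label": i % 3,
    }
    for i in range(20)
]
mixed_pairs = make_mixed_pairs(toy_pairs)
print(len(mixed_pairs))
```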

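The XNLI validation set consists of 2,490 texts professionally translated from English into 14 other languages (37,350 texts in total; see the XNLI paper). Note that XNLI also contains a training set made up of 14 machine-translated versions of the MultiNLI dataset, one per language, but this data was excluded because of quality issues with machine translation in 2018.

Note that, for evaluation purposes, three languages were excluded from the XNLI training data and included only in the test data: ["bg","el","th"]. This was done to test the model's performance on languages it has not seen during NLI fine-tuning, only during pretraining on 100 languages; see the evaluation metrics below.

The total training dataset contains 3,287,280 hypothesis-premise pairs.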
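
As a quick sanity check, the per-dataset figures quoted in this section follow from simple multiplication. The variable names are ours, and the source does not break down how the total of 3,287,280 pairs is composed, so the sketch does not attempt to reconstruct that number.

```python
PAIRS_PER_LANGUAGE = 105_000    # stated in the text
TRANSLATED_LANGUAGES = 26       # languages in multilingual-nli-26lang-2mil7
XNLI_VALIDATION_TEXTS = 2_490   # texts per language
XNLI_LANGUAGES = 15             # English plus 14 translations

translated_pairs = PAIRS_PER_LANGUAGE * TRANSLATED_LANGUAGES
with_english = translated_pairs + PAIRS_PER_LANGUAGE
xnli_validation_total = XNLI_VALIDATION_TEXTS * XNLI_LANGUAGES

print(translated_pairs)       # 2730000, the "2.73 million" figure
print(with_english)           # 2835000 across 27 languages
print(xnli_validation_total)  # 37350
```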

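Training procedure

mDeBERTa-v3-base-xnli-multilingual-nli-2mil7 was trained with the Hugging Face Trainer using the following hyperparameters:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # hypothetical path; not given in the source
    num_train_epochs=3,              # total number of training epochs
    learning_rate=2e-05,
    per_device_train_batch_size=32,  # batch size per device during training
    gradient_accumulation_steps=2,   # doubles the effective batch size
    warmup_ratio=0.06,               # fraction of training steps used for learning-rate warmup
    weight_decay=0.01,               # strength of weight decay
    fp16=False                       # mDeBERTa does not support FP16 (see the debugging notes)
)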
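
To make these hyperparameters more concrete, the sketch below estimates the effective batch size and the number of optimizer and warmup steps they imply. It assumes a single device and the total of 3,287,280 training pairs stated in the training data section; the device count is not given in the source, and the Trainer's exact rounding may differ slightly from this approximation.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio, num_devices=1):
    """Approximate the optimizer-step counts implied by the hyperparameters."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# num_devices=1 is an assumption; the source does not state the hardware setup.
schedule = training_schedule(
    num_examples=3_287_280,
    per_device_batch_size=32,
    grad_accum_steps=2,
    num_epochs=3,
    warmup_ratio=0.06,
)
print(schedule)
```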

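Evaluation results

The model was evaluated on the XNLI test set in 15 languages (5,010 texts per language, 75,150 in total) and on the English test sets of MultiNLI, Fever-NLI, ANLI, LingNLI and WANLI. Note that multilingual NLI models can classify NLI texts without having received NLI training data in the specific language (cross-lingual transfer). This means the model can also perform NLI in the other 73 languages mDeBERTa was pretrained on, but performance is most likely lower than for the 27 languages seen during NLI fine-tuning. The results for ["bg","el","th"] in the table below are a good indicator of this cross-lingual transfer, since these languages were not included in the NLI fine-tuning data and were only seen during pretraining on 100 languages.

XNLI subset	Arabic (ar)	Bulgarian (bg)	German (de)	Greek (el)	English (en)	Spanish (es)	French (fr)	Hindi (hi)	Russian (ru)	Swahili (sw)	Thai (th)	Turkish (tr)	Urdu (ur)	Vietnamese (vi)	Chinese (zh)
Accuracy	0.794	0.822	0.824	0.809	0.871	0.832	0.823	0.769	0.803	0.746	0.786	0.792	0.744	0.793	0.803
Speed (texts/sec, A100 GPU)	1344.0	1355.0	1472.0	1149.0	1697.0	1446.0	1278.0	1115.0	1380.0	1463.0	1713.0	1594.0	1189.0	877.0	1887.0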
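
One way to read the XNLI table is to compare the three languages held out of NLI fine-tuning (bg, el, th) with the rest. The sketch below recomputes the averages from the accuracy row; the variable names are ours.

```python
from statistics import mean

# Accuracy per language, copied from the XNLI table
xnli_accuracy = {
    "ar": 0.794, "bg": 0.822, "de": 0.824, "el": 0.809, "en": 0.871,
    "es": 0.832, "fr": 0.823, "hi": 0.769, "ru": 0.803, "sw": 0.746,
    "th": 0.786, "tr": 0.792, "ur": 0.744, "vi": 0.793, "zh": 0.803,
}
held_out = {"bg", "el", "th"}  # not seen during NLI fine-tuning

overall_avg = mean(xnli_accuracy.values())
held_out_avg = mean(acc for lang, acc in xnli_accuracy.items() if lang in held_out)
seen_avg = mean(acc for lang, acc in xnli_accuracy.items() if lang not in held_out)

print(f"overall:  {overall_avg:.3f}")
print(f"held out: {held_out_avg:.3f}")
print(f"seen:     {seen_avg:.3f}")
```

The two groups contain different languages, so the comparison says little about any single language; it only shows that the held-out languages are not far from the overall average of roughly 80%.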

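English dataset	MultiNLI test, matched (mnli_test_m)	MultiNLI test, mismatched (mnli_test_mm)	ANLI test (anli_test)	ANLI test, round 3 (anli_test_r3)	Fever-NLI test (fever_test)	LingNLI test (ling_test)	WANLI test (wanli_test)
Accuracy	0.857	0.856	0.537	0.497	0.761	0.788	0.732
Speed (texts/sec, A100 GPU)	1000.0	1009.0	794.0	672.0	374.0	1177.0	1468.0

Also note that if other multilingual models on the model hub claim performance of around 90% on languages other than English, the authors have most likely made a mistake during testing, since none of the latest papers shows a multilingual average on XNLI more than a few points above 80% (the original model card links two such papers).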

📄 License

This project is released under the MIT license.

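⚠️ Limitations and bias

Please consult the original DeBERTa-V3 paper and the literature on the different NLI datasets for potential biases. Note also that the multilingual-nli-26lang-2mil7 dataset was created with machine translation, which reduces data quality for a task as complex as NLI. You can inspect the data for the languages you are interested in with the Hugging Face dataset viewer. Grammatical errors introduced by machine translation are less of a problem for zero-shot classification, where grammar matters less.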

📖 Citation

If the dataset is useful to you, please cite the following article:

@article{laurer_less_2022,
	title = {Less {Annotating}, {More} {Classifying} – {Addressing} the {Data} {Scarcity} {Issue} of {Supervised} {Machine} {Learning} with {Deep} {Transfer} {Learning} and {BERT} - {NLI}},
	url = {https://osf.io/74b8k},
	language = {en-us},
	urldate = {2022-07-28},
	journal = {Preprint},
	author = {Laurer, Moritz and Atteveldt, Wouter van and Casas, Andreu Salleras and Welbers, Kasper},
	month = jun,
	year = {2022},
	note = {Publisher: Open Science Framework},
}

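💬 Ideas for cooperation or questions

For updates on new models and datasets, follow the author on Twitter. If you have questions or ideas for cooperation, contact the author at m{dot}laurer{at}vu{dot}nl or on LinkedIn.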

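🐞 Debugging and issues

Note that DeBERTa-v3 was released in late 2021 and older versions of HF Transformers seem to have issues running the model (for example, tokenizer problems). Using Transformers 4.13 or later might solve some of these issues. Note that mDeBERTa currently does not support FP16; see https://github.com/microsoft/DeBERTa/issues/77.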
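
A small guard like the following can catch an outdated installation early. It is only a sketch: the helper names are ours, the version parsing is deliberately simple (it reads just the leading major and minor numbers), and the 4.13 threshold is the one suggested in the note above.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = (4, 13)  # threshold suggested in the text


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '4.13.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(version, minimum=MIN_TRANSFORMERS):
    """True if the version is at least the minimum (major, minor)."""
    return parse_major_minor(version) >= minimum


def installed_transformers_ok():
    """Return True or False, or None if Transformers is not installed."""
    try:
        return is_supported(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return None


print(installed_transformers_ok())
```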