mt5-large-finetuned-mnli-xtreme-xnli open-source model - zero-shot text classification in 15 languages

Mt5 Large Finetuned Mnli Xtreme Xnli

Developed by alan-turing-institute

Fine-tuned from the multilingual T5 model and designed for zero-shot text classification in 15 languages

Large language model

Transformers

Multilingual support · License: Apache-2.0 · #multilingual zero-shot classification #NLI task optimization #cross-lingual understanding

Downloads: 964

Released: 3/2/2022

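Model overview

The model is fine-tuned on multilingual natural language inference (NLI) datasets and is suited to zero-shot text classification, particularly for languages other than English.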

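Model characteristics

Multilingual support

Supports zero-shot text classification in 15 languages

NLI fine-tuning

Fine-tuned specifically on the MNLI and xtreme_xnli datasets

Text-to-text architecture

Retains the text-generation behavior of T5 and identifies the task through a specific input prefix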

Model capabilities

Multilingual text classification

Zero-shot learning

Natural language inference

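Use cases

Text classification

Multilingual sentiment analysis

Classifies sentiment without language-specific training data

Content moderation

Identifies inappropriate content across languages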

🚀 mt5-large-finetuned-mnli-xtreme-xnli

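This model takes a pretrained large multilingual-t5 (also available from models) and fine-tunes it on English MNLI and the xtreme_xnli training set. It is intended for zero-shot text classification and was inspired by xlm-roberta-large-xnli.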

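🚀 Quick start

This model is intended for zero-shot text classification, especially in languages other than English. It is fine-tuned on English MNLI and on the xtreme_xnli training set, a multilingual NLI dataset. The model can therefore be used with any of the languages in the XNLI corpus: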

Arabic
Bulgarian
Chinese
English
French
German
Greek
Hindi
Russian
Spanish
Swahili
Thai
Turkish
Urdu
Vietnamese

Following the recommendation in xlm-roberta-large-xnli, if you only need English classification, consider an English-only NLI model instead.

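✨ Key features

Fine-tuned from the pretrained multilingual T5 model for multilingual zero-shot text classification.
Remains a text-to-text model after fine-tuning, so the prediction is emitted as text.
Can be used with any of the languages in the XNLI corpus.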
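Because the prediction comes back as text rather than as a class index, it has to be mapped to an NLI class before it can be used. The sketch below shows one way to do that in plain Python. It assumes the mapping implied by the label constants in the usage example ("▁0" for entailment, "▁1" for neutral, "▁2" for contradiction); `decode_nli_label` and `NLI_CLASSES` are hypothetical names introduced here, not part of the model or of any library.

```python
# Mapping assumed from the label constants in the usage example
# (ENTAILS_LABEL = "▁0", NEUTRAL_LABEL = "▁1", CONTRADICTS_LABEL = "▁2").
NLI_CLASSES = {"0": "entailment", "1": "neutral", "2": "contradiction"}

# The "▁" character is the word-boundary marker used by SentencePiece tokens.
SPIECE_UNDERLINE = "\u2581"


def decode_nli_label(token: str) -> str:
    """Turn a generated label token such as "▁0" into a class name (hypothetical helper)."""
    text = token.replace(SPIECE_UNDERLINE, "").strip()
    if text not in NLI_CLASSES:
        raise ValueError(f"unexpected NLI label text: {token!r}")
    return NLI_CLASSES[text]


print([decode_nli_label(t) for t in ["▁0", "▁1", "▁2"]])
# ['entailment', 'neutral', 'contradiction']
```

Raising on unknown text instead of returning a default makes it obvious when the model generates something outside the three expected labels.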

💻 Usage examples

Basic usage

from torch.nn.functional import softmax
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

model_name = "alan-turing-institute/mt5-large-finetuned-mnli-xtreme-xnli"

tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

sequence_to_classify = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."

ENTAILS_LABEL = "▁0"
NEUTRAL_LABEL = "▁1"
CONTRADICTS_LABEL = "▁2"

label_inds = tokenizer.convert_tokens_to_ids(
    [ENTAILS_LABEL, NEUTRAL_LABEL, CONTRADICTS_LABEL])


def process_nli(premise: str, hypothesis: str):
    """ process to required xnli format with task prefix """
    return "".join(['xnli: premise: ', premise, ' hypothesis: ', hypothesis])


# construct sequence of premise, hypothesis pairs
pairs = [(sequence_to_classify, hypothesis_template.format(label)) for label in
        candidate_labels]
# format for mt5 xnli task
seqs = [process_nli(premise=premise, hypothesis=hypothesis) for
        premise, hypothesis in pairs]
print(seqs)
# ['xnli: premise: ¿A quién vas a votar en 2020? hypothesis: Este ejemplo es Europa.',
# 'xnli: premise: ¿A quién vas a votar en 2020? hypothesis: Este ejemplo es salud pública.',
# 'xnli: premise: ¿A quién vas a votar en 2020? hypothesis: Este ejemplo es política.']

inputs = tokenizer(seqs, return_tensors="pt", padding=True)

out = model.generate(**inputs, output_scores=True, return_dict_in_generate=True,
                     num_beams=1)

# sanity check that our sequences are expected length (1 + start token + end token = 3)
for i, seq in enumerate(out.sequences):
    assert len(seq) == 3, (
        f"generated sequence {i} not of expected length, 3."
        f" Actual length: {len(seq)}"
    )

# get the scores for our only token of interest
# we'll now treat these like the output logits of a `*ForSequenceClassification` model
scores = out.scores[0]

# scores has a size of the model's vocab.
# However, for this task we have a fixed set of labels
# sanity check that these labels are always the top 3 scoring
for i, sequence_scores in enumerate(scores):
    top_scores = sequence_scores.argsort()[-3:]
    assert set(top_scores.tolist()) == set(label_inds), \
        f"top scoring tokens are not expected for this task." \
        f" Expected: {label_inds}. Got: {top_scores.tolist()}."

# cut down scores to our task labels
scores = scores[:, label_inds]
print(scores)
# tensor([[-2.5697,  1.0618,  0.2088],
#         [-5.4492, -2.1805, -0.1473],
#         [ 2.2973,  3.7595, -0.1769]])


# new indices of entailment and contradiction in scores
entailment_ind = 0
contradiction_ind = 2

# we can show, per item, the entailment vs contradiction probas
entail_vs_contra_scores = scores[:, [entailment_ind, contradiction_ind]]
entail_vs_contra_probas = softmax(entail_vs_contra_scores, dim=1)
print(entail_vs_contra_probas)
# tensor([[0.0585, 0.9415],
#         [0.0050, 0.9950],
#         [0.9223, 0.0777]])


# or we can show probas similar to `ZeroShotClassificationPipeline`
# this gives a zero-shot classification style output across labels
entail_scores = scores[:, entailment_ind]
entail_probas = softmax(entail_scores, dim=0)
print(entail_probas)
# tensor([7.6341e-03, 4.2873e-04, 9.9194e-01])

print(dict(zip(candidate_labels, entail_probas.tolist())))
# {'Europa': 0.007634134963154793,
# 'salud pública': 0.0004287279152777046,
# 'política': 0.9919371604919434}
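The two probability views at the end of the example differ only in what the softmax is taken over. As a rough check that needs neither the model nor PyTorch, the sketch below recomputes both from the logits printed in the example. The `softmax` helper is a plain-Python stand-in written for this illustration, and the numbers are copied by hand from the printed `scores` tensor.

```python
import math

# Logits copied from the printed `scores` tensor above:
# one row per candidate label, columns = (entailment, neutral, contradiction).
scores = [
    [-2.5697, 1.0618, 0.2088],    # Europa
    [-5.4492, -2.1805, -0.1473],  # salud pública
    [2.2973, 3.7595, -0.1769],    # política
]
candidate_labels = ["Europa", "salud pública", "política"]


def softmax(values):
    """Numerically stable softmax over a list of floats (illustrative helper)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Per-label view: entailment vs. contradiction only, neutral is dropped.
# Each label is judged independently, so the values do not sum to 1 across labels.
entail_vs_contra = [softmax([row[0], row[2]]) for row in scores]

# Cross-label view: softmax over the entailment logits of all labels.
# The labels compete with each other, so the values sum to 1.
entail_probas = softmax([row[0] for row in scores])

for label, pair, proba in zip(candidate_labels, entail_vs_contra, entail_probas):
    print(f"{label}: entail-vs-contra={pair[0]:.4f}, across-labels={proba:.4f}")
```

The per-label view suits cases where several labels may apply at once, while the cross-label view suits picking a single best label.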

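Advanced usage

# Note: the `generate` function of the TF equivalent model does not exactly mirror the PyTorch version, so the code above will not transfer directly.
# The model is currently not compatible with the existing `zero-shot-classification` pipeline.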
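Since the built-in pipeline cannot be used, one option is a thin wrapper that returns results in a similar shape: a dict with `sequence`, `labels` and `scores`, sorted by descending score. The sketch below is only an illustration under stated assumptions. `zero_shot_classify` and `entailment_logits_fn` are hypothetical names; the scoring callable is expected to return one entailment logit per input string (for example by wrapping the `generate` call from the basic usage example), and the default hypothesis template is an assumption. To keep the sketch runnable without downloading the model, a stand-in scorer returns the entailment logits printed in that example.

```python
import math
from typing import Callable, Dict, List, Sequence


def zero_shot_classify(
    sequence: str,
    candidate_labels: Sequence[str],
    entailment_logits_fn: Callable[[List[str]], List[float]],
    hypothesis_template: str = "This example is {}.",
) -> Dict[str, object]:
    """Return a pipeline-style result dict, labels sorted by descending score."""
    # Same input format as `process_nli` in the basic usage example.
    inputs = [
        f"xnli: premise: {sequence} hypothesis: {hypothesis_template.format(label)}"
        for label in candidate_labels
    ]
    logits = entailment_logits_fn(inputs)

    # Softmax over the entailment logits across all candidate labels.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probas = [e / total for e in exps]

    ranked = sorted(zip(candidate_labels, probas), key=lambda kv: kv[1], reverse=True)
    return {
        "sequence": sequence,
        "labels": [label for label, _ in ranked],
        "scores": [score for _, score in ranked],
    }


def fake_entailment_logits(inputs: List[str]) -> List[float]:
    """Stand-in scorer for illustration only; it does not call the model."""
    return [-2.5697, -5.4492, 2.2973]


result = zero_shot_classify(
    "¿A quién vas a votar en 2020?",
    ["Europa", "salud pública", "política"],
    fake_entailment_logits,
    hypothesis_template="Este ejemplo es {}.",
)
print(result["labels"])
# ['política', 'Europa', 'salud pública']
```

Passing the scorer in as a callable keeps the formatting and ranking logic independent of the framework, which also sidesteps the TF versus PyTorch `generate` difference noted above.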

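🔧 Technical details

The model was pretrained on the 101 languages of mC4, as described in the mt5 paper. It was then fine-tuned on the mt5_xnli_translate_train task for 8,000 steps, in a similar manner to that described in the official repository and with guidance from Stephen Mayhew's notebook. Finally, the resulting model was converted to Hugging Face format.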