mt5-large-finetuned-mnli-xtreme-xnli open-source model: zero-shot text classification in 15 languages

Mt5 Large Finetuned Mnli Xtreme Xnli

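Developed by alan-turing-institute

Fine-tuned from a multilingual T5 model and designed for zero-shot text classification; supports 15 languages

Large language model

Transformers

Multilingual | Open-source license: Apache-2.0 | #Multilingual zero-shot classification #NLI task optimization #Cross-lingual understanding

Downloads: 964

Release date: 3/2/2022

Model overview

This model is fine-tuned on a multilingual natural language inference (NLI) dataset and is suited to zero-shot text classification, particularly in languages other than English.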

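Model features

Multilingual support

Supports zero-shot text classification in 15 languages

NLI fine-tuning

Fine-tuned specifically on the MNLI and xtreme_xnli datasets

Text-to-text architecture

Retains the text-generation nature of T5 and identifies the task by a fixed prefix

Model capabilities

Multilingual text classification

Zero-shot learning

Natural language inference

Use cases

Text classification

Multilingual sentiment analysis

Classifies sentiment without language-specific training data

Content moderation

Identifies inappropriate content across languages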

🚀 mt5-large-finetuned-mnli-xtreme-xnli

This model takes a pretrained large multilingual-t5 (also available from models) and fine-tunes it on English MNLI and the xtreme_xnli training set. It is intended for zero-shot text classification and was inspired by [xlm-roberta-large-xnli](https://huggingface.co/joeddav/xlm-roberta-large-xnli).

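🚀 Quick start

The model can be used for zero-shot text classification, especially in languages other than English. It is fine-tuned on English MNLI and on the training set of xtreme_xnli, a multilingual NLI dataset, so it can be used with the following languages of the XNLI corpus: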

Arabic
Bulgarian
Chinese
English
French
German
Greek
Hindi
Russian
Spanish
Swahili
Thai
Turkish
Urdu
Vietnamese

Following the recommendations for [xlm-roberta-large-xnli](https://huggingface.co/joeddav/xlm-roberta-large-xnli), for English-only classification you may want to look at:

[bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli)
[a distilled bart MNLI model](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla)

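Zero-shot example:

The model retains its text-to-text nature after fine-tuning, so the expected outputs are text. During fine-tuning, the model learns to respond to the NLI task with a series of single-token responses that map to entailment, neutral, or contradiction. The NLI task is indicated by the fixed prefix "xnli:".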
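
The single-token convention can be made concrete with a small helper. The sketch below is illustrative only: `decode_nli_label` and `NLI_LABELS` are hypothetical names, not part of the model or of transformers, and the 0/1/2 mapping is assumed from the label constants used in the PyTorch example on this page.

```python
# Hypothetical helper, not part of the model card or of transformers.
# Mapping assumed from the label constants in the PyTorch example:
# "▁0" -> entailment, "▁1" -> neutral, "▁2" -> contradiction.
NLI_LABELS = {"0": "entailment", "1": "neutral", "2": "contradiction"}


def decode_nli_label(token: str) -> str:
    """Map a generated token such as "▁0" or "0" to its NLI class name."""
    # Drop the SentencePiece word-boundary marker (U+2581) and whitespace.
    key = token.replace("\u2581", "").strip()
    if key not in NLI_LABELS:
        raise ValueError(f"unexpected NLI output token: {token!r}")
    return NLI_LABELS[key]


print(decode_nli_label("\u25810"))
print(decode_nli_label("2"))
```

Raising on anything outside the three expected tokens keeps a malformed generation from being silently read as one of the classes.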

Below is an example, using PyTorch, of using the model in a similar way to the zero-shot-classification pipeline. The logits of the first token of the LM output are used to represent confidence.

from torch.nn.functional import softmax
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

model_name = "alan-turing-institute/mt5-large-finetuned-mnli-xtreme-xnli"

tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

sequence_to_classify = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."

ENTAILS_LABEL = "▁0"
NEUTRAL_LABEL = "▁1"
CONTRADICTS_LABEL = "▁2"

label_inds = tokenizer.convert_tokens_to_ids(
    [ENTAILS_LABEL, NEUTRAL_LABEL, CONTRADICTS_LABEL])


def process_nli(premise: str, hypothesis: str):
    """ process to required xnli format with task prefix """
    return "".join(['xnli: premise: ', premise, ' hypothesis: ', hypothesis])


# construct sequence of premise, hypothesis pairs
pairs = [(sequence_to_classify, hypothesis_template.format(label)) for label in
        candidate_labels]
# format for mt5 xnli task
seqs = [process_nli(premise=premise, hypothesis=hypothesis) for
        premise, hypothesis in pairs]
print(seqs)
# ['xnli: premise: ¿A quién vas a votar en 2020? hypothesis: Este ejemplo es Europa.',
# 'xnli: premise: ¿A quién vas a votar en 2020? hypothesis: Este ejemplo es salud pública.',
# 'xnli: premise: ¿A quién vas a votar en 2020? hypothesis: Este ejemplo es política.']

inputs = tokenizer.batch_encode_plus(seqs, return_tensors="pt", padding=True)

out = model.generate(**inputs, output_scores=True, return_dict_in_generate=True,
                     num_beams=1)

# sanity check that our sequences are expected length (1 + start token + end token = 3)
for i, seq in enumerate(out.sequences):
    assert len(seq) == 3, (
        f"generated sequence {i} not of expected length, 3."
        f" Actual length: {len(seq)}")

# get the scores for our only token of interest
# we'll now treat these like the output logits of a `*ForSequenceClassification` model
scores = out.scores[0]

# scores has a size of the model's vocab.
# However, for this task we have a fixed set of labels
# sanity check that these labels are always the top 3 scoring
for i, sequence_scores in enumerate(scores):
    top_scores = sequence_scores.argsort()[-3:]
    assert set(top_scores.tolist()) == set(label_inds), (
        f"top scoring tokens are not expected for this task."
        f" Expected: {label_inds}. Got: {top_scores.tolist()}.")

# cut down scores to our task labels
scores = scores[:, label_inds]
print(scores)
# tensor([[-2.5697,  1.0618,  0.2088],
#         [-5.4492, -2.1805, -0.1473],
#         [ 2.2973,  3.7595, -0.1769]])


# new indices of entailment and contradiction in scores
entailment_ind = 0
contradiction_ind = 2

# we can show, per item, the entailment vs contradiction probas
entail_vs_contra_scores = scores[:, [entailment_ind, contradiction_ind]]
entail_vs_contra_probas = softmax(entail_vs_contra_scores, dim=1)
print(entail_vs_contra_probas)
# tensor([[0.0585, 0.9415],
#         [0.0050, 0.9950],
#         [0.9223, 0.0777]])


# or we can show probas similar to `ZeroShotClassificationPipeline`
# this gives a zero-shot classification style output across labels
entail_scores = scores[:, entailment_ind]
entail_probas = softmax(entail_scores, dim=0)
print(entail_probas)
# tensor([7.6341e-03, 4.2873e-04, 9.9194e-01])

print(dict(zip(candidate_labels, entail_probas.tolist())))
# {'Europa': 0.007634134963154793,
# 'salud pública': 0.0004287279152777046,
# 'política': 0.9919371604919434}

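Unfortunately, the generate function of the equivalent TF model does not exactly mirror the PyTorch version, so the code above does not transfer directly.

The model is currently not compatible with the existing zero-shot-classification pipeline.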
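
Because the pipeline cannot be used directly, it may help to see the two scoring modes from the example reproduced from the three label logits alone. The following is a minimal sketch in plain Python, with no model required: the logits are copied from the printed output of the example, and `softmax`, `single_label`, and `multi_label` are hypothetical helper names introduced here for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the example output; columns are
# (entailment, neutral, contradiction), one row per candidate label.
candidate_labels = ["Europa", "salud pública", "política"]
scores = [
    [-2.5697, 1.0618, 0.2088],
    [-5.4492, -2.1805, -0.1473],
    [2.2973, 3.7595, -0.1769],
]


def single_label(scores, labels):
    """Softmax over the entailment logits across labels (sums to 1)."""
    probs = softmax([row[0] for row in scores])
    return dict(zip(labels, probs))


def multi_label(scores, labels):
    """Entailment-vs-contradiction probability, computed per label."""
    return {
        label: softmax([row[0], row[2]])[0]
        for label, row in zip(labels, scores)
    }


print(single_label(scores, candidate_labels))
print(multi_label(scores, candidate_labels))
```

The first function treats the candidate labels as mutually exclusive, while the second scores each label independently and ignores the neutral logit. Both should agree with the probabilities printed in the example to about three decimal places.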

🔧 Technical details

This model was pre-trained on a set of 101 languages in mC4, as described in the mt5 paper. It was then fine-tuned on the mt5_xnli_translate_train task for 8,000 steps, in a similar manner to that described in the [official repository](https://github.com/google-research/multilingual-t5#fine-tuning), with guidance from [Stephen Mayhew's notebook](https://github.com/mayhewsw/multilingual-t5/blob/master/notebooks/mt5-xnli.ipynb). The resulting model was then converted to Hugging Face format.