🚀 xlm-roberta-large-xnli
This model is designed for zero-shot text classification: it is [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) fine-tuned on NLI data in 15 languages. It can classify text across multiple languages without prior training on the specific labels.
🚀 Quick Start
With the zero-shot classification pipeline
The model can be loaded with the zero-shot-classification pipeline like so:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
```
You can then classify in any of the supported languages. You can even pass the labels in one language and the sequence to classify in another:
```python
# "Who are you voting for in 2020?" (Russian)
sequence_to_classify = "За кого вы голосуете в 2020 году?"
candidate_labels = ["Europe", "public health", "politics"]
classifier(sequence_to_classify, candidate_labels)
```
The default hypothesis template is the English `This text is {}.` If you are working strictly within one language, it may be worthwhile to translate the template into that language:
```python
# "Who are you going to vote for in 2020?" (Spanish)
sequence_to_classify = "¿A quién vas a votar en 2020?"
candidate_labels = ["Europa", "salud pública", "política"]
hypothesis_template = "Este ejemplo es {}."
classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
```
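Under the hood, the pipeline fills the hypothesis template with each candidate label and pairs it with the sequence as an NLI premise. A minimal sketch of that pairing step (the `build_nli_pairs` helper is hypothetical, for illustration only; it is not part of the transformers API):

```python
def build_nli_pairs(sequence, candidate_labels,
                    hypothesis_template="This text is {}."):
    # One (premise, hypothesis) pair per candidate label
    return [(sequence, hypothesis_template.format(label))
            for label in candidate_labels]

pairs = build_nli_pairs(
    "¿A quién vas a votar en 2020?",
    ["Europa", "salud pública", "política"],
    hypothesis_template="Este ejemplo es {}.",
)
print(pairs[2])  # ('¿A quién vas a votar en 2020?', 'Este ejemplo es política.')
```

Each pair is then scored by the NLI model, which is what the manual PyTorch section below does for a single label.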
With manual PyTorch
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
nli_model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

# Pose the sequence as the NLI premise and the label as a hypothesis
premise = sequence
hypothesis = f'This example is {label}.'

x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
logits = nli_model(x.to(device))[0]

# Throw away the "neutral" logit (index 1) and softmax over
# "contradiction" (0) and "entailment" (2); the probability of
# "entailment" is the probability of the label being true
entail_contradiction_logits = logits[:, [0, 2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:, 1]
```
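To see how the last three lines behave without downloading the model, here is a self-contained sketch with dummy NLI logits (the numbers are illustrative; the column order [contradiction, neutral, entailment] matches this model's label order):

```python
import torch

# Dummy logits for three hypotheses; columns: [contradiction, neutral, entailment]
logits = torch.tensor([
    [ 2.0, 0.5, -1.0],   # label contradicted by the premise
    [-1.5, 0.0,  3.0],   # label entailed by the premise
    [ 0.0, 2.0,  0.0],   # mostly neutral
])

# Drop "neutral" and softmax over [contradiction, entailment]
entail_contradiction_logits = logits[:, [0, 2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:, 1]
print(prob_label_is_true)  # low for row 0, high for row 1, 0.5 for row 2
```

Note that the neutral logit is discarded before the softmax, so an example the model finds mostly neutral still yields a probability near 0.5 rather than near 0.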
✨ Features
- Multilingual Support: The model supports the 15 languages of the XNLI corpus, including English, French, and Spanish. Since the base model was pre-trained on 100 different languages, it may also work for other languages.
- Zero-Shot Classification: It can classify text against arbitrary candidate labels without any task-specific training on those labels.
📦 Installation
The model is used through the `transformers` library (e.g. `pip install transformers`); no other installation steps are required.
📚 Documentation
Model Description
This model takes [xlm-roberta-large](https://huggingface.co/xlm-roberta-large) and fine-tunes it on a combination of NLI data in 15 languages. It is intended to be used for zero-shot text classification, such as with the Hugging Face ZeroShotClassificationPipeline.
Intended Usage
This model is intended for zero-shot text classification, especially in languages other than English. It is fine-tuned on XNLI, a multilingual NLI dataset, and can therefore be used with any of the languages in the XNLI corpus:
- English
- French
- Spanish
- German
- Greek
- Bulgarian
- Russian
- Turkish
- Arabic
- Vietnamese
- Thai
- Chinese
- Hindi
- Swahili
- Urdu
Since the base model was pre-trained on 100 different languages, the model has shown some effectiveness in languages beyond those listed above as well. See the full list of pre-trained languages in appendix A of the XLM-RoBERTa paper.
For English-only classification, it is recommended to use [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) or [a distilled bart MNLI model](https://huggingface.co/models?filter=pipeline_tag%3Azero-shot-classification&search=valhalla).
🔧 Technical Details
This model was pre-trained on a set of 100 languages, as described in the original paper. It was then fine-tuned on the NLI task using the concatenated MNLI train set and the XNLI validation and test sets. Finally, it was trained for one additional epoch on XNLI data alone, with the translations of the premise and hypothesis shuffled so that, for each example, the premise and hypothesis come from the same original English example but are in different languages.
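That cross-lingual shuffling step can be sketched as follows. The input structure here (one language → text mapping per side of each original English example) is a hypothetical simplification for illustration, not the actual training code:

```python
import random

def shuffle_language_pairs(examples, seed=0):
    """For each NLI example, draw the premise and hypothesis from two
    different languages, so both still come from the same original
    English example but no longer share a language."""
    rng = random.Random(seed)
    out = []
    for ex in examples:
        langs = sorted(ex["premise"])
        p_lang = rng.choice(langs)
        h_lang = rng.choice([l for l in langs if l != p_lang])
        out.append({"premise": ex["premise"][p_lang],
                    "hypothesis": ex["hypothesis"][h_lang],
                    "label": ex["label"]})
    return out

example = {
    "premise": {"en": "A man is eating.", "fr": "Un homme mange."},
    "hypothesis": {"en": "Someone is eating.", "fr": "Quelqu'un mange."},
    "label": "entailment",
}
shuffled = shuffle_language_pairs([example])
print(shuffled[0])
```

The intent is to force the model to score entailment across languages, rather than relying on premise and hypothesis sharing a language.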
📄 License
This model is released under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | Fine-tuned xlm-roberta-large for zero-shot text classification |
| Training Data | Concatenated MNLI train set, XNLI validation and test sets, and language-shuffled XNLI data |
| Supported Languages | English, French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, Urdu |
| Pipeline Tag | zero-shot-classification |