## 🚀 XLM-ROBERTA-BASE-XNLI-ES
*This model is designed for zero-shot text classification in hate speech detection and is most effective in Spanish.*
## 🚀 Quick Start
This model can be used with the Zero-Shot Classification pipeline. First, load the model:
```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="morit/spanish_xlm_xnli")
```

After loading the model, you can classify sequences in the supported languages. Specify a sequence, a set of candidate labels, and a matching hypothesis template:
```python
sequence_to_classify = "Creo que Lionel Messi es el mejor futbolista del mundo."
candidate_labels = ["política", "fútbol"]
hypothesis_template = "Este ejemplo es {}"

classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
```
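Under the hood, the zero-shot pipeline plugs each candidate label into the hypothesis template and scores the resulting premise–hypothesis pair with the NLI model. A minimal sketch of that expansion step (pure Python, no model call):

```python
# Sketch: how the zero-shot pipeline expands candidate labels into
# NLI hypotheses before scoring them against the input sequence.
sequence = "Creo que Lionel Messi es el mejor futbolista del mundo."
candidate_labels = ["política", "fútbol"]
hypothesis_template = "Este ejemplo es {}"

# One (premise, hypothesis) pair per candidate label; the NLI model
# then scores how strongly the premise entails each hypothesis.
pairs = [(sequence, hypothesis_template.format(label)) for label in candidate_labels]
for premise, hypothesis in pairs:
    print(hypothesis)
```

The label whose hypothesis gets the highest entailment score becomes the predicted class.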
## ✨ Features
- This model takes the XLM-RoBERTa-base model and continues pre-training on a large multilingual Twitter corpus.
- It was developed following a strategy similar to the Tweet Eval framework.
- The model is further fine-tuned on the Spanish part of the XNLI training dataset, mainly for Zero-Shot Text Classification in Hate Speech Detection, with a focus on the Spanish language.
- Since the base model was pre-trained on 100 different languages, it also shows some effectiveness in other languages.
## 📚 Documentation
### Model description
This model takes the XLM-RoBERTa-base model, which has been further pre-trained on a large multilingual Twitter corpus. It was developed following a strategy similar to the one introduced as part of the Tweet Eval framework. The model is further fine-tuned on the Spanish part of the XNLI training dataset.
### Intended Usage
This model was developed for Zero-Shot Text Classification in the realm of Hate Speech Detection. It is focused on Spanish, as it was fine-tuned on data in that language. Since the base model was pre-trained on 100 different languages, it has shown some effectiveness in other languages as well. Please refer to the list of languages in the XLM-RoBERTa paper.
### Training
This model was pre-trained on a set of 100 languages and further trained on 198M multilingual tweets, as described in the original paper. It was then fine-tuned on the training set of the XNLI dataset in Spanish, a machine-translated version of the MNLI dataset. It was trained for 5 epochs on the XNLI train set and evaluated on the XNLI eval set at the end of every epoch to find the best-performing model; the checkpoint with the highest eval accuracy was chosen.

- learning rate: 2e-5
- batch size: 32
- max sequence length: 128
It was trained using a GPU (NVIDIA GeForce RTX 3090), resulting in a training time of 1h 47 mins.
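The epoch-level model selection described above (evaluate after every epoch, keep the checkpoint with the highest eval accuracy) can be sketched as follows. The accuracy values here are made-up placeholders, not the actual training log:

```python
# Illustrative sketch of best-checkpoint selection across 5 epochs.
# The per-epoch eval accuracies below are placeholders, not real results.
eval_accuracies = {1: 0.771, 2: 0.785, 3: 0.790, 4: 0.788, 5: 0.786}

# Keep the epoch whose checkpoint scored highest on the eval set.
best_epoch = max(eval_accuracies, key=eval_accuracies.get)
best_accuracy = eval_accuracies[best_epoch]
print(f"best checkpoint: epoch {best_epoch} (accuracy {best_accuracy:.3f})")
```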
### Evaluation
The best - performing model was evaluated on the XNLI test set to get a comparable result.
- predict accuracy: 79.20%
## 📄 License
This project is licensed under the MIT license.
## 💻 Usage Examples
### Basic Usage

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="morit/spanish_xlm_xnli")
```
### Advanced Usage

```python
sequence_to_classify = "Creo que Lionel Messi es el mejor futbolista del mundo."
candidate_labels = ["política", "fútbol"]
hypothesis_template = "Este ejemplo es {}"

classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)
```
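The pipeline returns a dict with the original sequence, the candidate labels sorted by descending score, and the matching scores. A sketch of reading the top prediction from such a result (the scores here are illustrative, not actual model output):

```python
# Illustrative result in the shape returned by the zero-shot pipeline;
# the scores are invented for this example, not real model output.
result = {
    "sequence": "Creo que Lionel Messi es el mejor futbolista del mundo.",
    "labels": ["fútbol", "política"],  # sorted by descending score
    "scores": [0.97, 0.03],
}

# Because labels are sorted by score, the top prediction is index 0.
top_label, top_score = result["labels"][0], result["scores"][0]
print(f"{top_label}: {top_score:.2f}")
```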
| Property | Details |
|----------|---------|
| Model Type | XLM-ROBERTA-BASE-XNLI-ES |
| Training Data | Pre-trained on 100 languages, further trained on 198M multilingual tweets and the Spanish part of the XNLI training dataset |
| Metrics | accuracy |
| Pipeline Tag | zero-shot-classification |
| Datasets | xnli |
| Language | Spanish |
| License | MIT |