🚀 XLM-RoBERTa Formality Classifier
This is a multilingual text formality classifier based on XLM-RoBERTa. It is trained on a multilingual formality classification dataset and labels text as formal or informal across multiple languages.
✨ Features
- Multilingual Support: Covers multiple languages, including English, French, Italian, and Portuguese.
- High-Performance Classification: Trained on a large-scale multilingual dataset; accuracy is 0.79 across all evaluated languages and 0.85 on English (see Results).
📦 Installation
This model requires the transformers library, which you can install with pip install transformers. The usage example below also requests PyTorch tensors (return_tensors="pt"), so PyTorch (pip install torch) must be installed as well.
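Before loading the model, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical, and the assumption that torch is needed comes from the usage example requesting PyTorch tensors.

```python
from importlib.util import find_spec


def missing_packages(required=("transformers", "torch")):
    """Return the names in `required` that cannot be found for import."""
    return [name for name in required if find_spec(name) is None]


missing = missing_packages()
if missing:
    # Suggest an install command for whatever is absent.
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```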
💻 Usage Examples
Basic Usage
import torch
from transformers import XLMRobertaTokenizerFast, XLMRobertaForSequenceClassification

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = XLMRobertaTokenizerFast.from_pretrained('s-nlp/xlmr_formality_classifier')
model = XLMRobertaForSequenceClassification.from_pretrained('s-nlp/xlmr_formality_classifier')

# Map class indices to human-readable labels.
id2formality = {0: "formal", 1: "informal"}

texts = [
    "I like you. I love you",
    "Hey, what's up?",
    "Siema, co porabiasz?",
    "I feel deep regret and sadness about the situation in international politics.",
]

# Tokenize the whole batch into padded PyTorch tensors.
encoding = tokenizer(
    texts,
    add_special_tokens=True,
    return_token_type_ids=True,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    output = model(**encoding)

# Softmax turns the logits into per-class probabilities for each text.
formality_scores = [
    {id2formality[idx]: score for idx, score in enumerate(text_scores.tolist())}
    for text_scores in output.logits.softmax(dim=1)
]
print(formality_scores)
Output Example
[{'formal': 0.993225634098053, 'informal': 0.006774314679205418},
{'formal': 0.8807966113090515, 'informal': 0.1192033663392067},
{'formal': 0.936184287071228, 'informal': 0.06381577253341675},
{'formal': 0.9986615180969238, 'informal': 0.0013385231141000986}]
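The scores in the output example are per-class probabilities, so a label still has to be chosen from them. The following is a minimal sketch that assumes a simple cut-off on the informal probability. The names pick_label and informal_threshold are hypothetical and not part of the model's API, and the scores are rounded copies of the output example.

```python
# Scores copied (rounded) from the output example above.
example_scores = [
    {"formal": 0.9932, "informal": 0.0068},
    {"formal": 0.8808, "informal": 0.1192},
    {"formal": 0.9362, "informal": 0.0638},
    {"formal": 0.9987, "informal": 0.0013},
]


def pick_label(scores, informal_threshold=0.5):
    """Return "informal" if its probability reaches the threshold, else "formal".

    With two classes, a threshold of 0.5 matches picking the higher-scoring
    class, except that an exact tie goes to "informal".
    """
    return "informal" if scores["informal"] >= informal_threshold else "formal"


default_labels = [pick_label(s) for s in example_scores]
# A lower cut-off flags more texts as informal, at the cost of more false alarms.
strict_labels = [pick_label(s, informal_threshold=0.1) for s in example_scores]

print(default_labels)
print(strict_labels)
```

The Results tables report lower recall for class 1 than for class 0, so a lower cut-off may be worth trying when missing informal text is costly. The value 0.1 is only illustrative and would need tuning on held-out data.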
📚 Documentation
Model Overview
This is the model presented in the paper "Detecting Text Formality: A Study of Text Classification Approaches". It is an XLM-RoBERTa-based classifier trained on XFORMAL, a multilingual formality classification dataset.
Results
All Languages
| Property | Details |
|---|---|
| Model Type | XLM-RoBERTa-based classifier |
| Training Data | XFORMAL |
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 (formal) | 0.744912 | 0.927790 | 0.826354 | 108019 |
| 1 (informal) | 0.889088 | 0.645630 | 0.748048 | 96845 |
| accuracy | | | 0.794405 | 204864 |
| macro avg | 0.817000 | 0.786710 | 0.787201 | 204864 |
| weighted avg | 0.813068 | 0.794405 | 0.789337 | 204864 |
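The averaged rows of the All Languages table follow from its two per-class rows. The sketch below recomputes them under the usual definitions: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over classes, and the weighted average is the support-weighted mean. The function and variable names are illustrative. Because the inputs are rounded to six decimals, the results agree with the table only up to rounding.

```python
# Per-class rows copied from the All Languages table:
# class index -> (precision, recall, f1-score, support).
rows = {
    0: (0.744912, 0.927790, 0.826354, 108019),
    1: (0.889088, 0.645630, 0.748048, 96845),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def weighted_avg(values, supports):
    """Mean over classes, weighted by the number of examples per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


precisions = [r[0] for r in rows.values()]
recalls = [r[1] for r in rows.values()]
f1_scores = [r[2] for r in rows.values()]
supports = [r[3] for r in rows.values()]

print(f"f1 (class 0):       {f1(*rows[0][:2]):.6f}")
print(f"macro precision:    {macro_avg(precisions):.6f}")
print(f"macro f1:           {macro_avg(f1_scores):.6f}")
print(f"weighted precision: {weighted_avg(precisions, supports):.6f}")
# Support-weighted recall is the same quantity as overall accuracy.
print(f"weighted recall:    {weighted_avg(recalls, supports):.6f}")
```

The same functions apply to the per-language tables by swapping in their per-class rows.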
English (EN)
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 (formal) | 0.800053 | 0.962981 | 0.873988 | 22151 |
| 1 (informal) | 0.945106 | 0.725899 | 0.821124 | 19449 |
| accuracy | | | 0.852139 | 41600 |
| macro avg | 0.872579 | 0.844440 | 0.847556 | 41600 |
| weighted avg | 0.867869 | 0.852139 | 0.849273 | 41600 |
French (FR)
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 (formal) | 0.746709 | 0.925738 | 0.826641 | 21505 |
| 1 (informal) | 0.887305 | 0.650592 | 0.750731 | 19327 |
| accuracy | | | 0.795504 | 40832 |
| macro avg | 0.817007 | 0.788165 | 0.788686 | 40832 |
| weighted avg | 0.813257 | 0.795504 | 0.790711 | 40832 |
Italian (IT)
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 (formal) | 0.721282 | 0.914669 | 0.806545 | 21528 |
| 1 (informal) | 0.864887 | 0.607135 | 0.713445 | 19368 |
| accuracy | | | 0.769024 | 40896 |
| macro avg | 0.793084 | 0.760902 | 0.759995 | 40896 |
| weighted avg | 0.789292 | 0.769024 | 0.762454 | 40896 |
Portuguese (PT)
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 (formal) | 0.717546 | 0.908167 | 0.801681 | 21637 |
| 1 (informal) | 0.853628 | 0.599700 | 0.704481 | 19323 |
| accuracy | | | 0.762646 | 40960 |
| macro avg | 0.785587 | 0.753933 | 0.753081 | 40960 |
| weighted avg | 0.781743 | 0.762646 | 0.755826 | 40960 |
📄 License
This model is licensed under the OpenRAIL++ License, which supports the development of technologies, both industrial and academic, that serve the public good.
📖 Citation
@inproceedings{dementieva-etal-2023-detecting,
    title = "Detecting Text Formality: A Study of Text Classification Approaches",
    author = "Dementieva, Daryna and
      Babakov, Nikolay and
      Panchenko, Alexander",
    editor = "Mitkov, Ruslan and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.31",
    pages = "274--284",
    abstract = "Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Before, two large-scale datasets were introduced for multiple languages featuring formality annotation{---}GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, the detection of text formality on its own may also be a useful application. This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments {--} monolingual, multilingual, and cross-lingual. The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task, while Transformer-based classifiers are more stable to cross-lingual knowledge transfer.",
}