🚀 mStyleDistance Model
A multilingual style embedding model that embeds texts based on writing styles, regardless of content and language.
🚀 Quick Start
This repository contains the model introduced in mStyleDistance: Multilingual Style Embeddings and their Evaluation.
mStyleDistance is a multilingual style embedding model that aims to embed texts with similar writing styles close together and texts with different styles far apart, regardless of content and language. It is useful for stylistic analysis of multilingual text, clustering, authorship identification and verification, and automatic style transfer evaluation. The model is available on the Hugging Face Hub as StyleDistance/mstyledistance.
This model is a multilingual version of the English-only StyleDistance model.
✨ Features
- Multilingual Support: Operates on text in multiple languages.
- Content-Independence: Focuses on writing styles rather than content.
- Useful for Multiple Tasks: Applicable in stylistic analysis, clustering, and more.
📦 Installation
The model can be used with the sentence-transformers library. You can install it using the following command:

```bash
pip install sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('StyleDistance/mstyledistance')  # Load model

# Encode a query sentence and two comparison sentences, then compare their style embeddings.
query = model.encode("ÉL TIENE PROBLEMAS PARA LOGRAR LA TEMPERATURA ADECUADA PARA COCINAR LA GALLINA CORNISH.")
others = model.encode([
    "TOCARÁS LA GUITARRA CON TU AMIGO; SERÁ UNA EXCELENTE OPORTUNIDAD PARA MEJORAR TUS HABILIDADES MUSICALES.",
    "Él tiene problemas para lograr la temperatura adecuada para cocinar la gallina Cornish.",
])
print(cos_sim(query, others))
```
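In this example, the query and the first comparison sentence share an all-caps style but differ in content, while the second comparison shares the query's content but not its casing; a content-independent style embedding should therefore score the first comparison higher.

Because the embeddings capture style rather than content, they can also feed the clustering use case mentioned above. The following is a minimal sketch, assuming scikit-learn ≥ 1.2 is installed; the example texts and the cluster count are illustrative assumptions and not part of this repository.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer('StyleDistance/mstyledistance')

# Illustrative texts (not from the model's dataset): two formal and two informal
# sentences in different languages. A style embedding model should group them by
# register rather than by language or topic.
texts = [
    "Dear Dr. Smith, I am writing to request an extension of the deadline.",
    "Estimado profesor, le escribo para solicitar una prórroga del plazo.",
    "hey can u push the deadline back a bit?? thx",
    "oye, ¿puedes mover la fecha límite un poco?? gracias",
]
embeddings = model.encode(texts)

# Cluster by cosine distance between style embeddings; n_clusters=2 is an assumption.
clusterer = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
labels = clusterer.fit_predict(embeddings)
print(labels)  # e.g. [0, 0, 1, 1] if the formal and informal texts separate cleanly
```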
Widget Examples
Here are some widget examples demonstrating the model's capabilities; a sketch for reproducing Example 1 programmatically follows the list:
- Example 1
- Source Sentence: "彼は技術的な複雑さと格闘し、彼の作品は驚くべき視覚的緊張を生み出した。"
- Comparison Sentences:
- "Serviste mariscos frescos en el condado de Middlesex y áreas circundantes."
- "Él sirvió mariscos frescos en el condado de Middlesex y áreas circundantes."
- Example 2
- Source Sentence: "Bien sûr, ils termineront la construction du pont en une semaine."
- Comparison Sentences:
- "Oh, you mean when I single - handedly tackled that bespoke headboard project?"
- "Remember when I completed that bespoke headboard project on my own?"
- Example 3
- Source Sentence: "我将使用有限的色调和小尺寸进行像素艺术的简化和风格化设计。"
- Comparison Sentences:
- "Я ценю ТТ - пистолет за его огневую мощь; его проникающая способность впечатляет меня."
- "你将使用有限的色调和小尺寸进行像素艺术的简化和风格化设计。"
🔧 Technical Details
Training Data and Variants of StyleDistance
mStyleDistance was contrastively trained on mSynthSTEL, a synthetically generated dataset of positive and negative examples of approximately 40 style features used in text in nine non-English languages. By training on this synthetic dataset, mStyleDistance achieves stronger content-independence than other currently available style embedding models and is able to operate on multilingual text.
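The released training data is available as StyleDistance/mstyledistance_training_triplets (see the model information table below), and the exact recipe is documented via DataDreamer in the Training Details section. Purely as an illustration of how triplet-based contrastive training can be set up with sentence-transformers, a minimal sketch is shown below; the choice of TripletLoss, the hyperparameters, and the hard-coded toy triplet are assumptions, not the authors' configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the multilingual base encoder listed in the model information table.
model = SentenceTransformer("FacebookAI/xlm-roberta-base")

# Hypothetical triplet: the anchor and positive share a style feature (all caps),
# while the negative shares the anchor's content but not its style.
train_examples = [
    InputExample(texts=[
        "ÉL TIENE PROBLEMAS PARA LOGRAR LA TEMPERATURA ADECUADA.",  # anchor
        "TOCARÁS LA GUITARRA CON TU AMIGO ESTA TARDE.",             # positive (same style)
        "Él tiene problemas para lograr la temperatura adecuada.",  # negative (same content)
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# One pass over the toy data; the real model was trained on triplets derived from mSynthSTEL.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```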
📄 License
This model is released under the MIT license.
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Base Model | FacebookAI/xlm-roberta-base |
| Datasets | StyleDistance/mstyledistance_training_triplets |
| Library Name | sentence-transformers |
| Pipeline Tag | feature-extraction |
| Tags | datadreamer, datadreamer-0.35.0, synthetic, sentence-transformers, feature-extraction, sentence-similarity |
Citation
```bibtex
@misc{qiu2025mstyledistancemultilingualstyleembeddings,
  title={mStyleDistance: Multilingual Style Embeddings and their Evaluation},
  author={Justin Qiu and Jiacheng Zhu and Ajay Patel and Marianna Apidianaki and Chris Callison-Burch},
  year={2025},
  eprint={2502.15168},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.15168},
}
```
Training Details
This model was trained on a synthetic dataset generated with DataDreamer 🤖💤. The synthetic dataset card and model card can be found here. The training arguments can be found here.
Funding Acknowledgements
This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.