🚀 mStyleDistance Model
A multilingual style embedding model that embeds texts based on writing styles, regardless of content and language.
🚀 Quick Start
This repository contains the model introduced in mStyleDistance: Multilingual Style Embeddings and their Evaluation.
mStyleDistance is a multilingual style embedding model that aims to embed texts with similar writing styles close together and texts with different styles far apart, regardless of content and language. It is useful for stylistic analysis of multilingual text, clustering, authorship identification and verification, and automatic style transfer evaluation. The model is available on the Hugging Face Hub as StyleDistance/mstyledistance.
This model is a multilingual version of the English-only StyleDistance model.
✨ Features
- Multilingual Support: Operates on text in multiple languages.
- Content-Independence: Focuses on writing styles rather than content.
- Useful for Multiple Tasks: Applicable in stylistic analysis, clustering, and more.
📦 Installation
The model can be used with the sentence-transformers library. You can install it using the following command:

```bash
pip install sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('StyleDistance/mstyledistance')  # Load model

# Encode a query sentence and two comparison sentences, then compare their style embeddings.
query = model.encode("ÉL TIENE PROBLEMAS PARA LOGRAR LA TEMPERATURA ADECUADA PARA COCINAR LA GALLINA CORNISH.")
others = model.encode([
    "TOCARÁS LA GUITARRA CON TU AMIGO; SERÁ UNA EXCELENTE OPORTUNIDAD PARA MEJORAR TUS HABILIDADES MUSICALES.",
    "Él tiene problemas para lograr la temperatura adecuada para cocinar la gallina Cornish.",
])
print(cos_sim(query, others))
```
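In this example, the query and the first comparison sentence share an all-caps style but differ in content, while the second comparison shares the query's content but not its casing; a content-independent style embedding should therefore score the first comparison higher.

Because the embeddings capture style rather than content, they can also feed the clustering use case mentioned above. The following is a minimal sketch, assuming scikit-learn ≥ 1.2 is installed; the example texts and the cluster count are illustrative assumptions and not part of this repository.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

model = SentenceTransformer('StyleDistance/mstyledistance')

# Illustrative texts (not from the model's dataset): two formal and two informal
# sentences in different languages. A style embedding model should group them by
# register rather than by language or topic.
texts = [
    "Dear Dr. Smith, I am writing to request an extension of the deadline.",
    "Estimado profesor, le escribo para solicitar una prórroga del plazo.",
    "hey can u push the deadline back a bit?? thx",
    "oye, ¿puedes mover la fecha límite un poco?? gracias",
]
embeddings = model.encode(texts)

# Cluster by cosine distance between style embeddings; n_clusters=2 is an assumption.
clusterer = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
labels = clusterer.fit_predict(embeddings)
print(labels)  # e.g. [0, 0, 1, 1] if the formal and informal texts separate cleanly
```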
Widget Examples
Here are some widget examples demonstrating the model's capabilities; a sketch for reproducing Example 1 programmatically follows the list:
- Example 1
- Source Sentence: "彼は技術的な複雑さと格闘し、彼の作品は驚くべき視覚的緊張を生み出した。"
- Comparison Sentences:
- "Serviste mariscos frescos en el condado de Middlesex y áreas circundantes."
- "Él sirvió mariscos frescos en el condado de Middlesex y áreas circundantes."
- Example 2
- Source Sentence: "Bien sûr, ils termineront la construction du pont en une semaine."
- Comparison Sentences:
- "Oh, you mean when I single - handedly tackled that bespoke headboard project?"
- "Remember when I completed that bespoke headboard project on my own?"
- Example 3
- Source Sentence: "我将使用有限的色调和小尺寸进行像素艺术的简化和风格化设计。"
- Comparison Sentences:
- "Я ценю ТТ - пистолет за его огневую мощь; его проникающая способность впечатляет меня."
- "你将使用有限的色调和小尺寸进行像素艺术的简化和风格化设计。"
🔧 Technical Details
Training Data and Variants of StyleDistance
mStyleDistance was contrastively trained on mSynthSTEL, a synthetically generated dataset of positive and negative examples of approximately 40 style features used in text in nine non-English languages. By training on this synthetic dataset, mStyleDistance achieves stronger content-independence than other currently available style embedding models and is able to operate on multilingual text.
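The released training data is available as StyleDistance/mstyledistance_training_triplets (see the model information table below), and the exact recipe is documented via DataDreamer in the Training Details section. Purely as an illustration of how triplet-based contrastive training can be set up with sentence-transformers, a minimal sketch is shown below; the choice of TripletLoss, the hyperparameters, and the hard-coded toy triplet are assumptions, not the authors' configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the multilingual base encoder listed in the model information table.
model = SentenceTransformer("FacebookAI/xlm-roberta-base")

# Hypothetical triplet: the anchor and positive share a style feature (all caps),
# while the negative shares the anchor's content but not its style.
train_examples = [
    InputExample(texts=[
        "ÉL TIENE PROBLEMAS PARA LOGRAR LA TEMPERATURA ADECUADA.",  # anchor
        "TOCARÁS LA GUITARRA CON TU AMIGO ESTA TARDE.",             # positive (same style)
        "Él tiene problemas para lograr la temperatura adecuada.",  # negative (same content)
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

# One pass over the toy data; the real model was trained on triplets derived from mSynthSTEL.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```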
📄 License
This model is released under the MIT license.
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Base Model | FacebookAI/xlm-roberta-base |
| Datasets | StyleDistance/mstyledistance_training_triplets |
| Library Name | sentence-transformers |
| Pipeline Tag | feature-extraction |
| Tags | datadreamer, datadreamer-0.35.0, synthetic, sentence-transformers, feature-extraction, sentence-similarity |
Citation
```bibtex
@misc{qiu2025mstyledistancemultilingualstyleembeddings,
  title={mStyleDistance: Multilingual Style Embeddings and their Evaluation},
  author={Justin Qiu and Jiacheng Zhu and Ajay Patel and Marianna Apidianaki and Chris Callison-Burch},
  year={2025},
  eprint={2502.15168},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.15168},
}
```
Training Details
This model was trained on a synthetic dataset generated with DataDreamer 🤖💤. The synthetic dataset card and model card can be found here. The training arguments can be found here.
Funding Acknowledgements
This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.