StyleDistance Open-Source Style Embedding Model – Bring Texts with Similar Writing Styles Closer and Separate Texts with Different Styles

StyleDistance

Developed by StyleDistance

StyleDistance is a style embedding model designed to closely embed texts with similar writing styles and distance those with different styles, unaffected by content.

Text Embedding

Safetensors

English

Open Source License: MIT

#Style Embedding #Content-Agnostic Style Analysis #Authorship Identification

Downloads 492

Release Time: 7/17/2024

Model Overview

This model can be used for text style analysis, clustering, authorship identification and verification tasks, as well as automatic style transfer evaluation.

Model Features

Content-Agnostic Style Embedding

Capable of closely embedding texts with similar writing styles and distancing those with different styles, unaffected by content.

Synthetic Data Training

Trained contrastively on the SynthSTEL synthetic dataset, which includes positive and negative samples of 40 style features in texts.

Strong Style Analysis Capability

Compared to existing style embedding models, it achieves stronger content independence and is suitable for various style-related tasks.

Model Capabilities

Text Style Analysis

Text Clustering

Authorship Identification

Authorship Verification

Automatic Style Transfer Evaluation

Use Cases

Text Analysis

Authorship Identification

Identify the author of a text by analyzing its stylistic features.

Style Transfer Evaluation

Evaluate the effectiveness of automatic style transfer systems by comparing style differences before and after conversion.

Educational Research

Writing Style Analysis

Analyze changes in students' writing styles to provide personalized writing guidance.
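As an illustration of the authorship identification use case above, the following is a minimal sketch and not the authors' method. It assumes each candidate author is represented by the mean of the style embeddings of their known texts, and that a query is assigned to the closest profile by cosine similarity. The helper names (`cosine`, `centroid`, `identify_author`, `embed`) are hypothetical; only `SentenceTransformer` and `encode` come from the sentence-transformers library.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def identify_author(query: Vector, known: Dict[str, List[Vector]]) -> str:
    """Return the candidate whose mean style embedding is closest to the query."""
    profiles = {name: centroid(vecs) for name, vecs in known.items()}
    return max(profiles, key=lambda name: cosine(query, profiles[name]))


def embed(texts: List[str]) -> List[List[float]]:
    """Embed texts with StyleDistance (downloads the model on first call)."""
    from sentence_transformers import SentenceTransformer  # lazy import: heavy dependency
    model = SentenceTransformer("StyleDistance/styledistance")
    return model.encode(texts).tolist()
```

In practice, `known` would be built by calling `embed` on each author's known writing, and `query` would be the embedding of the disputed text. Averaging is only one way to build a profile; nearest-neighbour voting over individual texts is another reasonable choice.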

🚀 StyleDistance - Sentence Similarity Model

This project presents StyleDistance, a sentence-similarity model that produces style embeddings. It embeds texts with similar writing styles close together and texts with different styles far apart, regardless of content. The model is useful for tasks such as stylistic analysis, clustering, authorship identification, and automatic style-transfer evaluation.

📦 Model Information

Property	Details
Base Model	FacebookAI/roberta-base
Datasets	SynthSTEL/styledistance_training_triplets, StyleDistance/synthstel
Language	en
Library Name	sentence-transformers
License	mit
Pipeline Tag	sentence-similarity
Tags	datadreamer, datadreamer-0.35.0, synthetic, sentence-transformers, feature-extraction, sentence-similarity

🚀 Quick Start

This repository contains the model introduced in StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples.

StyleDistance is a style embedding model aiming to embed texts with similar writing styles closely and different styles far apart, regardless of content. It can be beneficial for stylistic analysis of text, clustering, authorship identification and verification tasks, and automatic style transfer evaluation.
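For automatic style transfer evaluation, one possible heuristic is to check whether the transferred output sits closer to an example of the target style than the original text did. The sketch below is an assumption for illustration, not a metric defined by the model authors. `style_shift` and `score_transfer` are hypothetical names.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def style_shift(source_vec, output_vec, target_vec) -> float:
    """Hypothetical score: gain in similarity to the target style after transfer.

    Positive values mean the output is stylistically closer to the target
    example than the source was; zero or negative means no improvement.
    """
    return cosine(output_vec, target_vec) - cosine(source_vec, target_vec)


def score_transfer(source_text: str, output_text: str, target_example: str) -> float:
    """Embed the three texts with StyleDistance and return the style shift."""
    from sentence_transformers import SentenceTransformer  # lazy import: heavy dependency
    model = SentenceTransformer("StyleDistance/styledistance")
    src, out, tgt = model.encode([source_text, output_text, target_example]).tolist()
    return style_shift(src, out, tgt)
```

This score says nothing about whether the meaning was preserved, so a full evaluation would pair it with a separate content-similarity check.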

✨ Features

Training Data and Variants of StyleDistance

StyleDistance was contrastively trained on SynthSTEL, a synthetically generated dataset of positive and negative examples of 40 style features being used in text. By using this synthetic dataset, StyleDistance can achieve stronger content-independence than other current style embedding models. This particular model was trained using a combination of the synthetic dataset and a [real dataset that makes use of authorship datasets from Reddit to train style embeddings](https://aclanthology.org/2022.repl4nlp-1.26/). For a version purely trained on synthetic data, see this other version of StyleDistance.
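To make the idea of contrastive training on triplets concrete, here is a minimal sketch of a triplet margin loss over cosine distance. This page does not state the exact objective or hyperparameters used for StyleDistance, so the loss form, the `margin` value, and the function names are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def triplet_loss(anchor, positive, negative, margin: float = 0.1) -> float:
    """Triplet margin loss on cosine distance (assumed objective, not confirmed).

    anchor / positive: embeddings of texts that share a style feature.
    negative: embedding of a text that lacks that feature.
    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Pairing a positive and a negative that express the same content but differ in one style feature pushes the encoder to separate texts by style rather than by topic, which is the intuition behind the content-independence claim.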

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('StyleDistance/styledistance')  # Load the model from the Hugging Face Hub

# Source sentence written in a texting style ("h8", "2")
source = model.encode("Did you hear about the Wales wing? He'll h8 2 withdraw due 2 injuries from future competitions.")
# Candidates: (1) different content, similar texting style; (2) same content, standard spelling
others = model.encode(["We're raising funds 2 improve our school's storage facilities and add new playground equipment!", "Did you hear about the Wales wing? He'll hate to withdraw due to injuries from future competitions."])
# Cosine similarity of the source embedding to each candidate embedding
print(cos_sim(source, others))
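Cosine similarity scores like those printed above can be turned into a same-author decision for verification tasks by comparing them to a cutoff. The sketch below assumes a small set of labeled text pairs is available to calibrate that cutoff. `best_threshold` and `verify` are hypothetical helpers, and no particular threshold value is recommended by the model authors.

```python
from typing import List


def best_threshold(scores: List[float], same_author: List[bool]) -> float:
    """Pick the cutoff, among the observed scores, that maximizes accuracy.

    scores: cosine similarities for labeled text pairs.
    same_author: True where the pair was written by the same author.
    Ties are resolved in favour of the lowest cutoff.
    """
    def accuracy(threshold: float) -> float:
        hits = sum((s >= threshold) == label for s, label in zip(scores, same_author))
        return hits / len(scores)

    return max(sorted(set(scores)), key=accuracy)


def verify(score: float, threshold: float) -> bool:
    """Predict 'same author' when the similarity reaches the cutoff."""
    return score >= threshold


# Toy calibration data (made-up numbers, for illustration only)
calibration_scores = [0.9, 0.8, 0.3, 0.2]
calibration_labels = [True, True, False, False]
cutoff = best_threshold(calibration_scores, calibration_labels)
```

A cutoff tuned on one domain may not carry over to another, so it is worth recalibrating on pairs that resemble the texts being verified.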

Widget Examples

Example 1
- Source Sentence: Did you hear about the Wales wing? He'll h8 2 withdraw due 2 injuries from future competitions.
- Comparison Sentences:
  - We're raising funds 2 improve our school's storage facilities and add new playground equipment!
  - Did you hear about the Wales wing? He'll hate to withdraw due to injuries from future competitions.
Example 2
- Source Sentence: You planned the DesignMeets Decades of Design event; you executed it perfectly.
- Comparison Sentences:
  - We'll find it hard to prove the thief didn't face a real threat!
  - You orchestrated the DesignMeets Decades of Design gathering; you actualized it flawlessly.
Example 3
- Source Sentence: Did the William Barr maintain a commitment to allow Robert Mueller to finish the inquiry?
- Comparison Sentences:
  - Will the artist be compiling a music album, or will there be a different focus in the future?
  - Did William Barr maintain commitment to allow Robert Mueller to finish inquiry?

📄 License

The model is released under the MIT license.

📚 Citation

@misc{patel2025styledistancestrongercontentindependentstyle,
      title={StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples},
      author={Ajay Patel and Jiacheng Zhu and Justin Qiu and Zachary Horvitz and Marianna Apidianaki and Kathleen McKeown and Chris Callison-Burch},
      year={2025},
      eprint={2410.12757},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.12757},
}

🔗 Trained with DataDreamer

This model was trained with a synthetic dataset with DataDreamer 🤖💤. The synthetic dataset card and model card can be found here. The training arguments can be found here.

💸 Funding Acknowledgements

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
