ModernCE Base STS

Developed by dleemiller
The ModernBERT cross-encoder is a high-performance model for scoring the semantic similarity of text pairs, with support for long-context processing.
Downloads: 351
Release date: 1/13/2025

Model Overview

This model is based on the ModernBERT-base architecture and compares the semantic similarity of two texts through a cross-encoder approach, outputting a similarity score between 0 and 1. It is suitable for scenarios such as evaluating large language model outputs and text matching.
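A minimal sketch of scoring a text pair as described above, using the sentence-transformers `CrossEncoder` API. The model id `dleemiller/ModernCE-base-sts` is assumed from the page's author and title; the helper for building pairs is illustrative.

```python
# Sketch: scoring text pairs with a cross-encoder (illustrative).
# Assumes the model is published as "dleemiller/ModernCE-base-sts".

def make_pairs(candidates, reference):
    """Pair each candidate text with the reference for cross-encoding."""
    return [(reference, c) for c in candidates]

if __name__ == "__main__":
    # Requires `pip install sentence-transformers`; downloads the model.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder("dleemiller/ModernCE-base-sts")
    pairs = make_pairs(
        ["The cat sat on the mat.", "Stocks fell sharply today."],
        "A cat is resting on a rug.",
    )
    scores = model.predict(pairs)  # one similarity score in [0, 1] per pair
    for (_, candidate), score in zip(pairs, scores):
        print(f"{score:.3f}  {candidate}")
```

Because it scores both texts jointly in one forward pass, a cross-encoder is more accurate than comparing two independent embeddings, at the cost of one inference per pair.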

Model Features

High performance
Achieves a Pearson correlation of 0.9162 and a Spearman correlation of 0.9122 on the STS-Benchmark test set.
Efficient architecture
Built on ModernBERT-base (149M parameters), keeping inference fast.
Extended context length
Supports processing sequences up to 8192 tokens, making it ideal for evaluating LLM outputs.
Diverse training
Pre-trained on dleemiller/wiki-sim and fine-tuned on sentence-transformers/stsb.
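The benchmark figures above are correlations between the model's predicted scores and the gold STS-Benchmark labels. A self-contained sketch of how those two metrics are computed (toy data, not the actual benchmark; the rank helper ignores ties for brevity):

```python
# Pearson: linear correlation of predictions vs. gold labels.
# Spearman: Pearson computed on the ranks of each list.
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def ranks(xs):
    # Simple ranking; does not average tied ranks.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    for rank, i in enumerate(order):
        out[i] = float(rank)
    return out

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

predicted = [0.91, 0.12, 0.55, 0.78]  # illustrative model scores
gold      = [0.95, 0.05, 0.60, 0.70]  # illustrative reference labels
print(round(pearson(predicted, gold), 4))
print(round(spearman(predicted, gold), 4))
```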

Model Capabilities

Semantic similarity calculation
Text pair comparison
Long-text processing

Use Cases

Text evaluation
Large language model output evaluation
Evaluates the semantic similarity between text generated by large language models and reference texts.
Provides a similarity score between 0 and 1, helping quantify model output quality.
Text matching
Compares the semantic similarity of two texts for use in QA systems, information retrieval, and other scenarios.
Highly accurate similarity scoring improves matching effectiveness.
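The matching workflow above amounts to reranking: score every candidate against the query, then sort by score. A sketch under one stated assumption: `score_pair` below is a word-overlap stand-in so the example runs without the model; in practice it would call the cross-encoder's `predict` on each pair.

```python
# Reranking sketch for QA / retrieval scenarios.
# score_pair is a toy stand-in (Jaccard word overlap), NOT the model.

def score_pair(query, passage):
    """Toy similarity in [0, 1]; replace with a cross-encoder call."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q | p), 1)

def rerank(query, passages):
    """Return passages sorted from most to least similar to the query."""
    return sorted(passages, key=lambda p: score_pair(query, p), reverse=True)

ranked = rerank(
    "how do cross encoders work",
    ["Cross encoders score text pairs jointly.",
     "The weather is sunny today."],
)
```

With a real cross-encoder, reranking a short shortlist (for example the top 50 hits from a cheap retriever) keeps latency manageable while the pairwise scoring improves final precision.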