🚀 ModernBERT Embed base Legal Matryoshka
This model is a fine-tuned sentence-transformers model derived from [nomic-ai/modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) on a JSON dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other tasks.
🚀 Quick Start
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then, you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("manishh16/modernbert-embed-base-legal-matryoshka-2")

# Run inference
sentences = [
    'protests pursuant to 28 U.S.C. § 1491(b). See 28 U.S.C. § 1491(b). Section 1491(b)(1) grants the \n17 \n \ncourt jurisdiction over protests filed “by an interested party objecting to a solicitation by a Federal \nagency for bids or proposals for a proposed contract... or any alleged violation of statute or',
    'Under which U.S. Code section are the protests filed?',
    "Which agency's declaration is mentioned?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # [3, 768]

# Compute pairwise similarity scores between all sentence embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [3, 3]
```
✨ Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Applicable for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, etc.
📦 Installation
Install the Sentence Transformers library using the following command:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("manishh16/modernbert-embed-base-legal-matryoshka-2")

# Run inference
sentences = [
    'protests pursuant to 28 U.S.C. § 1491(b). See 28 U.S.C. § 1491(b). Section 1491(b)(1) grants the \n17 \n \ncourt jurisdiction over protests filed “by an interested party objecting to a solicitation by a Federal \nagency for bids or proposals for a proposed contract... or any alleged violation of statute or',
    'Under which U.S. Code section are the protests filed?',
    "Which agency's declaration is mentioned?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # [3, 768]

# Compute pairwise similarity scores between all sentence embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [3, 3]
```
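As a Matryoshka model, it was evaluated at output dimensionalities of 768, 512, 256, 128, and 64 (see the Evaluation section), so embeddings can be truncated to a prefix of the full vector and re-normalized, trading accuracy for storage and speed. A minimal NumPy sketch of that truncation step, using random unit vectors as a stand-in for `model.encode()` output:

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding and re-normalize
    to unit length so cosine similarity remains a plain dot product."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Toy stand-in for model.encode() output: 3 unit-normalized 768-dim vectors
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)
print(small.shape)  # (3, 256)
```

In practice, recent versions of Sentence Transformers can do this for you via the `truncate_dim` argument, e.g. `SentenceTransformer("manishh16/modernbert-embed-base-legal-matryoshka-2", truncate_dim=256)`.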
📚 Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base model | [nomic-ai/modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Training Dataset | json |
| Language | en |
| License | apache-2.0 |
Model Sources
- Documentation: [Sentence Transformers Documentation](https://www.sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
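The architecture above encodes text with ModernBERT, mean-pools the token embeddings, and L2-normalizes the result. As a minimal NumPy sketch (using random arrays as stand-ins for real hidden states), the pooling and normalization steps work roughly like this:

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings: np.ndarray,
                            attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings over non-padding positions, then
    L2-normalize, mirroring the Pooling(mean) + Normalize modules."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (B, T, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (B, D)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # (B, 1)
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy batch: 2 sequences, 5 tokens, 768-dim hidden states; second sequence padded
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 768))
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])
sentence_embeddings = mean_pool_and_normalize(tokens, mask)
print(sentence_embeddings.shape)  # (2, 768)
```

Because the final embeddings are unit-length, cosine similarity reduces to a dot product.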
Evaluation
Information Retrieval (dim_768)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.592 |
| cosine_accuracy@3 | 0.6352 |
| cosine_accuracy@5 | 0.7032 |
| cosine_accuracy@10 | 0.7666 |
| cosine_precision@1 | 0.592 |
| cosine_precision@3 | 0.5683 |
| cosine_precision@5 | 0.4263 |
| cosine_precision@10 | 0.2408 |
| cosine_recall@1 | 0.2012 |
| cosine_recall@3 | 0.547 |
| cosine_recall@5 | 0.6664 |
| cosine_recall@10 | 0.7508 |
| cosine_ndcg@10 | 0.6774 |
| cosine_mrr@10 | 0.6317 |
| cosine_map@100 | 0.6707 |
Information Retrieval (dim_512)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.5858 |
| cosine_accuracy@3 | 0.6167 |
| cosine_accuracy@5 | 0.6909 |
| cosine_accuracy@10 | 0.7666 |
| cosine_precision@1 | 0.5858 |
| cosine_precision@3 | 0.5574 |
| cosine_precision@5 | 0.4176 |
| cosine_precision@10 | 0.2417 |
| cosine_recall@1 | 0.1984 |
| cosine_recall@3 | 0.5353 |
| cosine_recall@5 | 0.6515 |
| cosine_recall@10 | 0.7518 |
| cosine_ndcg@10 | 0.6722 |
| cosine_mrr@10 | 0.6236 |
| cosine_map@100 | 0.662 |
Information Retrieval (dim_256)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.5672 |
| cosine_accuracy@3 | 0.5873 |
| cosine_accuracy@5 | 0.6646 |
| cosine_accuracy@10 | 0.7311 |
| cosine_precision@1 | 0.5672 |
| cosine_precision@3 | 0.5384 |
| cosine_precision@5 | 0.4009 |
| cosine_precision@10 | 0.2308 |
| cosine_recall@1 | 0.1906 |
| cosine_recall@3 | 0.5152 |
| cosine_recall@5 | 0.6264 |
| cosine_recall@10 | 0.7205 |
| cosine_ndcg@10 | 0.6454 |
| cosine_mrr@10 | 0.6009 |
| cosine_map@100 | 0.6377 |
Information Retrieval (dim_128)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.4992 |
| cosine_accuracy@3 | 0.5301 |
| cosine_accuracy@5 | 0.6136 |
| cosine_accuracy@10 | 0.6785 |
| cosine_precision@1 | 0.4992 |
| cosine_precision@3 | 0.4745 |
| cosine_precision@5 | 0.3654 |
| cosine_precision@10 | 0.2159 |
| cosine_recall@1 | 0.1695 |
| cosine_recall@3 | 0.4581 |
| cosine_recall@5 | 0.5706 |
| cosine_recall@10 | 0.6683 |
| cosine_ndcg@10 | 0.5892 |
| cosine_mrr@10 | 0.5386 |
| cosine_map@100 | 0.5783 |
Information Retrieval (dim_64)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.3632 |
| cosine_accuracy@3 | 0.4019 |
| cosine_accuracy@5 | 0.473 |
| cosine_accuracy@10 | 0.527 |
| cosine_precision@1 | 0.3632 |
| cosine_precision@3 | 0.3514 |
| cosine_precision@5 | 0.2782 |
| cosine_precision@10 | 0.1651 |
| cosine_recall@1 | 0.1236 |
| cosine_recall@3 | 0.3391 |
| cosine_recall@5 | 0.4364 |
| cosine_recall@10 | 0.5143 |
| cosine_ndcg@10 | 0.4444 |
| cosine_mrr@10 | 0.4003 |
| cosine_map@100 | 0.4462 |
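The rank-based metrics in the tables above follow standard information-retrieval definitions: accuracy@k is 1 for a query if any relevant document appears in the top-k results, and recall@k is the fraction of that query's relevant documents found in the top-k. A minimal sketch over a single hypothetical query (the document IDs are made up for illustration):

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """accuracy@k: 1.0 if any relevant document appears in the top-k, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def recall_at_k(ranked_ids, relevant_ids, k):
    """recall@k: fraction of relevant documents retrieved in the top-k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical ranking for one query with three relevant documents
ranked = ["d7", "d2", "d9", "d1", "d4"]
relevant = {"d2", "d4", "d8"}
print(accuracy_at_k(ranked, relevant, 1))  # 0.0 (top-1 is not relevant)
print(accuracy_at_k(ranked, relevant, 3))  # 1.0 ("d2" is in the top-3)
print(recall_at_k(ranked, relevant, 5))    # 2/3 ("d2" and "d4" retrieved)
```

The reported values average these per-query scores over the whole evaluation set; this is why recall@1 can sit well below accuracy@1 when queries have multiple relevant passages.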
Training Details
Training Dataset
json
- Dataset: json
- Size: 5,822 training samples
- Columns: `positive` and `anchor`
📄 License
This model is released under the apache-2.0 license.