🚀 ModernBERT Embed base Legal Matryoshka
This model is a fine-tuned sentence-transformers model derived from [nomic-ai/modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) on a JSON dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other tasks.
🚀 Quick Start
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then, you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("manishh16/modernbert-embed-base-legal-matryoshka-2")

# Run inference
sentences = [
    'protests pursuant to 28 U.S.C. § 1491(b). See 28 U.S.C. § 1491(b). Section 1491(b)(1) grants the \n17 \n \ncourt jurisdiction over protests filed “by an interested party objecting to a solicitation by a Federal \nagency for bids or proposals for a proposed contract... or any alleged violation of statute or',
    'Under which U.S. Code section are the protests filed?',
    "Which agency's declaration is mentioned?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # [3, 768]

# Compute pairwise similarity scores between all sentence embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [3, 3]
```
✨ Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Applicable for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, etc.
📦 Installation
Install the Sentence Transformers library using the following command:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("manishh16/modernbert-embed-base-legal-matryoshka-2")

# Run inference
sentences = [
    'protests pursuant to 28 U.S.C. § 1491(b). See 28 U.S.C. § 1491(b). Section 1491(b)(1) grants the \n17 \n \ncourt jurisdiction over protests filed “by an interested party objecting to a solicitation by a Federal \nagency for bids or proposals for a proposed contract... or any alleged violation of statute or',
    'Under which U.S. Code section are the protests filed?',
    "Which agency's declaration is mentioned?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # [3, 768]

# Compute pairwise similarity scores between all sentence embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # [3, 3]
```
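As a Matryoshka model, it was evaluated at output dimensionalities of 768, 512, 256, 128, and 64 (see the Evaluation section), so embeddings can be truncated to a prefix of the full vector and re-normalized, trading accuracy for storage and speed. A minimal NumPy sketch of that truncation step, using random unit vectors as a stand-in for `model.encode()` output:

```python
import numpy as np

def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each embedding and re-normalize
    to unit length so cosine similarity remains a plain dot product."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Toy stand-in for model.encode() output: 3 unit-normalized 768-dim vectors
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)
print(small.shape)  # (3, 256)
```

In practice, recent versions of Sentence Transformers can do this for you via the `truncate_dim` argument, e.g. `SentenceTransformer("manishh16/modernbert-embed-base-legal-matryoshka-2", truncate_dim=256)`.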
📚 Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base model | [nomic-ai/modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Training Dataset | json |
| Language | en |
| License | apache-2.0 |
Model Sources
- Documentation: [Sentence Transformers Documentation](https://www.sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
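The architecture above encodes text with ModernBERT, mean-pools the token embeddings, and L2-normalizes the result. As a minimal NumPy sketch (using random arrays as stand-ins for real hidden states), the pooling and normalization steps work roughly like this:

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings: np.ndarray,
                            attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings over non-padding positions, then
    L2-normalize, mirroring the Pooling(mean) + Normalize modules."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (B, T, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (B, D)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # (B, 1)
    pooled = summed / counts
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy batch: 2 sequences, 5 tokens, 768-dim hidden states; second sequence padded
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 768))
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])
sentence_embeddings = mean_pool_and_normalize(tokens, mask)
print(sentence_embeddings.shape)  # (2, 768)
```

Because the final embeddings are unit-length, cosine similarity reduces to a dot product.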
Evaluation
Information Retrieval (dim_768)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.592 |
| cosine_accuracy@3 | 0.6352 |
| cosine_accuracy@5 | 0.7032 |
| cosine_accuracy@10 | 0.7666 |
| cosine_precision@1 | 0.592 |
| cosine_precision@3 | 0.5683 |
| cosine_precision@5 | 0.4263 |
| cosine_precision@10 | 0.2408 |
| cosine_recall@1 | 0.2012 |
| cosine_recall@3 | 0.547 |
| cosine_recall@5 | 0.6664 |
| cosine_recall@10 | 0.7508 |
| cosine_ndcg@10 | 0.6774 |
| cosine_mrr@10 | 0.6317 |
| cosine_map@100 | 0.6707 |
Information Retrieval (dim_512)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.5858 |
| cosine_accuracy@3 | 0.6167 |
| cosine_accuracy@5 | 0.6909 |
| cosine_accuracy@10 | 0.7666 |
| cosine_precision@1 | 0.5858 |
| cosine_precision@3 | 0.5574 |
| cosine_precision@5 | 0.4176 |
| cosine_precision@10 | 0.2417 |
| cosine_recall@1 | 0.1984 |
| cosine_recall@3 | 0.5353 |
| cosine_recall@5 | 0.6515 |
| cosine_recall@10 | 0.7518 |
| cosine_ndcg@10 | 0.6722 |
| cosine_mrr@10 | 0.6236 |
| cosine_map@100 | 0.662 |
Information Retrieval (dim_256)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.5672 |
| cosine_accuracy@3 | 0.5873 |
| cosine_accuracy@5 | 0.6646 |
| cosine_accuracy@10 | 0.7311 |
| cosine_precision@1 | 0.5672 |
| cosine_precision@3 | 0.5384 |
| cosine_precision@5 | 0.4009 |
| cosine_precision@10 | 0.2308 |
| cosine_recall@1 | 0.1906 |
| cosine_recall@3 | 0.5152 |
| cosine_recall@5 | 0.6264 |
| cosine_recall@10 | 0.7205 |
| cosine_ndcg@10 | 0.6454 |
| cosine_mrr@10 | 0.6009 |
| cosine_map@100 | 0.6377 |
Information Retrieval (dim_128)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.4992 |
| cosine_accuracy@3 | 0.5301 |
| cosine_accuracy@5 | 0.6136 |
| cosine_accuracy@10 | 0.6785 |
| cosine_precision@1 | 0.4992 |
| cosine_precision@3 | 0.4745 |
| cosine_precision@5 | 0.3654 |
| cosine_precision@10 | 0.2159 |
| cosine_recall@1 | 0.1695 |
| cosine_recall@3 | 0.4581 |
| cosine_recall@5 | 0.5706 |
| cosine_recall@10 | 0.6683 |
| cosine_ndcg@10 | 0.5892 |
| cosine_mrr@10 | 0.5386 |
| cosine_map@100 | 0.5783 |
Information Retrieval (dim_64)
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.3632 |
| cosine_accuracy@3 | 0.4019 |
| cosine_accuracy@5 | 0.473 |
| cosine_accuracy@10 | 0.527 |
| cosine_precision@1 | 0.3632 |
| cosine_precision@3 | 0.3514 |
| cosine_precision@5 | 0.2782 |
| cosine_precision@10 | 0.1651 |
| cosine_recall@1 | 0.1236 |
| cosine_recall@3 | 0.3391 |
| cosine_recall@5 | 0.4364 |
| cosine_recall@10 | 0.5143 |
| cosine_ndcg@10 | 0.4444 |
| cosine_mrr@10 | 0.4003 |
| cosine_map@100 | 0.4462 |
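The rank-based metrics in the tables above follow standard information-retrieval definitions: accuracy@k is 1 for a query if any relevant document appears in the top-k results, and recall@k is the fraction of that query's relevant documents found in the top-k. A minimal sketch over a single hypothetical query (the document IDs are made up for illustration):

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """accuracy@k: 1.0 if any relevant document appears in the top-k, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def recall_at_k(ranked_ids, relevant_ids, k):
    """recall@k: fraction of relevant documents retrieved in the top-k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

# Hypothetical ranking for one query with three relevant documents
ranked = ["d7", "d2", "d9", "d1", "d4"]
relevant = {"d2", "d4", "d8"}
print(accuracy_at_k(ranked, relevant, 1))  # 0.0 (top-1 is not relevant)
print(accuracy_at_k(ranked, relevant, 3))  # 1.0 ("d2" is in the top-3)
print(recall_at_k(ranked, relevant, 5))    # 2/3 ("d2" and "d4" retrieved)
```

The reported values average these per-query scores over the whole evaluation set; this is why recall@1 can sit well below accuracy@1 when queries have multiple relevant passages.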
Training Details
Training Dataset
json
- Dataset: json
- Size: 5,822 training samples
- Columns: `positive` and `anchor`
📄 License
This model is released under the apache-2.0 license.