# 🚀 ModernBERT Embed base Legal Matryoshka
This is a Sentence Transformers model fine-tuned from [nomic-ai/modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) on the [AdamLucek/legal-rag-positives-synthetic](https://huggingface.co/datasets/AdamLucek/legal-rag-positives-synthetic) dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## ✨ Features

- Maps text to a 768-dimensional dense vector space.
- Can be used for multiple natural language processing tasks, including semantic similarity, search, and classification.
## 📦 Installation

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("AdamLucek/ModernBERT-embed-base-legal-MRL")

# One legal passage and two candidate questions
sentences = [
    'contracting/contracting-assistance-programs/sba-mentor-protege-program (last visited Apr. 19, \n2023). \n5 \n \nprotégé must demonstrate that the added mentor-protégé relationship will not adversely affect the \ndevelopment of either protégé firm (e.g., the second firm may not be a competitor of the first \nfirm).” 13 C.F.R. § 125.9(b)(3).',
    'What must the protégé demonstrate about the mentor-protégé relationship?',
    'What discretion do district courts have regarding a defendant’s invocation of FOIA exemptions?',
]

# Encode the sentences into embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Compute pairwise cosine similarities between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
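Because this is a Matryoshka model trained across the dimensions evaluated below (768/512/256/128/64), embeddings can also be truncated to a smaller size for faster search and lower storage cost, at a modest quality trade-off. A minimal sketch, assuming a recent sentence-transformers release that supports the `truncate_dim` argument:

```python
from sentence_transformers import SentenceTransformer

# Load the model so that encode() returns 256-dimensional embeddings.
# Any of the evaluated sizes (768, 512, 256, 128, 64) should work here.
model = SentenceTransformer(
    "AdamLucek/ModernBERT-embed-base-legal-MRL",
    truncate_dim=256,
)

embeddings = model.encode([
    "What must the protégé demonstrate about the mentor-protégé relationship?",
])
print(embeddings.shape)  # (1, 256)
```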
## 📚 Documentation

### Model Details

#### Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base Model | [nomic-ai/modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Training Dataset | [AdamLucek/legal-rag-positives-synthetic](https://huggingface.co/datasets/AdamLucek/legal-rag-positives-synthetic) |
| Language | en |
| License | apache-2.0 |
#### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
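Loading the model and printing it should reproduce the module stack above, which is a quick way to confirm the mean-pooling and normalization configuration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AdamLucek/ModernBERT-embed-base-legal-MRL")

# The repr lists each module: Transformer -> Pooling (mean) -> Normalize
print(model)
```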
### Evaluation

#### Metrics

##### Information Retrieval
| Metric | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
|---|---|---|---|---|---|
| cosine_accuracy@1 | 0.5286 | 0.5162 | 0.4822 | 0.4158 | 0.3122 |
| cosine_accuracy@3 | 0.5719 | 0.5487 | 0.5286 | 0.4436 | 0.3509 |
| cosine_accuracy@5 | 0.6646 | 0.6414 | 0.5981 | 0.5363 | 0.4359 |
| cosine_accuracy@10 | 0.7311 | 0.7172 | 0.6785 | 0.6105 | 0.4791 |
| cosine_precision@1 | 0.5286 | 0.5162 | 0.4822 | 0.4158 | 0.3122 |
| cosine_precision@3 | 0.5142 | 0.4982 | 0.4699 | 0.3993 | 0.3091 |
| cosine_precision@5 | 0.3941 | 0.3808 | 0.3586 | 0.3128 | 0.2504 |
| cosine_precision@10 | 0.2329 | 0.2272 | 0.2147 | 0.1924 | 0.1498 |
| cosine_recall@1 | 0.1788 | 0.174 | 0.1627 | 0.1426 | 0.105 |
| cosine_recall@3 | 0.4894 | 0.4735 | 0.4493 | 0.3836 | 0.2955 |
| cosine_recall@5 | 0.6121 | 0.5911 | 0.5569 | 0.4878 | 0.3931 |
| cosine_recall@10 | 0.7184 | 0.7023 | 0.6642 | 0.5963 | 0.4681 |
| cosine_ndcg@10 | 0.63 | 0.6138 | 0.5781 | 0.5109 | 0.3956 |
| cosine_mrr@10 | 0.5741 | 0.5593 | 0.5249 | 0.4573 | 0.3509 |
| cosine_map@100 | 0.6186 | 0.6022 | 0.5698 | 0.503 | 0.3939 |
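These figures are information-retrieval metrics computed at each Matryoshka dimension. A minimal sketch of how such metrics can be reproduced with sentence-transformers' `InformationRetrievalEvaluator`; the toy corpus, query, and `dim_256` name below are illustrative placeholders, not the actual evaluation data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Evaluate at a truncated dimension, e.g. 256
model = SentenceTransformer("AdamLucek/ModernBERT-embed-base-legal-MRL", truncate_dim=256)

# Illustrative placeholders; the real evaluation used held-out legal passages.
corpus = {"doc1": "protégé must demonstrate that the added mentor-protégé relationship ..."}
queries = {"q1": "What must the protégé demonstrate about the mentor-protégé relationship?"}
relevant_docs = {"q1": {"doc1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dim_256")
results = evaluator(model)  # accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
print(results)
```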
### Training Details

#### AdamLucek/legal-rag-positives-synthetic

- Dataset: [AdamLucek/legal-rag-positives-synthetic](https://huggingface.co/datasets/AdamLucek/legal-rag-positives-synthetic)
- Size: 5,822 training samples
- Columns: `positive` and `anchor`
- Approximate statistics based on the first 1000 samples:
| | positive | anchor |
|---|---|---|
| type | string | string |
| details | min: 15 tokens, mean: 97.6 tokens, max: 153 tokens | min: 8 tokens, mean: 16.68 tokens, max: 41 tokens |
- Samples:

| positive | anchor |
|---|---|
| infrastructure security information,” the information at issue must, “if disclosed . . . reveal vulnerabilities in Department of Defense critical infrastructure.” 10 U.S.C. § 130e(f). The closest the Department comes is asserting that the information “individually or in the aggregate, would enable | What type of information must reveal vulnerabilities if disclosed? |
| they have bid.” Oral Arg. Tr. at 42:18–20. Plaintiffs also assert that, should this Court require the Polaris Solicitations to consider price at the IDIQ level, such an adjustment “adds a solicitation requirement that would ne | What do plaintiffs assert about the Polaris Solicitations? |
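The card does not list the exact training configuration, but Matryoshka embedding models are typically fine-tuned with `MatryoshkaLoss` wrapping a contrastive objective such as `MultipleNegativesRankingLoss`. A minimal sketch under that assumption, using the dataset and dimensions named above (everything else is illustrative):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# (anchor, positive) pairs; column order matters for the in-batch negatives loss
dataset = load_dataset("AdamLucek/legal-rag-positives-synthetic", split="train")
dataset = dataset.select_columns(["anchor", "positive"])

# Train the full embedding plus truncated prefixes at the evaluated dimensions
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(model=model, train_dataset=dataset, loss=loss)
trainer.train()
```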
## 📄 License

This model is licensed under the Apache 2.0 license.