# 🚀 ModernBERT Embed base Legal Matryoshka
This is a Sentence Transformers model fine-tuned from [nomic-ai/modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) on the [AdamLucek/legal-rag-positives-synthetic](https://huggingface.co/datasets/AdamLucek/legal-rag-positives-synthetic) dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## ✨ Features

- Maps text to a 768-dimensional dense vector space.
- Can be used for multiple natural language processing tasks, including semantic similarity, search, and classification.
## 📦 Installation

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("AdamLucek/ModernBERT-embed-base-legal-MRL")

# One legal passage and two candidate questions
sentences = [
    'contracting/contracting-assistance-programs/sba-mentor-protege-program (last visited Apr. 19, \n2023). \n5 \n \nprotégé must demonstrate that the added mentor-protégé relationship will not adversely affect the \ndevelopment of either protégé firm (e.g., the second firm may not be a competitor of the first \nfirm).” 13 C.F.R. § 125.9(b)(3).',
    'What must the protégé demonstrate about the mentor-protégé relationship?',
    'What discretion do district courts have regarding a defendant’s invocation of FOIA exemptions?',
]

# Encode the sentences into embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Compute pairwise cosine similarities between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (3, 3)
```
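Because this is a Matryoshka model trained across the dimensions evaluated below (768/512/256/128/64), embeddings can also be truncated to a smaller size for faster search and lower storage cost, at a modest quality trade-off. A minimal sketch, assuming a recent sentence-transformers release that supports the `truncate_dim` argument:

```python
from sentence_transformers import SentenceTransformer

# Load the model so that encode() returns 256-dimensional embeddings.
# Any of the evaluated sizes (768, 512, 256, 128, 64) should work here.
model = SentenceTransformer(
    "AdamLucek/ModernBERT-embed-base-legal-MRL",
    truncate_dim=256,
)

embeddings = model.encode([
    "What must the protégé demonstrate about the mentor-protégé relationship?",
])
print(embeddings.shape)  # (1, 256)
```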
## 📚 Documentation

### Model Details

#### Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Base Model | [nomic-ai/modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base) |
| Maximum Sequence Length | 8192 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Training Dataset | [AdamLucek/legal-rag-positives-synthetic](https://huggingface.co/datasets/AdamLucek/legal-rag-positives-synthetic) |
| Language | en |
| License | apache-2.0 |
#### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
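Loading the model and printing it should reproduce the module stack above, which is a quick way to confirm the mean-pooling and normalization configuration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("AdamLucek/ModernBERT-embed-base-legal-MRL")

# The repr lists each module: Transformer -> Pooling (mean) -> Normalize
print(model)
```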
### Evaluation

#### Metrics

##### Information Retrieval
| Metric | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
|---|---|---|---|---|---|
| cosine_accuracy@1 | 0.5286 | 0.5162 | 0.4822 | 0.4158 | 0.3122 |
| cosine_accuracy@3 | 0.5719 | 0.5487 | 0.5286 | 0.4436 | 0.3509 |
| cosine_accuracy@5 | 0.6646 | 0.6414 | 0.5981 | 0.5363 | 0.4359 |
| cosine_accuracy@10 | 0.7311 | 0.7172 | 0.6785 | 0.6105 | 0.4791 |
| cosine_precision@1 | 0.5286 | 0.5162 | 0.4822 | 0.4158 | 0.3122 |
| cosine_precision@3 | 0.5142 | 0.4982 | 0.4699 | 0.3993 | 0.3091 |
| cosine_precision@5 | 0.3941 | 0.3808 | 0.3586 | 0.3128 | 0.2504 |
| cosine_precision@10 | 0.2329 | 0.2272 | 0.2147 | 0.1924 | 0.1498 |
| cosine_recall@1 | 0.1788 | 0.174 | 0.1627 | 0.1426 | 0.105 |
| cosine_recall@3 | 0.4894 | 0.4735 | 0.4493 | 0.3836 | 0.2955 |
| cosine_recall@5 | 0.6121 | 0.5911 | 0.5569 | 0.4878 | 0.3931 |
| cosine_recall@10 | 0.7184 | 0.7023 | 0.6642 | 0.5963 | 0.4681 |
| cosine_ndcg@10 | 0.63 | 0.6138 | 0.5781 | 0.5109 | 0.3956 |
| cosine_mrr@10 | 0.5741 | 0.5593 | 0.5249 | 0.4573 | 0.3509 |
| cosine_map@100 | 0.6186 | 0.6022 | 0.5698 | 0.503 | 0.3939 |
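These figures are information-retrieval metrics computed at each Matryoshka dimension. A minimal sketch of how such metrics can be reproduced with sentence-transformers' `InformationRetrievalEvaluator`; the toy corpus, query, and `dim_256` name below are illustrative placeholders, not the actual evaluation data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Evaluate at a truncated dimension, e.g. 256
model = SentenceTransformer("AdamLucek/ModernBERT-embed-base-legal-MRL", truncate_dim=256)

# Illustrative placeholders; the real evaluation used held-out legal passages.
corpus = {"doc1": "protégé must demonstrate that the added mentor-protégé relationship ..."}
queries = {"q1": "What must the protégé demonstrate about the mentor-protégé relationship?"}
relevant_docs = {"q1": {"doc1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="dim_256")
results = evaluator(model)  # accuracy@k, precision@k, recall@k, ndcg@10, mrr@10, map@100
print(results)
```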
### Training Details

#### AdamLucek/legal-rag-positives-synthetic

- Dataset: [AdamLucek/legal-rag-positives-synthetic](https://huggingface.co/datasets/AdamLucek/legal-rag-positives-synthetic)
- Size: 5,822 training samples
- Columns: `positive` and `anchor`
- Approximate statistics based on the first 1000 samples:
| | positive | anchor |
|---|---|---|
| type | string | string |
| details | min: 15 tokens, mean: 97.6 tokens, max: 153 tokens | min: 8 tokens, mean: 16.68 tokens, max: 41 tokens |
- Samples:

| positive | anchor |
|---|---|
| infrastructure security information,” the information at issue must, “if disclosed . . . reveal vulnerabilities in Department of Defense critical infrastructure.” 10 U.S.C. § 130e(f). The closest the Department comes is asserting that the information “individually or in the aggregate, would enable | What type of information must reveal vulnerabilities if disclosed? |
| they have bid.” Oral Arg. Tr. at 42:18–20. Plaintiffs also assert that, should this Court require the Polaris Solicitations to consider price at the IDIQ level, such an adjustment “adds a solicitation requirement that would ne | What do plaintiffs assert about the Polaris Solicitations? |
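The card does not list the exact training configuration, but Matryoshka embedding models are typically fine-tuned with `MatryoshkaLoss` wrapping a contrastive objective such as `MultipleNegativesRankingLoss`. A minimal sketch under that assumption, using the dataset and dimensions named above (everything else is illustrative):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("nomic-ai/modernbert-embed-base")

# (anchor, positive) pairs; column order matters for the in-batch negatives loss
dataset = load_dataset("AdamLucek/legal-rag-positives-synthetic", split="train")
dataset = dataset.select_columns(["anchor", "positive"])

# Train the full embedding plus truncated prefixes at the evaluated dimensions
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(model=model, train_dataset=dataset, loss=loss)
trainer.train()
```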
## 📄 License

This model is licensed under the Apache 2.0 license.