ModernBERT-embed-base-legal-MRL Open-Source Model - Legal Text Similarity and Information Retrieval

Modernbert Embed Base Legal MRL

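Developed by AdamLucek

A sentence embedding model for the legal domain, fine-tuned from ModernBERT. It supports multi-dimension (Matryoshka) output and is suited to legal text similarity and information retrieval tasks.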

Text Embedding

Safetensors

English | Open-source license: Apache-2.0 | #legal-semantic-search #multi-level-embedding #long-text-processing

Downloads: 40

Release date: January 20, 2025

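Model Overview

This sentence embedding model is optimized for the legal domain. It converts text into 768-dimensional vectors and supports multi-dimension (Matryoshka) output at 768, 512, 256, 128, or 64 dimensions. It is particularly suited to semantic similarity scoring, information retrieval, and clustering of legal documents.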

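Model Features

Multi-dimension output

Supports Matryoshka embeddings at 768, 512, 256, 128, and 64 dimensions, so the output dimension can be chosen to fit the application

Legal-domain optimization

Fine-tuned on synthetic legal-domain data (AdamLucek/legal-rag-positives-synthetic) for legal text processing

Long-text support

Supports sequences of up to 8192 tokens, which suits long inputs such as legal documents

Efficient retrieval

Targets information retrieval, particularly legal document search (NDCG@10 of 0.63 at 768 dimensions on the test set)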
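The multi-dimension output described above amounts to keeping a prefix of the full vector. The following is a minimal sketch in plain Python, assuming the usual Matryoshka convention that the first N components form the N-dimensional embedding and that the truncated vector is rescaled to unit length before cosine comparison. The helper names and the toy 8-dimensional vectors are illustrative only and do not come from the model.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional vectors standing in for 768-dimensional embeddings.
query = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
doc = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

full = cosine(query, doc)  # similarity using all 8 dimensions

# Truncate both vectors to their first 4 dimensions, then compare again.
short_q = truncate_and_normalize(query, 4)
short_d = truncate_and_normalize(doc, 4)
short = cosine(short_q, short_d)

print(f"full={full:.4f} truncated={short:.4f}")
```

Scores generally shift after truncation, which is why the evaluation reports each dimension separately.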

Model Capabilities

Semantic textual similarity

Semantic search

Information retrieval

Text clustering

Feature extraction

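Use Cases

Legal document processing

Legal case retrieval

Quickly retrieve legal documents relevant to a query case

Achieves a normalized discounted cumulative gain at 10 (NDCG@10) of 0.63 on the test set

Contract clause matching

Identify similar clauses or related content within contracts

Information retrieval systems

Legal question answering systems

Build legal question answering systems on top of semantic search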
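The retrieval use cases above share one pattern: embed the corpus once, embed the query, and rank documents by cosine similarity. Here is a minimal sketch of the ranking step in plain Python. The clause identifiers and 3-dimensional vectors are made-up placeholders for real embeddings, and `top_k` is a hypothetical helper, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder embeddings; in practice these would come from model.encode().
corpus = {
    "clause_termination": [0.9, 0.1, 0.0],
    "clause_indemnity": [0.1, 0.9, 0.1],
    "clause_notice": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

results = top_k(query, corpus, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

For larger corpora, a vector index would normally replace this linear scan; the ranking criterion stays the same.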

🚀 ModernBERT Embed base Legal Matryoshka

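This is a sentence-transformers model fine-tuned from nomic-ai/modernbert-embed-base on the AdamLucek/legal-rag-positives-synthetic dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.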

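🚀 Quick Start

The model maps sentences and paragraphs to a 768-dimensional dense vector space for use in a range of natural language processing tasks. The sections below explain how to install and use it.

✨ Key Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.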

📦 Installation

First, install the Sentence Transformers library.

pip install -U sentence-transformers

💻 Usage Examples

Basic usage

Then load the model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("AdamLucek/ModernBERT-embed-base-legal-MRL")
# Run inference
sentences = [
    'contracting/contracting-assistance-programs/sba-mentor-protege-program (last visited Apr. 19, \n2023). \n5 \n \nprotégé must demonstrate that the added mentor-protégé relationship will not adversely affect the \ndevelopment of either protégé firm (e.g., the second firm may not be a competitor of the first \nfirm).”  13 C.F.R. § 125.9(b)(3).',
    'What must the protégé demonstrate about the mentor-protégé relationship?',
    'What discretion do district courts have regarding a defendant’s invocation of FOIA exemptions?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
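
The example above returns full 768-dimensional vectors. To get one of the smaller Matryoshka sizes directly, Sentence Transformers accepts a `truncate_dim` argument when constructing `SentenceTransformer`. The sketch below wraps that in a hypothetical helper, `load_legal_embedder`, which also validates the requested size against the dimensions this model card lists. It assumes a library version recent enough to support `truncate_dim`.

```python
# Dimensions listed for this model; other values are rejected by the helper.
SUPPORTED_DIMS = (768, 512, 256, 128, 64)


def load_legal_embedder(dim=256):
    """Load the model so that encode() returns `dim`-dimensional vectors."""
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}, got {dim}")
    # Imported here so the validation above runs even without the library installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(
        "AdamLucek/ModernBERT-embed-base-legal-MRL",
        truncate_dim=dim,
    )


# Usage (downloads the model from the Hub on first call):
# model = load_legal_embedder(256)
# embeddings = model.encode(["What must the protégé demonstrate?"])
# print(embeddings.shape)
```

Truncated vectors are not necessarily unit length, so it is safer to compare them with cosine similarity (for example via `model.similarity`) than with a raw dot product.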

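📚 Documentation

Model Details

Model Description

Attribute	Details
Model type	Sentence Transformer
Base model	nomic-ai/modernbert-embed-base
Maximum sequence length	8192 tokens
Output dimensionality	768 dimensions
Similarity function	Cosine similarity
Training dataset	AdamLucek/legal-rag-positives-synthetic
Language	en
License	apache-2.0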

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
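
The architecture printout shows three stages: a Transformer that produces one vector per token, a Pooling layer with `pooling_mode_mean_tokens` enabled, and a Normalize layer. The last two stages can be sketched in plain Python as below. This is an illustration of mean pooling and L2 normalization in general, not the library's implementation, and the 2-dimensional token vectors are toy values.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; padding is skipped."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        return total
    return [x / count for x in total]


def l2_normalize(vec):
    """Rescale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Three token vectors; the last one is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_vec = l2_normalize(pooled)
print(pooled, sentence_vec)
```

Because the final stage normalizes to unit length, the dot product of two full-size embeddings equals their cosine similarity.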

Evaluation

Metrics

Information Retrieval

Datasets: dim_768, dim_512, dim_256, dim_128, and dim_64
Evaluated with InformationRetrievalEvaluator

Metric	dim_768	dim_512	dim_256	dim_128	dim_64
cosine_accuracy@1	0.5286	0.5162	0.4822	0.4158	0.3122
cosine_accuracy@3	0.5719	0.5487	0.5286	0.4436	0.3509
cosine_accuracy@5	0.6646	0.6414	0.5981	0.5363	0.4359
cosine_accuracy@10	0.7311	0.7172	0.6785	0.6105	0.4791
cosine_precision@1	0.5286	0.5162	0.4822	0.4158	0.3122
cosine_precision@3	0.5142	0.4982	0.4699	0.3993	0.3091
cosine_precision@5	0.3941	0.3808	0.3586	0.3128	0.2504
cosine_precision@10	0.2329	0.2272	0.2147	0.1924	0.1498
cosine_recall@1	0.1788	0.174	0.1627	0.1426	0.105
cosine_recall@3	0.4894	0.4735	0.4493	0.3836	0.2955
cosine_recall@5	0.6121	0.5911	0.5569	0.4878	0.3931
cosine_recall@10	0.7184	0.7023	0.6642	0.5963	0.4681
cosine_ndcg@10	0.63	0.6138	0.5781	0.5109	0.3956
cosine_mrr@10	0.5741	0.5593	0.5249	0.4573	0.3509
cosine_map@100	0.6186	0.6022	0.5698	0.503	0.3939
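
To make the table easier to read, here is a small sketch of how these metrics are commonly computed for a single query with binary relevance; the reported figures would then be averages over all queries. The function names and the toy ranking are illustrative assumptions, not the evaluator's code. In the table, accuracy@1 (0.5286) is far above recall@1 (0.1788), which is consistent with queries having more than one relevant passage.

```python
import math


def accuracy_at_k(ranked, relevant, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: two relevant documents, found at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print("accuracy@1", accuracy_at_k(ranked, relevant, 1))
print("recall@3", recall_at_k(ranked, relevant, 3))
print("mrr@10", mrr_at_k(ranked, relevant, 10))
print("ndcg@10", round(ndcg_at_k(ranked, relevant, 10), 4))
```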

Training Details

AdamLucek/legal-rag-positives-synthetic

Dataset: AdamLucek/legal-rag-positives-synthetic
Size: 5,822 training samples
Columns: positive and anchor
Approximate statistics based on the first 1000 samples:

	positive	anchor
Type	string	string
Details	min: 15 tokens; mean: 97.6 tokens; max: 153 tokens	min: 8 tokens; mean: 16.68 tokens; max: 41 tokens

Samples:

positive	anchor
`infrastructure security information,” the information at issue must, “if disclosed . . . reveal vulner- abilities in Department of Defense critical infrastructure.” 10 U.S.C. § 130e(f). The closest the Department comes is asserting that the information “individually or in the aggregate, would enable`	`What type of information must reveal vulnerabilities if disclosed?`
`they have bid.” Oral Arg. Tr. at 42:18–20. Plaintiffs also assert that, should this Court require the Polaris Solicitations to consider price at the IDIQ level, such an adjustment “adds a solicitation requirement that would ne`	`What do Plaintiffs assert about the Polaris Solicitations?`