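nomic-embed-text-v2-moe-msmarco-bpr open-source model: compute semantic text similarity for free

Nomic Embed Text V2 Moe Msmarco Bpr

Developed by BlackBeenie

This is a sentence-transformers model fine-tuned from nomic-ai/nomic-embed-text-v2-moe. It maps text to a 768-dimensional dense vector space and can be used for tasks such as semantic textual similarity.

Text Embedding

Safetensors

#Long-text embedding #Semantic search optimization #BPR loss fine-tuning

Downloads: 41

Release date: 3/4/2025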

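Model Overview

This model maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and similar tasks.

Model Features

Long-text processing

Supports sequence lengths of up to 8192 tokens, which makes it suitable for long text.

Efficient semantic encoding

Maps text to a 768-dimensional dense vector space that retains rich semantic information.

Fine-tuning optimization

Fine-tuned from the nomic-ai/nomic-embed-text-v2-moe model to improve performance on semantic similarity tasks.

Model Capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering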

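Use Cases

Information retrieval

Similar-question matching

Match semantically similar questions in a question-answering system

Can identify questions that share a meaning even when they are worded differently

Content management

Document deduplication

Identify semantically similar document content

Can reduce the storage of duplicate content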
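
The deduplication use case can be sketched as a greedy filter over embeddings: a document is kept only if its cosine similarity to every previously kept document stays below a threshold. This is a minimal illustration, not part of the model card. The function names and the 0.9 threshold are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional vectors the model would produce.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def deduplicate(embeddings, threshold=0.9):
    """Greedily keep each item unless it is too similar to one already kept.

    `threshold` is a hypothetical value; tune it on real data.
    Returns the indices of the kept items.
    """
    unit = [l2_normalize(v) for v in embeddings]
    kept = []
    for i, vec in enumerate(unit):
        is_duplicate = any(
            sum(a * b for a, b in zip(vec, unit[j])) >= threshold for j in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept


# Toy vectors standing in for model embeddings (assumption for illustration).
docs = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.0, 1.0, 0.0]]
print(deduplicate(docs))  # [0, 2]
```

The greedy pass compares each document only against the kept set, so the result depends on input order. That is usually acceptable for storage deduplication but is worth knowing.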

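🚀 SentenceTransformer based on nomic-ai/nomic-embed-text-v2-moe

This is a sentence-transformers model fine-tuned from nomic-ai/nomic-embed-text-v2-moe. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

🚀 Quick Start

To use this model, first install the Sentence Transformers library.

pip install -U sentence-transformers

Then load the model and run inference.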

from sentence_transformers import SentenceTransformer

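# Download from the 🤗 Hub.
# Assumption: like the base model, the NomicBertModel backbone ships as custom
# code, so trust_remote_code=True is likely needed for loading to succeed.
model = SentenceTransformer("BlackBeenie/nomic-embed-text-v2-moe-msmarco-bpr", trust_remote_code=True)
# Run inference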
sentences = [
    'what services are offered by adult day care',
    'Consumer Guide to Long Term Care. Adult Day Care. Adult day care is a planned program offered in a group setting that provides services that improve or maintain health or functioning, and social activities for seniors and persons with disabilities.',
    'The Met Life Market survey of 2008 on adult day services states the average cost for adult day care services is $64 per day. There has been an increase of 5% in these services in the past year.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
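
For semantic search, the similarity scores are typically used to rank passages against a query. The sketch below shows that ranking step in plain Python on toy vectors; `cosine` and `rank_passages` are hypothetical helper names, not part of the Sentence Transformers API, and the 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs, top_k=None):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scored = sorted(
        ((cosine(query_vec, p), i) for i, p in enumerate(passage_vecs)),
        reverse=True,
    )
    return [(i, score) for score, i in scored[:top_k]]


# Toy vectors standing in for model embeddings (assumption for illustration).
query = [0.9, 0.1, 0.0]
passages = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.0, 1.0]]
ranking = rank_passages(query, passages)
print([i for i, _ in ranking])  # [1, 0, 2]
```

With the real model, `query` would be `embeddings[0]` (the question in the example above) and `passages` would be `embeddings[1:]`.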

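✨ Key Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.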

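📚 Detailed Documentation

Model Details

Model Description

Attribute	Details
Model type	Sentence Transformer
Base model	nomic-ai/nomic-embed-text-v2-moe
Maximum sequence length	8192 tokens
Output dimensionality	768 dimensions
Similarity function	Cosine similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture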

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
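
The Pooling module above has `pooling_mode_cls_token` set to True and the other modes set to False, so the sentence embedding is the vector at the first (CLS) token position rather than an average over tokens. The following sketch contrasts the two on toy data; the function names are hypothetical and this is not the library's implementation.

```python
def cls_pooling(token_embeddings):
    """Use the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])


def mean_pooling(token_embeddings, attention_mask):
    """Average the vectors of non-padding tokens (shown only for contrast)."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]


# Three token vectors of width 2; the last row is padding (toy data).
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(cls_pooling(tokens))         # [1.0, 2.0]
print(mean_pooling(tokens, mask))  # [2.0, 3.0]
```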

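Training Details

Training Dataset

Unnamed Dataset

Size: 498,970 training samples
Columns: sentence_0, sentence_1, sentence_2

Approximate statistics based on the first 1000 samples:

	sentence_0	sentence_1	sentence_2
Type	string	string	string
Details	min: 4 tokens, mean: 9.75 tokens, max: 24 tokens	min: 24 tokens, mean: 89.23 tokens, max: 241 tokens	min: 20 tokens, mean: 86.66 tokens, max: 280 tokens

Samples: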

sentence_0	sentence_1	sentence_2
`what the history of bluetooth`	`When asked about the name Bluetooth, I explained that Bluetooth was borrowed from the 10th century, second King of Denmark, King Harald Bluetooth; who was famous for uniting Scandinavia just as we intended to unite the PC and cellular industries with a short-range wireless link.`	`Technology: 1 How secure is a Bluetooth network? 2 What is Frequency-Hopping Spread Spectrum (FHSS)? 3 Will other RF (Radio Frequency) devices interfere with Bluetooth Devices? 4 Will Bluetooth and Wireless LAN (WLAN) interfere with each other? 5 What is the data throughput speed of a Bluetooth connection? 6 What is the range of Bluetooth 7 ... What kind of ...`
`how thin can a concrete slab be`	`Another issue that must be addressed is the added weight of the thin-slab. Poured gypsum thin-slabs typically add 13 to 15 pounds per square foot to the dead loading of a floor structure. Standard weight concrete thin slabs add about 18 pounds per square foot (at 1.5 thickness).`	`Find the Area in square feet: We will use a concrete slab pour for our example. Let's say that we need to figure out the yardage for a slab that will be 15 feet long by 10 feet wide and 4 inches thick. First we find the area by multiplying the length times the width. 1 15 feet X 10 feet = 150 square feet.`
`how long to cook eggs to hard boil`	`This method works best if the eggs are in a single layer, but you can double them up as well, you'll just need to add more time to the steaming time. 3 Set your timer for 6 minutes for soft boiled, 10 minutes for hard boiled with a still translucent and bright yolk, or 12-15 minutes for cooked-through hard boiled.`	`Hard-Steamed Eggs. Fill a pot that can comfortably hold your steamer with the lid on with 1 to 2 inches of water. Bring to a rolling boil, 212 degrees Fahrenheit. Place your eggs in a metal steamer, and lower the basket into the pot. The eggs should sit above the boiling water. Cover and cook for 12 minutes. Hard-steamed eggs, like hard-boiled eggs, are eggs that are cooked until the egg yolk is fully set and has turned to a chalky texture.`

Loss function: beir.losses.bpr_loss.BPRLoss
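
The card does not describe the loss beyond its name. Assuming it refers to a Binary Passage Retriever style objective, and assuming each training row is a (query, relevant passage, non-relevant passage) triplet as the samples suggest, the idea can be sketched as a cross-entropy term over dense scores plus a margin ranking term over tanh-relaxed binary codes. Everything below is a simplified illustration under those assumptions: the function names, `gamma`, and `margin` values are hypothetical, and this is not the beir implementation.

```python
import math


def tanh_binarize(vec, step, gamma=0.1):
    """Relax sign(vec) to tanh(scale * vec); the scale grows with the training step."""
    scale = math.sqrt(1.0 + step * gamma)
    return [math.tanh(scale * x) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bpr_style_loss(query, positive, negatives, step=0, margin=2.0):
    """Dense cross-entropy plus a margin ranking term on relaxed binary codes."""
    passages = [tanh_binarize(p, step) for p in [positive] + negatives]

    # Dense term: softmax cross-entropy with the positive passage at index 0.
    scores = [dot(query, p) for p in passages]
    max_s = max(scores)
    log_sum_exp = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
    dense_loss = log_sum_exp - scores[0]

    # Binary term: the relaxed query code should score the positive
    # above each negative by at least `margin`.
    q_bin = tanh_binarize(query, step)
    bin_scores = [dot(q_bin, p) for p in passages]
    hinge = [max(0.0, margin - (bin_scores[0] - s)) for s in bin_scores[1:]]
    binary_loss = sum(hinge) / len(hinge)

    return dense_loss + binary_loss


# Toy 2-dimensional vectors (assumption for illustration).
q, pos, neg = [1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]
print(bpr_style_loss(q, pos, [neg]) < bpr_style_loss(q, neg, [pos]))  # True
```

The loss is lower when the passage treated as positive actually aligns with the query, which is the behavior a ranking objective of this kind is meant to reward.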

Training Hyperparameters

Non-default hyperparameters

eval_strategy: steps
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
num_train_epochs: 5
fp16: True
multi_dataset_batch_sampler: round_robin

All hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 5
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin

Training log

Epoch	Step	Training Loss
0.0321	500	0.3396
0.0641	1000	0.2094
0.0962	1500	0.21
0.1283	2000	0.1955
0.1603	2500	0.1989
0.1924	3000	0.1851
0.2245	3500	0.1839
0.2565	4000	0.1859
0.2886	4500	0.1892
0.3207	5000	0.1865
0.3527	5500	0.1773
0.3848	6000	0.1796
0.4169	6500	0.1929
0.4489	7000	0.1829
0.4810	7500	0.172
0.5131	8000	0.1792
0.5451	8500	0.1747
0.5772	9000	0.1802
0.6092	9500	0.1856
0.6413	10000	0.1751
0.6734	10500	0.173
0.7054	11000	0.1774
0.7375	11500	0.1722
0.7696	12000	0.1825
0.8016	12500	0.1714
0.8337	13000	0.1732
0.8658	13500	0.167
0.8978	14000	0.1792
0.9299	14500	0.1697
0.9620	15000	0.1682
0.9940	15500	0.1764
1.0	15593	-
1.0261	16000	0.0875
1.0582	16500	0.0798
1.0902	17000	0.0764
1.1223	17500	0.0783
1.1544	18000	0.0759
1.1864	18500	0.0834
1.2185	19000	0.082
1.2506	19500	0.0827
1.2826	20000	0.0876
1.3147	20500	0.0819
1.3468	21000	0.0841
1.3788	21500	0.0815
1.4109	22000	0.0819
1.4430	22500	0.0883
1.4750	23000	0.0826
1.5071	23500	0.0837
1.5392	24000	0.086
1.5712	24500	0.0806
1.6033	25000	0.0918
1.6353	25500	0.0885
1.6674	26000	0.0885
1.6995	26500	0.088
1.7315	27000	0.0843
1.7636	27500	0.0915
1.7957	28000	0.0843
1.8277	28500	0.0868
1.8598	29000	0.0857
1.8919	29500	0.0931
1.9239	30000	0.0852
1.9560	30500	0.0913
1.9881	31000	0.0857
2.0	31186	-
2.0201	31500	0.0547
2.0522	32000	0.0459
2.0843	32500	0.0451
2.1163	33000	0.0407
2.1484	33500	0.0469
2.1805	34000	0.0459
2.2125	34500	0.0508
2.2446	35000	0.0508
2.2767	35500	0.0518
2.3087	36000	0.0552
2.3408	36500	0.0491
2.3729	37000	0.0575
2.4049	37500	0.0558
2.4370	38000	0.0475
2.4691	38500	0.0486
2.5011	39000	0.0536
2.5332	39500	0.0559
2.5653	40000	0.0524
2.5973	40500	0.0496
2.6294	41000	0.0486
2.6615	41500	0.0526
2.6935	42000	0.0443
2.7256	42500	0.058
2.7576	43000	0.0543
2.7897	43500	0.0527
2.8218	44000	0.0528
2.8538	44500	0.0573
2.8859	45000	0.0628
2.9180	45500	0.0443
2.9500	46000	0.0531
2.9821	46500	0.0554
3.0	46779	-
3.0142	47000	0.0346
3.0462	47500	0.0288
3.0783	48000	0.0219
3.1104	48500	0.0259
3.1424	49000	0.0237
3.1745	49500	0.0307
3.2066	50000	0.0234
3.2386	50500	0.0312
3.2707	51000	0.0297
3.3028	51500	0.0299
3.3348	52000	0.0326
3.3669	52500	0.0266
3.3990	53000	0.0296
3.4310	53500	0.0289
3.4631	54000	0.0216
3.4952	54500	0.0289
3.5272	55000	0.033
3.5593	55500	0.0248
3.5914	56000	0.0246
3.6234	56500	0.0287
3.6555	57000	0.0267
3.6876	57500	0.0285
3.7196	58000	0.0288
3.7517	58500	0.0283
3.7837	59000	0.0283
3.8158	59500	0.029
3.8479	60000	0.0327
3.8799	60500	0.0239
3.9120	61000	0.0356
3.9441	61500	0.0323
3.9761	62000	0.0213
4.0	62372	-
4.0082	62500	0.0275
4.0403	63000	0.0125
4.0723	63500	0.0183
4.1044	64000	0.0138
4.1365	64500	0.0174
4.1685	65000	0.0088
4.2006	65500	0.0126
4.2327	66000	0.0134
4.2647	66500	0.0099
4.2968	67000	0.0188
4.3289	67500	0.0112
4.3609	68000	0.0156
4.3930	68500	0.0175
4.4251	69000	0.0128
4.4571	69500	0.0154
4.4892	70000	0.0127
4.5213	70500	0.0131
4.5533	71000	0.017
4.5854	71500	0.0116
4.6175	72000	0.0137
4.6495	72500	0.0156
4.6816	73000	0.0155
4.7137	73500	0.0078
4.7457	74000	0.0152
4.7778	74500	0.0089
4.8099	75000	0.0116
4.8419	75500	0.0144
4.8740	76000	0.0112
4.9060	76500	0.0108
4.9381	77000	0.0188
4.9702	77500	0.0109
5.0	77965	-
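
The step counts in the log follow from the dataset size and batch size reported above. The short check below reproduces them and also sketches the learning rate under the listed `lr_scheduler_type: linear` with zero warmup. It assumes a single training device, which the card does not state explicitly; `linear_lr` is a hypothetical helper, not a library function.

```python
import math

NUM_SAMPLES = 498_970  # training samples reported in the card
BATCH_SIZE = 32        # per_device_train_batch_size (single device assumed)
EPOCHS = 5             # num_train_epochs
BASE_LR = 5e-05        # learning_rate

# dataloader_drop_last is False, so the final partial batch counts as a step.
steps_per_epoch = math.ceil(NUM_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step):
    """Linear decay from BASE_LR to 0 over total_steps, with no warmup."""
    return BASE_LR * max(0.0, (total_steps - step) / total_steps)


print(steps_per_epoch)  # 15593
print(total_steps)      # 77965
```

Both values match the epoch boundary rows in the log (step 15593 at epoch 1.0 and step 77965 at epoch 5.0).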

Framework versions

Python: 3.11.11
Sentence Transformers: 3.4.1
Transformers: 4.49.0
PyTorch: 2.5.1+cu124
Accelerate: 1.3.0
Datasets: 3.3.2
Tokenizers: 0.21.0

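🔧 Technical Details

This model uses Sentence Transformers to map sentences and paragraphs to a 768-dimensional dense vector space. The base model is nomic-ai/nomic-embed-text-v2-moe, and the maximum sequence length is 8192 tokens. The output has 768 dimensions, and cosine similarity is the similarity function.

Training used an unnamed dataset of 498,970 samples with beir.losses.bpr_loss.BPRLoss as the loss function. Among the non-default hyperparameters, eval_strategy was set to steps, per_device_train_batch_size and per_device_eval_batch_size were set to 32, and num_train_epochs was set to 5.

📄 License

No detailed license information is provided.

Citation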

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}