Snoweu_v2 Open-Source Sentence Embedding Model - Free for Sentence Similarity Calculation and Feature Extraction

Snoweu V2

Developed by fjavigv

Sentence embedding model fine-tuned from Snowflake/snowflake-arctic-embed-m-v1.5 for sentence similarity calculation and feature extraction

Text Embedding

Safetensors

#High-precision sentence similarity #Enterprise strategy semantic matching #Sustainable development information retrieval

Downloads 604

Release Time: 3/19/2025

Model Overview

This model is a sentence transformer designed for computing sentence similarity and extracting sentence features. It was trained with MatryoshkaLoss wrapping MultipleNegativesRankingLoss and is suited to tasks such as information retrieval and semantic search.

Model Features

Efficient sentence embedding

Converts sentences and paragraphs into 768-dimensional dense vectors for similarity calculation and semantic analysis

Combined loss functions

Trained with MatryoshkaLoss wrapping MultipleNegativesRankingLoss, so embeddings are optimized at 768, 512, 256, 128, and 64 dimensions

Training data

Trained on 29,911 sentence pairs (columns sentence_0 and sentence_1)

Model Capabilities

Sentence similarity calculation

Semantic feature extraction

Information retrieval

Semantic search

Text matching

Use Cases

Information retrieval

Document similarity search

Finding the most similar documents to a query sentence within a large corpus

Reached a cosine_accuracy@10 of 0.9873 when evaluated with InformationRetrievalEvaluator

Business analysis

Business strategy matching

Identifying document passages relevant to specific business strategies

🚀 SentenceTransformer based on Snowflake/snowflake-arctic-embed-m-v1.5

This model is a fine-tuned sentence-transformers model derived from [Snowflake/snowflake-arctic-embed-m-v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

✨ Features

Maps sentences and paragraphs to a 768-dimensional dense vector space.
Suitable for various natural language processing tasks such as semantic textual similarity, semantic search, and more.

📦 Installation

First, install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'What is the definition of a preliminary economic assessment in the context of evaluating projects for the recovery of critical raw materials?',
    '(39)\n\n‘preliminary economic assessment’ means an early-stage, conceptual assessment of the potential economic viability of a project for the recovery of critical raw materials from extractive waste;\n\n(40)\n\n‘magnetic resonance imaging device’ means a non-invasive medical device that uses magnetic fields to make anatomical images or any other device that uses magnetic fields to make images of the inside of object;\n\n(41)\n\n‘wind energy generator’ means the part of an onshore or offshore wind turbine that converts the mechanical energy of the rotor into electrical energy;\n\n(42)',
    'For the purposes of the first subparagraph of this paragraph, insurance undertakings referred to in point (a) of the first subparagraph of Article 1(3) of this Directive that are part of a group, on the basis of financial relationships referred to in point (c)(ii) of Article 212(1) of Directive 2009/138/EC, and which are subject to group supervision in accordance with points (a) to (c) of Article 213(2) of that Directive shall be treated as subsidiary undertakings of the parent undertaking of that group.\n\n9.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
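
Once embeddings are available, semantic search comes down to ranking corpus vectors by their similarity to the query vector. The following is a minimal sketch of that ranking step, assuming unit-length embeddings (the Normalize() layer in the architecture suggests this) and using small hand-written vectors in place of real `model.encode` output. `rank_by_similarity` is a hypothetical helper, not part of the Sentence Transformers API.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(
    query: Sequence[float], corpus: Sequence[Sequence[float]], top_k: int = 3
) -> List[Tuple[int, float]]:
    """Return (corpus_index, score) pairs, highest score first."""
    scored = [(i, dot(query, doc)) for i, doc in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy unit vectors standing in for model.encode(...) output (assumption).
query_vec = [1.0, 0.0]
corpus_vecs = [[0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]

ranking = rank_by_similarity(query_vec, corpus_vecs, top_k=2)
print(ranking)
```

With real embeddings, the same ranking can be read off one row of the matrix returned by `model.similarity`.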

📚 Documentation

Model Details

Model Description

Property	Details
Model Type	Sentence Transformer
Base model	[Snowflake/snowflake-arctic-embed-m-v1.5](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5)
Maximum Sequence Length	512 tokens
Output Dimensionality	768 dimensions
Similarity Function	Cosine Similarity
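
The loss configuration later in this card lists Matryoshka dimensions of 768, 512, 256, 128, and 64, which suggests that the leading slice of each embedding is still usable at those sizes. Here is a minimal sketch of truncating a vector and rescaling it to unit length so that cosine similarity still behaves as expected. `truncate_and_normalize` is a hypothetical helper, and the 4-dimensional vector is a toy stand-in for a real embedding.

```python
import math
from typing import List, Sequence

# Dimensions listed in the MatryoshkaLoss configuration of this model card.
MATRYOSHKA_DIMS = (768, 512, 256, 128, 64)


def truncate_and_normalize(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` values and rescale the result to unit length."""
    if dim <= 0 or dim > len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero slice cannot be rescaled
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a 768-dimensional embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_normalize(full, 2)
print(short)
```

Sentence Transformers also accepts a `truncate_dim` argument when loading a model. Check the documentation for the installed version before relying on it.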

Model Sources

Documentation: Sentence Transformers Documentation
Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
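
The architecture above chains three steps: a transformer produces one vector per token, the pooling layer keeps the CLS (first) token vector because `pooling_mode_cls_token` is True, and Normalize() rescales it to unit length. The last two steps can be sketched as follows, with toy numbers in place of real BertModel outputs. The function names are illustrative, not library APIs.

```python
import math
from typing import List, Sequence


def cls_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """CLS pooling: use the first token's vector as the sentence vector."""
    return list(token_embeddings[0])


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Rescale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy token embeddings: 3 tokens x 2 dimensions (the real model uses 768).
token_embeddings = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]

sentence_vector = l2_normalize(cls_pool(token_embeddings))
print(sentence_vector)  # [0.6, 0.8]
```

Because the output has unit length, the dot product of two sentence vectors equals their cosine similarity.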

Evaluation

Metrics

Information Retrieval

Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.8225
cosine_accuracy@3	0.9526
cosine_accuracy@5	0.9725
cosine_accuracy@10	0.9873
cosine_precision@1	0.8225
cosine_precision@3	0.3175
cosine_precision@5	0.1945
cosine_precision@10	0.0987
cosine_recall@1	0.8225
cosine_recall@3	0.9526
cosine_recall@5	0.9725
cosine_recall@10	0.9873
cosine_ndcg@10	0.9141
cosine_mrr@10	0.8896
cosine_map@100	0.8903
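
In the table above, accuracy@k equals recall@k and precision@k is close to accuracy@k divided by k (for example, 0.9873 / 10 ≈ 0.0987). That is the pattern one would expect if every query has exactly one relevant passage. This is an inference from the numbers, not something the card states. The sketch below shows how such metrics can be computed for toy rankings. `metrics_at_k` is a hypothetical helper, not the evaluator's own code.

```python
from typing import Dict, Sequence, Set


def metrics_at_k(
    rankings: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int
) -> Dict[str, float]:
    """Average accuracy, precision, recall and MRR at cutoff k over all queries."""
    n = len(rankings)
    acc = prec = rec = mrr = 0.0
    for ranked, gold in zip(rankings, relevant):
        top = list(ranked[:k])
        hits = [doc for doc in top if doc in gold]
        acc += 1.0 if hits else 0.0      # at least one relevant doc in the top k
        prec += len(hits) / k            # share of the top k that is relevant
        rec += len(hits) / len(gold)     # share of relevant docs that were found
        for rank, doc in enumerate(top, start=1):
            if doc in gold:
                mrr += 1.0 / rank        # reciprocal rank of the first hit
                break
    return {"accuracy": acc / n, "precision": prec / n, "recall": rec / n, "mrr": mrr / n}


# Two toy queries, each with exactly one relevant document (assumption).
rankings = [["d1", "d2", "d3"], ["d5", "d4", "d6"]]
relevant = [{"d1"}, {"d4"}]

at_1 = metrics_at_k(rankings, relevant, k=1)
at_3 = metrics_at_k(rankings, relevant, k=3)
print(at_1)
print(at_3)
```

With a single relevant document per query, precision@k can never exceed 1/k, which is why the precision values in the table fall as k grows even though accuracy rises.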

Training Details

Training Dataset

Unnamed Dataset

Size: 29,911 training samples
Columns: sentence_0 and sentence_1
Approximate statistics based on the first 1000 samples:

	sentence_0	sentence_1
type	string	string
details	min: 13 tokens mean: 41.63 tokens max: 252 tokens	min: 4 tokens mean: 233.72 tokens max: 512 tokens

Samples:

sentence_0	sentence_1
`What measures must Member States take to ensure that workers who believe they have been discriminated against in terms of equal pay can establish their case before a competent authority or national court?`	Article 18 Shift of burden of proof 1. Member States shall take the appropriate measures, in accordance with their national judicial systems, to ensure that, when workers who consider themselves wronged because the principle of equal pay has not been applied to them establish before a competent authority or national court facts from which it may be presumed that there has been direct or indirect discrimination, it shall be for the respondent to prove that there has been no direct or indirect discrimination in relation to pay. 2. Member States shall ensure that, in administrative procedures or court proceedings regarding alleged direct or indirect discrimination in relation to pay, where an employer has not implemented the pay transparency obligations set out in Articles 5, 6, 7, 9 and 10, it is for the employer to prove that there has been no such discrimination. The first subparagraph of this paragraph shall not apply where the employer proves that the infringement of the obligati...
`What are the key considerations for recognizing and addressing discrimination in the context of compensation and penalties, particularly in relation to the gender pay gap?`	discrimination, in particular for substantive and procedural purposes, including to recognise the existence of discrimination, to decide on the appropriate comparator, to assess the proportionality, and to determine, where relevant, the level of compensation awarded or penalties imposed. An intersectional approach is important for understanding and addressing the gender pay gap. This clarification should not change the scope of employers’ obligations in regard to the pay transparency measures under this Directive. In particular, employers should not be required to gather data related to protected grounds other than sex.
`What is the process for aircraft operators and shipping companies regarding the surrendering of allowances in relation to their total emissions from the previous calendar year?`	(b) each aircraft operator surrenders a number of allowances that is equal to its total emissions during the preceding calendar year, as verified in accordance with Article 15; (c) each shipping company surrenders a number of allowances that is equal to its total emissions during the preceding calendar year, as verified in accordance with Article 3ge. Member States, administering Member States and administering authorities in respect of a shipping company shall ensure that allowances surrendered in accordance with the first subparagraph are subsequently cancelled. ▼M15 3 - e.

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
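
The configuration above can be read as: compute MultipleNegativesRankingLoss on the first 768, 512, 256, 128, and 64 dimensions of each embedding and add the results with equal weights. The pure-Python sketch below illustrates that idea on toy 4-dimensional vectors. It is not the library implementation, and the similarity scale of 20 is an assumption because the card does not state it.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors: Sequence[Vector], positives: Sequence[Vector], scale: float = 20.0) -> float:
    """In-batch negatives: anchor i should score highest against positive i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(anchors)


def matryoshka_loss(
    anchors: Sequence[Vector],
    positives: Sequence[Vector],
    dims: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Weighted sum of the base loss over leading slices of each embedding."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        loss += weight * mnr_loss([a[:dim] for a in anchors], [p[:dim] for p in positives])
    return loss


# Toy batch of two (anchor, positive) pairs with 4-dimensional vectors (assumption).
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]

aligned = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
swapped = matryoshka_loss(anchors, positives[::-1], dims=[4, 2], weights=[1, 1])
print(aligned, swapped)
```

The loss is lower when each anchor is paired with its own positive than when the positives are swapped, which is the behavior the training objective rewards at every listed dimension.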

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 6
per_device_eval_batch_size: 6
num_train_epochs: 4
multi_dataset_batch_sampler: round_robin
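
These settings, together with the dataset size and the values in the full hyperparameter list (learning_rate 5e-05, linear scheduler, warmup_steps 0), are enough to estimate the training schedule. The arithmetic below assumes a single device and no gradient accumulation. The card does not state the device count, but the result matches the step 4986 logged at epoch 1.0.

```python
import math

TRAIN_SAMPLES = 29_911  # from the Training Dataset section
BATCH_SIZE = 6          # per_device_train_batch_size
EPOCHS = 4              # num_train_epochs
BASE_LR = 5e-05         # learning_rate
NUM_DEVICES = 1         # assumption: not stated in the card

steps_per_epoch = math.ceil(TRAIN_SAMPLES / (BATCH_SIZE * NUM_DEVICES))
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step: int) -> float:
    """Linear decay from BASE_LR to 0 with no warmup."""
    return BASE_LR * max(0.0, (total_steps - step) / total_steps)


print(steps_per_epoch)  # 4986
print(total_steps)      # 19944
print(linear_lr(total_steps // 2))
```

Under these assumptions a full run would take 19,944 optimizer steps, with the learning rate halved at the midpoint.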

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 6
per_device_eval_batch_size: 6
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 4
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin

Training Logs

Epoch	Step	Training Loss	cosine_ndcg@10
0.0201	100	-	0.6629
0.0401	200	-	0.7746
0.0602	300	-	0.8233
0.0802	400	-	0.8515
0.1003	500	0.4694	0.8621
0.1203	600	-	0.8680
0.1404	700	-	0.8733
0.1604	800	-	0.8774
0.1805	900	-	0.8757
0.2006	1000	0.1568	0.8795
0.2206	1100	-	0.8808
0.2407	1200	-	0.8789
0.2607	1300	-	0.8796
0.2808	1400	-	0.8822
0.3008	1500	0.1015	0.8821
0.3209	1600	-	0.8814
0.3410	1700	-	0.8756
0.3610	1800	-	0.8822
0.3811	1900	-	0.8848
0.4011	2000	0.0836	0.8843
0.4212	2100	-	0.8841
0.4412	2200	-	0.8803
0.4613	2300	-	0.8851
0.4813	2400	-	0.8818
0.5014	2500	0.0865	0.8849
0.5215	2600	-	0.8877
0.5415	2700	-	0.8806
0.5616	2800	-	0.8832
0.5816	2900	-	0.8930
0.6017	3000	0.0842	0.8928
0.6217	3100	-	0.8882
0.6418	3200	-	0.8858
0.6619	3300	-	0.8863
0.6819	3400	-	0.8828
0.7020	3500	0.0669	0.8839
0.7220	3600	-	0.8835
0.7421	3700	-	0.8854
0.7621	3800	-	0.8839
0.7822	3900	-	0.8882
0.8022	4000	0.0695	0.8871
0.8223	4100	-	0.8854
0.8424	4200	-	0.8822
0.8624	4300	-	0.8847
0.8825	4400	-	0.8863
0.9025	4500	0.0575	0.8819
0.9226	4600	-	0.8815
0.9426	4700	-	0.8836
0.9627	4800	-	0.8862
0.9828	4900	-	0.8889
1.0	4986	-	0.8927
1.0028	5000	0.0712	0.8935
1.0229	5100	-	0.8890
1.0429	5200	-	0.8919
1.0630	5300	-	0.8949
1.0830	5400	-	0.8950
1.1031	5500	0.0485	0.8934
1.1231	5600	-	0.8964
1.1432	5700	-	0.8953
1.1633	5800	-	0.8942
1.1833	5900	-	0.8929
1.2034	6000	0.0465	0.8912
1.2234	6100	-	0.8890
1.2435	6200	-	0.8914
1.2635	6300	-	0.8847
1.2836	6400	-	0.8873
1.3037	6500	0.0324	0.8912
1.3237	6600	-	0.8956
1.3438	6700	-	0.8954
1.3638	6800	-	0.8946
1.3839	6900	-	0.8931
1.4039	7000	0.0205	0.8951
1.4240	7100	-	0.8967
1.4440	7200	-	0.8960
1.4641	7300	-	0.8943
1.4842	7400	-	0.9003
1.5042	7500	0.0489	0.8946
1.5243	7600	-	0.8986
1.5443	7700	-	0.8945
1.5644	7800	-	0.8960
1.5844	7900	-	0.8987
1.6045	8000	0.039	0.8991
1.6245	8100	-	0.8959
1.6446	8200	-	0.8948
1.6647	8300	-	0.8933
1.6847	8400	-	0.8926
1.7048	8500	0.0297	0.8937
1.7248	8600	-	0.8974
1.7449	8700	-	0.8977
1.7649	8800	-	0.8973
1.7850	8900	-	0.8989
1.8051	9000	0.0248	0.8974
1.8251	9100	-	0.8980
1.8452	9200	-	0.8970
1.8652	9300	-	0.8997
1.8853	9400	-	0.9007
1.9053	9500	0.0534	0.9009
1.9254	9600	-	0.9015
1.9454	9700	-	0.9014
1.9655	9800	-	0.9008
1.9856	9900	-	0.9024
2.0	9972	-	0.9052
2.0056	10000	0.0295	0.9041
2.0257	10100	-	0.9009
2.0457	10200	-	0.9030
2.0658	10300	-	0.9028
2.0858	10400	-	0.9051
2.1059	10500	0.027	0.9063
2.1260	10600	-	0.9059
2.1460	10700	-	0.9044
2.1661	10800	-	0.9024
2.1861	10900	-	0.9005
2.2062	11000	0.0201	0.8996
2.2262	11100	-	0.9037
2.2463	11200	-	0.9029
2.2663	11300	-	0.9047
2.2864	11400	-	0.9030
2.3065	11500	0.0097	0.9041
2.3265	11600	-	0.9011
2.3466	11700	-	0.9000
2.3666	11800	-	0.8972
2.3867	11900	-	0.8985
2.4067	12000	0.0165	0.8979
2.4268	12100	-	0.8996
2.4469	12200	-	0.9026
2.4669	12300	-	0.9034
2.4870	12400	-	0.9054
2.5070	12500	0.0165	0.9029
2.5271	12600	-	0.9052
2.5471	12700	-	0.9057
2.5672	12800	-	0.9059
2.5872	12900	-	0.9092
2.6073	13000	0.0144	0.9081
2.6274	13100	-	0.9095
2.6474	13200	-	0.9102
2.6675	13300	-	0.9113
2.6875	13400	-	0.9103
2.7076	13500	0.0159	0.9105
2.7276	13600	-	0.9073
2.7477	13700	-	0.9084
2.7677	13800	-	0.9080
2.7878	13900	-	0.9083
2.8079	14000	0.0183	0.9083
2.8279	14100	-	0.9070
2.8480	14200	-	0.9085
2.8680	14300	-	0.9078
2.8881	14400	-	0.9075
2.9081	14500	0.0257	0.9073
2.9282	14600	-	0.9098
2.9483	14700	-	0.9089
2.9683	14800	-	0.9097
2.9884	14900	-	0.9079
3.0	14958	-	0.9081
3.0084	15000	0.0144	0.9084
3.0285	15100	-	0.9083
3.0485	15200	-	0.9078
3.0686	15300	-	0.9079
3.0886	15400	-	0.9089
3.1087	15500	0.0082	0.9093
3.1288	15600	-	0.9098
3.1488	15700	-	0.9106
3.1689	15800	-	0.9103
3.1889	15900	-	0.9110
3.2090	16000	0.0185	0.9117
3.2290	16100	-	0.9116
3.2491	16200	-	0.9125
3.2692	16300	-	0.9111
3.2892	16400	-	0.9109
3.3093	16500	0.0105	0.9125
3.3293	16600	-	0.9117
3.3494	16700	-	0.9118
3.3694	16800	-	0.9117
3.3895	16900	-	0.9137
3.4095	17000	0.019	0.9134
3.4296	17100	-	0.9129
3.4497	17200	-	0.9126
3.4697	17300	-	0.9133
3.4898	17400	-	0.9136
3.5098	17500	0.0109	0.9120
3.5299	17600	-	0.9124
3.5499	17700	-	0.9122
3.5700	17800	-	0.9129
3.5901	17900	-	0.9132
3.6101	18000	0.0207	0.9139
3.6302	18100	-	0.9134
3.6502	18200	-	0.9135
3.6703	18300	-	0.9139
3.6903	18400	-	0.9141
3.7104	18500	0.0105	0.9139
3.7304	18600	-	0.9138
3.7505	18700	-	0.9136
3.7706	18800	-	0.9141

Framework Versions

Python: 3.10.11
Sentence Transformers: 3.4.1
Transformers: 4.48.1
PyTorch: 2.4.0+cu121
Accelerate: 1.4.0
Datasets: 3.3.2
Tokenizers: 0.21.0

📄 License

The model is built on the Sentence Transformers framework. For the license terms of Sentence Transformers, refer to its official documentation.

📖 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
