Intfloat Triplet V2
This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-small, designed to map sentences and paragraphs into a 384-dimensional dense vector space, supporting tasks like semantic textual similarity and semantic search.
Downloads: 19
Release Time: 2/16/2025
Model Overview
This model is fine-tuned from intfloat/multilingual-e5-small on the all-nli-tr dataset, primarily for Turkish sentence-similarity calculation and feature extraction.
Model Features
- Multilingual support: built on the multilingual-e5-small base model, so it can process text in many languages.
- High-dimensional vector space: maps text into a 384-dimensional dense vector space that captures deep semantic features.
- Efficient training: optimized with Multiple Negatives Ranking Loss on 482,091 training samples.
Model Capabilities
- Semantic textual similarity calculation
- Semantic search
- Paraphrase mining
- Text classification
- Text clustering
Use Cases
Text processing
- Similar sentence retrieval: find semantically similar sentences in a document repository (cosine accuracy reaches 0.928 on the all-nli-dev triplet evaluation); see the semantic search sketch under Usage Examples below.
- Q&A system: match user questions with answers in a knowledge base.
🚀 SentenceTransformer based on intfloat/multilingual-e5-small
This model is fine-tuned from intfloat/multilingual-e5-small on the all-nli-tr dataset. It maps sentences and paragraphs to a 384-dimensional dense vector space, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
✨ Features
- Multilingual Capability: Based on a multilingual base model, it can handle various languages effectively.
- Semantic Understanding: Maps sentences and paragraphs to a 384-dimensional dense vector space for semantic analysis.
- Versatile Applications: Suitable for multiple NLP tasks such as semantic similarity, search, and clustering.
📦 Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("x1saint/intfloat-triplet-v2")

# Run inference
sentences = [
    'Ve gerçekten, baba haklıydı, oğlu zaten her şeyi tecrübe etmişti, her şeyi denedi ve daha az ilgileniyordu.',
    'Oğlu her şeye olan ilgisini kaybediyordu.',
    'Baba oğlunun tecrübe için hala çok şey olduğunu biliyordu.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
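Advanced Usage: Semantic Search
The use cases listed earlier mention retrieval and Q&A-style matching. Below is a minimal semantic-search sketch built on `sentence_transformers.util.semantic_search`; the corpus and query strings are made-up illustrations, not taken from the training data.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("x1saint/intfloat-triplet-v2")

# Hypothetical knowledge-base entries and a user question (illustrative only).
corpus = [
    "Kargo genellikle üç iş günü içinde teslim edilir.",
    "İade talepleri 14 gün içinde yapılmalıdır.",
    "Üyelik ücretsizdir ve e-posta ile açılır.",
]
query = ["Siparişim ne zaman elime ulaşır?"]

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embeddings = model.encode(query, convert_to_tensor=True)

# Retrieve the top-2 corpus entries per query, ranked by cosine similarity.
hits = util.semantic_search(query_embeddings, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```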
📚 Documentation
Model Details
Model Description
Property | Details |
---|---|
Model Type | Sentence Transformer |
Base model | intfloat/multilingual-e5-small |
Maximum Sequence Length | 512 tokens |
Output Dimensionality | 384 dimensions |
Similarity Function | Cosine Similarity |
Training Dataset | all-nli-tr |
Language | tr |
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
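Since the final Normalize() module produces unit-length embeddings, dot product and cosine similarity yield the same scores. A small sanity-check sketch (the test sentence is arbitrary):
```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("x1saint/intfloat-triplet-v2")
emb = model.encode(["Bu bir deneme cümlesidir."], convert_to_tensor=True)

# Normalize() makes every embedding unit length ...
print(torch.linalg.vector_norm(emb, dim=1))  # ~1.0 up to float error

# ... so the cosine similarity used by model.similarity equals a plain dot product.
print(model.similarity(emb, emb))  # ~1.0 on the diagonal
print(emb @ emb.T)                 # same values
```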
Evaluation
Metrics - Triplet
- Dataset: all-nli-dev
- Evaluated with: TripletEvaluator
Metric | Value |
---|---|
cosine_accuracy | 0.928 |
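cosine_accuracy is the fraction of evaluation triplets for which the anchor embedding is closer (by cosine similarity) to the positive than to the negative. Below is a hedged sketch of how such an evaluation could be reproduced with `TripletEvaluator`; the dataset id `emrecan/all-nli-tr` and its split and column names are assumptions inferred from the dataset name in this card, so adjust them to the actual dataset.
```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("x1saint/intfloat-triplet-v2")

# Assumed Hub id and split; the card only states "all-nli-tr at daeabfb".
dev = load_dataset("emrecan/all-nli-tr", split="dev")

evaluator = TripletEvaluator(
    anchors=dev["anchor"],
    positives=dev["positive"],
    negatives=dev["negative"],
    name="all-nli-dev",
)
print(evaluator(model))  # e.g. {'all-nli-dev_cosine_accuracy': 0.928}
```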
Training Details
Training Dataset - all-nli-tr
- Dataset: all-nli-tr at daeabfb
- Size: 482,091 training samples
- Columns: anchor, positive, and negative
- Approximate statistics based on the first 1000 samples:

Statistic | anchor | positive | negative |
---|---|---|---|
type | string | string | string |
details | min: 5 tokens, mean: 28.16 tokens, max: 151 tokens | min: 5 tokens, mean: 15.14 tokens, max: 49 tokens | min: 4 tokens, mean: 14.33 tokens, max: 55 tokens |

- Samples:

anchor | positive | negative |
---|---|---|
Mevsim boyunca ve sanırım senin seviyendeyken onları bir sonraki seviyeye düşürürsün. Eğer ebeveyn takımını çağırmaya karar verirlerse Braves üçlü A'dan birini çağırmaya karar verirlerse çifte bir adam onun yerine geçmeye gider ve bekar bir adam gelir. | Eğer insanlar hatırlarsa, bir sonraki seviyeye düşersin. | Hiçbir şeyi hatırlamazlar. |
Numaramızdan biri talimatlarınızı birazdan yerine getirecektir. | Ekibimin bir üyesi emirlerinizi büyük bir hassasiyetle yerine getirecektir. | Şu anda boş kimsek yok, bu yüzden sen de harekete geçmelisin. |
Bunu nereden biliyorsun? Bütün bunlar yine onların bilgileri. | Bu bilgi onlara ait. | Hiçbir bilgileri yok. |

- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
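MultipleNegativesRankingLoss uses in-batch negatives: for each anchor, its paired positive must score higher than every other positive and negative in the same batch, which is why large per-device batch sizes (256 here) help. A minimal sketch of constructing this loss with the parameters listed above:
```python
from sentence_transformers import SentenceTransformer, util
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Start from the same base model that this card was fine-tuned from.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# scale=20.0 and cosine similarity mirror the loss parameters reported above.
loss = MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)
```
With (anchor, positive, negative) triplets like the samples above, the explicit negative is used as an additional hard negative alongside the in-batch negatives.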
Evaluation Dataset - all-nli-tr
- Dataset: all-nli-tr at daeabfb
- Size: 6,567 evaluation samples
- Columns: anchor, positive, and negative
- Approximate statistics based on the first 1000 samples:

Statistic | anchor | positive | negative |
---|---|---|---|
type | string | string | string |
details | min: 3 tokens, mean: 26.66 tokens, max: 121 tokens | min: 5 tokens, mean: 14.98 tokens, max: 49 tokens | min: 4 tokens, mean: 14.4 tokens, max: 37 tokens |

- Samples:

anchor | positive | negative |
---|---|---|
Bilemiyorum. Onunla ilgili karışık duygularım var. Bazen ondan hoşlanıyorum ama aynı zamanda birisinin onu dövmesini görmeyi seviyorum. | Çoğunlukla ondan hoşlanıyorum, ama yine de birinin onu dövdüğünü görmekten zevk alıyorum. | O benim favorim ve kimsenin onu yendiğini görmek istemiyorum. |
Sen ve arkadaşların burada hoş karşılanmaz, Severn söyledi. | Severn orada insanların hoş karşılanmadığını söyledi. | Severn orada insanların her zaman hoş karşılanacağını söyledi. |
Gecenin en aşağısı ne olduğundan emin değilim. | Dün gece ne kadar soğuk oldu bilmiyorum. | Dün gece hava 37 dereceydi. |

- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 256
- per_device_eval_batch_size: 256
- gradient_accumulation_steps: 4
- num_train_epochs: 10
- warmup_ratio: 0.1
- bf16: True
- dataloader_num_workers: 4
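For context, here is a hedged end-to-end training sketch showing how the non-default hyperparameters above would map onto SentenceTransformerTrainingArguments in Sentence Transformers 3.x. The dataset id `emrecan/all-nli-tr`, its split names, and the output directory are assumptions, not details confirmed by this card.
```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    util,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Assumed Hub id and splits; the card only names "all-nli-tr at daeabfb"
# with anchor / positive / negative columns.
train_ds = load_dataset("emrecan/all-nli-tr", split="train")
eval_ds = load_dataset("emrecan/all-nli-tr", split="dev")

loss = MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)

args = SentenceTransformerTrainingArguments(
    output_dir="intfloat-triplet-v2",  # hypothetical output directory
    eval_strategy="steps",
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    warmup_ratio=0.1,
    bf16=True,
    dataloader_num_workers=4,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    loss=loss,
)
trainer.train()
```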
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 256
- per_device_eval_batch_size: 256
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 4
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 4
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | Validation Loss | all-nli-dev_cosine_accuracy |
---|---|---|---|---|
1.0616 | 500 | 6.0902 | 0.7763 | 0.9024 |
2.1231 | 1000 | 3.6464 | 0.6962 | 0.9156 |
3.1847 | 1500 | 3.1127 | 0.6679 | 0.9191 |
4.2463 | 2000 | 2.8153 | 0.6608 | 0.9233 |
5.3079 | 2500 | 2.5886 | 0.6506 | 0.9252 |
6.3694 | 3000 | 2.4437 | 0.6478 | 0.9252 |
7.4310 | 3500 | 2.3393 | 0.6456 | 0.9263 |
8.4926 | 4000 | 2.2521 | 0.6414 | 0.9284 |
9.5541 | 4500 | 2.1913 | 0.6397 | 0.9280 |
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.3
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.3.0
- Tokenizers: 0.21.0
📄 License
No license information provided in the original document.
📖 Citation
BibTeX
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
Featured Recommended AI Models
- Jina Embeddings V3 (jinaai): a multilingual sentence embedding model supporting over 100 languages, specializing in sentence similarity and feature extraction tasks. Tags: Text Embedding, Transformers, multiple languages. Downloads: 3.7M, Likes: 911.
- Ms Marco MiniLM L6 V2 (cross-encoder, Apache-2.0): a cross-encoder model trained on the MS MARCO passage ranking task for query-passage relevance scoring in information retrieval. Tags: Text Embedding, English. Downloads: 2.5M, Likes: 86.
- Opensearch Neural Sparse Encoding Doc V2 Distill (opensearch-project, Apache-2.0): a distillation-based sparse retrieval model optimized for OpenSearch, supporting inference-free document encoding with improved search relevance and efficiency over V1. Tags: Text Embedding, Transformers, English. Downloads: 1.8M, Likes: 7.
- Sapbert From PubMedBERT Fulltext (cambridgeltl, Apache-2.0): a biomedical entity representation model based on PubMedBERT, optimized to capture semantic relations through self-aligned pre-training. Tags: Text Embedding, English. Downloads: 1.7M, Likes: 49.
- Gte Large (thenlper, MIT): a powerful sentence transformer model focused on sentence similarity and text embedding tasks, performing strongly on multiple benchmarks. Tags: Text Embedding, English. Downloads: 1.5M, Likes: 278.
- Gte Base En V1.5 (Alibaba-NLP, Apache-2.0): an English sentence transformer model focused on sentence similarity tasks, performing strongly on multiple text embedding benchmarks. Tags: Text Embedding, Transformers, multiple languages. Downloads: 1.5M, Likes: 63.
- Gte Multilingual Base (Alibaba-NLP, Apache-2.0): a multilingual sentence embedding model supporting over 50 languages, suitable for tasks like sentence similarity calculation. Tags: Text Embedding, Transformers, multiple languages. Downloads: 1.2M, Likes: 246.
- Polybert (kuelumbus): a chemical language model for fully machine-driven, ultrafast polymer informatics; it maps PSMILES strings into 600-dimensional dense fingerprints that numerically represent polymer chemical structures. Tags: Text Embedding, Transformers. Downloads: 1.0M, Likes: 5.
- Bert Base Turkish Cased Mean Nli Stsb Tr (emrecan, Apache-2.0): a sentence embedding model based on Turkish BERT, optimized for semantic similarity tasks. Tags: Text Embedding, Transformers, other languages. Downloads: 1.0M, Likes: 40.
- GIST Small Embedding V0 (avsolatorio, MIT): a text embedding model fine-tuned from BAAI/bge-small-en-v1.5, trained on the MEDI dataset and MTEB classification task datasets, optimized for query encoding in retrieval tasks. Tags: Text Embedding, Safetensors, English. Downloads: 945.68k, Likes: 29.