🚀 Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0
Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0 is a multilingual sentence-transformers embedding model, fine-tuned on e-commerce datasets and optimized for semantic similarity, search, classification, and retrieval tasks. It is built on a distilled version of Alibaba-NLP/gte-multilingual-base, fine-tuned with a Turkish-English pair translation dataset and domain-specific signals from millions of real-world queries, product descriptions, and user interactions.
🚀 Quick Start
The model is ready to use immediately after installation; follow the usage examples below to get started.
✨ Features
- Optimized for e-commerce semantic search.
- Enhanced Turkish and multilingual query understanding.
- Supports query rephrasing and paraphrase mining (see the sketch after this list).
- Robust for product tagging and attribute extraction.
- Suitable for clustering and product categorization.
- High performance in semantic textual similarity.
- 384-token input support.
- 768-dimensional dense vector outputs.
- Built-in cosine similarity for inference.
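As a quick illustration of the paraphrase-mining capability, here is a minimal sketch using the `paraphrase_mining` utility from Sentence Transformers; the Turkish phrases are illustrative examples, not taken from the training data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import paraphrase_mining

model = SentenceTransformer(
    "Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0",
    trust_remote_code=True,
)

# Illustrative product-style phrases; paraphrase_mining returns
# [score, i, j] triples for the most similar sentence pairs.
phrases = [
    "kablosuz kulaklık",   # "wireless headphones"
    "bluetooth kulaklık",  # "bluetooth headphones"
    "çelik tencere seti",  # "steel cookware set"
]
for score, i, j in paraphrase_mining(model, phrases):
    print(f"{score:.3f}  {phrases[i]!r} <-> {phrases[j]!r}")
```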
📦 Installation
First, install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
💻 Usage Examples
Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download from the Hub
matryoshka_dim = 768
model = SentenceTransformer(
    "Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0",
    trust_remote_code=True,
    truncate_dim=matryoshka_dim,
)

# Run inference on Turkish marketplace Q&A snippets
sentences = [
    '120x190 yapıyor musunuz',             # "do you make 120x190?"
    'merhaba 120 x 180 mevcüttür',         # "hello, 120 x 180 is available"
    'Ürün stoklarımızda bulunmamaktadır',  # "the product is out of stock"
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# (3, 3)
```
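Semantic Search
For retrieval workloads, the embeddings pair naturally with the `semantic_search` utility. The following is a minimal sketch; the query and corpus strings are illustrative examples, not from the model's documentation:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0",
    trust_remote_code=True,
)

corpus = [
    "Paslanmaz çelik tencere seti, 8 parça",  # stainless steel cookware set, 8 pieces
    "Kablosuz bluetooth kulaklık, siyah",     # wireless bluetooth headphones, black
    "Pamuklu çift kişilik nevresim takımı",   # cotton double duvet cover set
]
query = "bluetooth kulaklık"  # "bluetooth headphones"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Embeddings are L2-normalized by the model, so cosine similarity
# and dot product rank results identically here.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```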
📚 Documentation
Model Details
Model Description
| Property | Details |
|---|---|
| Model Type | Sentence Transformer |
| Maximum Sequence Length | 384 tokens |
| Output Dimensionality | 768 dimensions |
| Matryoshka Dimensions | 768, 512, 128 |
| Similarity Function | Cosine Similarity |
| Training Datasets | Multilingual and Turkish search terms, Turkish instruction datasets, Turkish summarization datasets, Turkish e-commerce rephrase datasets, Turkish question-answer pairs, and more! |
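Because the model was trained with Matryoshka dimensions 768, 512, and 128, embeddings can be truncated at load time for smaller indexes and faster search. A minimal sketch (the choice of 128 here is just an example):

```python
from sentence_transformers import SentenceTransformer

# Load with a smaller Matryoshka dimension; 128 trades some accuracy
# for an index 6x smaller than the full 768 dimensions.
model = SentenceTransformer(
    "Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0",
    trust_remote_code=True,
    truncate_dim=128,
)
embeddings = model.encode(["kablosuz kulaklık", "bluetooth kulaklık"])
print(embeddings.shape)  # (2, 128)
```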
Model Sources
- Documentation: [Sentence Transformers Documentation](https://www.sbert.net)
- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: NewModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
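The final `Normalize()` module means every output vector has unit L2 norm, so cosine similarity and dot product coincide. A minimal sketch to verify this property:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0",
    trust_remote_code=True,
)
emb = model.encode(["merhaba", "hello"])

# All norms should be ~1.0 because of the final Normalize() module.
print(np.linalg.norm(emb, axis=1))
```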
🔧 Technical Details
Training Details
- Loss: MatryoshkaLoss with these parameters:

```json
{
    "loss": "CachedMultipleNegativesSymmetricRankingLoss",
    "matryoshka_dims": [768, 512, 128],
    "matryoshka_weights": [1, 1, 1],
    "n_dims_per_step": -1
}
```
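In Sentence Transformers, this configuration corresponds to wrapping `CachedMultipleNegativesSymmetricRankingLoss` in `MatryoshkaLoss`. The sketch below shows how such a loss is typically constructed; the base checkpoint is shown only for illustration, and the actual training pipeline and data are not public:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import (
    CachedMultipleNegativesSymmetricRankingLoss,
    MatryoshkaLoss,
)

# Base checkpoint per the model description; the real run used a
# distilled variant that is not published.
model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)

base_loss = CachedMultipleNegativesSymmetricRankingLoss(model)
loss = MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 128],
    matryoshka_weights=[1, 1, 1],  # equal weight per dimension, as configured above
)
```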
Training Hyperparameters
Non-Default Hyperparameters
- `overwrite_output_dir`: True
- `eval_strategy`: steps
- `per_device_train_batch_size`: 2048
- `per_device_eval_batch_size`: 128
- `learning_rate`: 0.0005
- `num_train_epochs`: 1
- `warmup_ratio`: 0.01
- `fp16`: True
- `ddp_timeout`: 300000
- `batch_sampler`: no_duplicates
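These values map directly onto `SentenceTransformerTrainingArguments`. A minimal sketch, assuming a hypothetical `output_dir` (not stated in the original):

```python
from sentence_transformers.training_args import (
    BatchSamplers,
    SentenceTransformerTrainingArguments,
)

args = SentenceTransformerTrainingArguments(
    output_dir="ty-ecomm-embed",  # hypothetical path, for illustration only
    overwrite_output_dir=True,
    eval_strategy="steps",
    per_device_train_batch_size=2048,
    per_device_eval_batch_size=128,
    learning_rate=5e-4,
    num_train_epochs=1,
    warmup_ratio=0.01,
    fp16=True,
    ddp_timeout=300000,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid in-batch duplicate pairs
)
```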
All Hyperparameters
- `overwrite_output_dir`: True
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 2048
- `per_device_eval_batch_size`: 128
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 0.0005
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.01
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: True
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 300000
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `dispatch_batches`: None
- `split_batches`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.1
- PyTorch: 2.5.1+cu124
- Accelerate: 1.5.1
- Datasets: 2.21.0
- Tokenizers: 0.21.1
Bias, Risks and Limitations
⚠️ Important Note
While this model is trained on e-commerce-related datasets, including multilingual and Turkish data, users should be aware of several limitations:
- Domain bias: Performance may degrade on content outside e-commerce or product-related domains, such as legal, medical, or highly technical texts.
- Language coverage: Although multilingual data was included, the majority of the training data is Turkish.
- Input length limitations: Inputs exceeding the maximum sequence length (384 tokens) are truncated, potentially losing critical context in long texts (see the chunking sketch after this list).
- Spurious similarity: The model may assign high similarity scores to unrelated phrases that are lexically similar or that frequently co-occur in the training data.
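A common mitigation for the 384-token limit is to split long texts into chunks, embed each chunk, and mean-pool the normalized chunk vectors. The sketch below uses a whitespace word count as a crude proxy for token count (an assumption, not part of the model's API):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0",
    trust_remote_code=True,
)

def embed_long_text(text: str, max_words: int = 250) -> np.ndarray:
    """Chunk by word count (a rough stand-in for the 384-token limit),
    embed each chunk, mean-pool, and re-normalize."""
    words = text.split()
    chunks = [" ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)]
    chunk_embs = model.encode(chunks)
    pooled = chunk_embs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)  # restore unit norm after pooling
```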
Recommendations
💡 Usage Tip
- Human Oversight: We recommend incorporating a human curation layer or filters to manage and improve the quality of outputs, especially in public-facing applications. This helps mitigate the risk of unexpectedly surfacing objectionable content.
- Application-Specific Testing: Developers intending to use Trendyol embedding models should conduct thorough safety testing and optimization tailored to their specific applications. This is crucial, as the model's outputs may occasionally be biased or inaccurate.
- Responsible Development and Deployment: It is the responsibility of developers and users of Trendyol embedding models to ensure their ethical and safe application. We urge users to be mindful of the model's limitations and to employ appropriate safeguards to prevent misuse or harmful consequences.
Citation
BibTeX
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```





