🚀 Tooka-SBERT-V2-Large
This is a Sentence Transformers model for semantic textual similarity and embedding tasks. It maps sentences and paragraphs into a dense vector space in which semantically similar texts lie close together. The model comes in two sizes: Small and Large.
🚀 Quick Start
First, install the Sentence Transformers library:
pip install sentence-transformers==3.4.1
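Then load the model and run inference:
from sentence_transformers import SentenceTransformer

# Load the model (fetched from the Hugging Face Hub on first use)
model = SentenceTransformer("PartAI/Tooka-SBERT-V2-Large")

# Persian example sentences; English glosses are given in the comments
sentences = [
    # "The crane is a migratory bird with long legs and a long neck."
    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
    # "Cranes, with their tall stature and broad wings, are counted among the most beautiful migratory birds."
    'درناها با قامتی بلند و بالهای پهن، از زیباترین پرندگان مهاجر به شمار میروند.',
    # "Cranes are small birds with short legs that do not migrate."
    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمیکنند.'
]

# Encode the sentences into dense vectors, one row per sentence
embeddings = model.encode(sentences)
print(embeddings.shape)

# Pairwise similarity scores between all sentences (a 3 x 3 matrix here)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)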
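To make the similarity step concrete, here is a minimal pure-Python sketch of a pairwise cosine similarity matrix. It assumes the model uses cosine similarity, which is the Sentence Transformers default when no other similarity function is configured. The three-dimensional `toy_embeddings` are hypothetical stand-ins for real model output, and `cosine_similarity` and `similarity_matrix` are illustrative helpers, not library functions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, one row per input vector."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical toy vectors standing in for real sentence embeddings
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]

matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 3) for value in row])
```

Entry `matrix[i][j]` compares input `i` with input `j`, so the diagonal holds each vector's similarity with itself and the matrix is symmetric.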
✨ Features
- Vector Mapping: Maps sentences and paragraphs to a dense vector space for semantic similarity analysis.
- Two Sizes: Available in both small and large sizes to meet different application needs.
🔧 Technical Details
The training process consists of two stages:
Stage 1: Pretraining
- Asymmetric Setup: Titles and texts are encoded as two distinct input types, each marked with its own prefix.
- Input Formatting:
  - Titles are prefixed with "سوال: " ("Question: ").
  - Texts are prefixed with "متن: " ("Text: ").
- Loss Function: CachedMultipleNegativesRankingLoss
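As a rough illustration of the objective behind this loss, the sketch below computes an in-batch-negatives ranking loss in pure Python. It is not the library implementation: CachedMultipleNegativesRankingLoss additionally uses gradient caching (see the Gao et al. citation) to reach large batch sizes under limited memory, which is omitted here. The `scale=20.0` value and the use of cosine similarity are assumptions based on the Sentence Transformers defaults, and the two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch serves as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


# Hypothetical title and text embeddings
title_vectors = [[1.0, 0.0], [0.0, 1.0]]
text_vectors = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = in_batch_ranking_loss(title_vectors, text_vectors)
misaligned_loss = in_batch_ranking_loss(title_vectors, text_vectors[::-1])
print(aligned_loss, misaligned_loss)
```

The loss is close to zero when each title is most similar to its own text and grows when the pairs are mismatched, which is what pushes matching title and text embeddings together during training.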
Stage 2: Fine-tuning
- Loss Functions: CachedMultipleNegativesRankingLoss, CoSENTLoss
- Datasets: Multiple synthetic datasets
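For intuition about the second loss, here is a minimal pure-Python sketch of a CoSENT-style ranking objective. It assumes the formulation in which every pair with a higher gold score should also receive a higher predicted similarity than every pair with a lower gold score, with `scale=20.0` taken as an assumed default. The function name `cosent_style_loss` and the toy scores are hypothetical; this is not the library's CoSENTLoss code.

```python
import math


def cosent_style_loss(predicted_sims, gold_scores, scale=20.0):
    """log(1 + sum of exp(scale * (s_low - s_high))) over all pair orderings,
    where s_high belongs to the pair with the higher gold score."""
    terms = [0.0]  # exp(0) = 1 accounts for the "1 +" inside the log
    n = len(predicted_sims)
    for i in range(n):
        for j in range(n):
            if gold_scores[i] > gold_scores[j]:
                terms.append(scale * (predicted_sims[j] - predicted_sims[i]))
    # Log-sum-exp with max subtraction for numerical stability
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical gold similarity labels for two sentence pairs
gold = [1.0, 0.0]

well_ordered = cosent_style_loss([0.9, 0.1], gold)  # predictions agree with gold ranking
mis_ordered = cosent_style_loss([0.1, 0.9], gold)   # predictions invert the gold ranking
print(well_ordered, mis_ordered)
```

Only the relative order of the gold scores matters in this formulation, so the labels do not need to be calibrated similarity values.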
📚 Documentation
Evaluation
We evaluated our model on the PTEB Benchmark. Our model outperforms mE5-Base on average across PTEB tasks.
For Retrieval and Reranking tasks, we follow the same asymmetric structure, prepending:
- "سوال: " ("Question: ") to queries
- "متن: " ("Text: ") to documents
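A minimal sketch of applying these prefixes before encoding is shown below. The helper `add_prefix`, the constant names, and the example query and document are hypothetical; only the two prefix strings come from the description above.

```python
QUERY_PREFIX = "سوال: "     # "Question: "
DOCUMENT_PREFIX = "متن: "   # "Text: "


def add_prefix(texts, prefix):
    """Return a new list with the prefix prepended to every text."""
    return [prefix + text for text in texts]


# Hypothetical example inputs
raw_queries = ["پایتخت ایران کجاست؟"]        # "Where is the capital of Iran?"
raw_documents = ["تهران پایتخت ایران است."]  # "Tehran is the capital of Iran."

queries = add_prefix(raw_queries, QUERY_PREFIX)
documents = add_prefix(raw_documents, DOCUMENT_PREFIX)

print(queries)
print(documents)
```

The prefixed lists can then be passed to `model.encode` in place of the raw strings, so that queries and documents are embedded with the same formatting used during training.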
| Property | Details |
|---|---|
| Model Type | Sentence Transformers model for semantic textual similarity and embedding tasks |
| Training Data | Pretrained on the Targoman News dataset and fine-tuned on multiple synthetic datasets |
Task-Specific Datasets in PTEB
- Pair-Classification: FarsTail
- Classification: MassiveIntentClassification, MassiveScenarioClassification, MultilingualSentimentClassification, PersianFoodSentimentClassification
- Retrieval: MIRACLRetrieval, NeuCLIR2023Retrieval, WikipediaRetrievalMultilingual
- Reranking: MIRACLReranking, WikipediaRerankingMultilingual
📄 License
No license information is provided.
📖 Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
CachedMultipleNegativesRankingLoss
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}