Tooka SBERT V2 Small
Tooka-SBERT-V2-Small is a sentence transformer model trained for semantic textual similarity and embedding tasks. It maps sentences and paragraphs to a dense vector space in which semantically similar texts lie close to each other.
Release Time: 5/13/2025
Model Overview
This model is designed for semantic similarity and embedding tasks on Persian text, and its performance is optimized through a two-stage training process (pretraining followed by fine-tuning).
Model Features
Two-stage training
The model is trained in two stages: it is first pretrained on the Targoman News dataset and then fine-tuned on multiple synthetic datasets.
Asymmetric input processing
It supports prepending task-specific prefixes to the input, such as 'سوال:' ("question:") for queries and 'متن:' ("text:") for passages, to distinguish different types of text and improve semantic matching.
Efficient performance
It performs strongly on the PTEB benchmark, with a higher average score than the mE5-Base model.
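The asymmetric prefix convention described above can be sketched as plain string handling before encoding. Whether a space follows the colon is an assumption here; the helper names are illustrative, not part of the model's API:

```python
# Task-specific prefixes for asymmetric input processing.
# 'سوال:' means "question:" and marks a search query;
# 'متن:' means "text:" and marks a document/passage.
QUERY_PREFIX = "سوال: "
PASSAGE_PREFIX = "متن: "

def with_query_prefix(query: str) -> str:
    """Prepend the query prefix before encoding a search query."""
    return QUERY_PREFIX + query

def with_passage_prefix(passage: str) -> str:
    """Prepend the passage prefix before encoding a document/passage."""
    return PASSAGE_PREFIX + passage
```

The prefixed strings would then be passed to the model's encoder, so that queries and passages are embedded with their roles made explicit.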
Model Capabilities
Semantic text similarity calculation
Text embedding generation
Persian text processing
Use Cases
Information retrieval
Document retrieval
Use the embeddings generated by the model for document similarity search
It performs well on retrieval datasets such as MIRACLRetrieval.
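A minimal sketch of embedding-based document search: rank documents by cosine similarity to a query embedding. The toy vectors in the usage example stand in for the model's outputs; in practice each vector would come from encoding a prefixed query or passage:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_index, score) pairs, most similar first."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Usage with toy embeddings standing in for model outputs:
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
results = search([1.0, 0.0], docs, top_k=2)
```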
Text classification
Sentiment analysis
Use text embeddings for sentiment classification
It is effective in tasks such as PersianFoodSentimentClassification
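One simple way to use text embeddings for sentiment classification is a nearest-centroid classifier: average the embeddings of labeled examples per class, then assign a new text to the closest centroid. This is a hedged sketch with toy vectors in place of real model embeddings, not the evaluation setup used on PersianFoodSentimentClassification:

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim, n = len(vectors[0]), len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def nearest_centroid(embedding, centroids):
    """Return the label whose class centroid is closest (Euclidean)."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(centroids, key=lambda label: dist(embedding, centroids[label]))

# Toy embeddings standing in for encoded review texts:
centroids = {
    "positive": centroid([[1.0, 0.0], [0.9, 0.1]]),
    "negative": centroid([[0.0, 1.0]]),
}
label = nearest_centroid([0.8, 0.2], centroids)
```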
Re-ranking
Search result optimization
Perform semantic re-ranking on the initial retrieval results
It performs strongly on tasks such as WikipediaRerankingMultilingual.