Tooka-SBERT-V2-Large: An Open-Source Model for Persian Semantic Textual Similarity and Embedding

Tooka SBERT V2 Large

Developed by PartAI

A semantic text similarity and embedding model specifically designed for Persian, capable of mapping sentences into a dense vector space where semantically similar texts are positioned close to each other.

Tags: Text Embedding, Persian Semantic Similarity, Multi-task Fine-tuning, News Text Optimization

Release Time: 5/13/2025

Model Overview

This model is a Sentence Transformers model for semantic text similarity and embedding tasks, available in both small and large sizes.

Model Features

Bilingual Support

Optimized specifically for Persian while also supporting English tasks

Two-stage Training

Adopts a two-stage training strategy of pre-training and fine-tuning to enhance model performance

Efficient Similarity Calculation

Capable of quickly calculating semantic similarity scores between sentences

Model Capabilities

Sentence similarity calculation

Text feature extraction

Semantic search

Information retrieval

Use Cases

Information Retrieval

Document Similarity Search

Finding semantically similar documents in Persian document collections

Achieved a retrieval task score of 59.80 on the PTEB benchmark

Text Classification

Sentiment Analysis

Performing sentiment classification on Persian texts

Achieved an average score of 74.73 on PTEB classification tasks

🚀 Tooka-SBERT-V2-Large

This model is a Sentence Transformers model designed for semantic textual similarity and embedding tasks. It maps sentences and paragraphs into a dense vector space, where semantically similar texts are closely located. The model comes in two sizes: Small and Large.

🚀 Quick Start

First, install the Sentence Transformers library:

pip install sentence-transformers==3.4.1

Then, you can load this model and perform inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("PartAI/Tooka-SBERT-V2-Large")
# Run inference
sentences = [
    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
    'درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.',
    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
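
To make the similarity step concrete, here is a minimal sketch of pairwise cosine similarity in plain Python. Cosine is the default similarity function in Sentence Transformers, and this sketch assumes the model keeps that default. The function name `cosine_similarity_matrix` and the 3-dimensional toy vectors are illustrative stand-ins, not part of the library or the model.

```python
import math

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix for a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    matrix = []
    for i, a in enumerate(vectors):
        row = []
        for j, b in enumerate(vectors):
            dot = sum(x * y for x, y in zip(a, b))
            row.append(dot / (norms[i] * norms[j]))
        matrix.append(row)
    return matrix

# Toy 3-dimensional stand-ins for the 1024-dimensional embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
toy_similarities = cosine_similarity_matrix(toy_embeddings)
for row in toy_similarities:
    print([round(score, 2) for score in row])
```

Each row of the result compares one sentence with every sentence in the batch, so the diagonal is 1 and the matrix is symmetric. In the Persian example above, the first two sentences would be expected to score higher with each other than with the third, which contradicts them.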

✨ Features

Vector Mapping: Maps sentences and paragraphs to a dense vector space for semantic similarity analysis.
Two Sizes: Available in both small and large sizes to meet different application needs.

🔧 Technical Details

The training process consists of two stages:

Stage 1: Pretraining

Asymmetric Setup: Titles and texts are treated as different input types, each marked with its own prefix.
Input Formatting:
- Titles are prefixed with "سوال: ".
- Texts are prefixed with "متن: ".
Loss Function: CachedMultipleNegativesRankingLoss
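
As a rough illustration of the objective behind CachedMultipleNegativesRankingLoss, the sketch below computes an in-batch negatives ranking loss in plain Python: each title is scored against every text in the batch, and the matching text (the diagonal) is treated as the correct class. It leaves out the gradient caching that the cached variant adds to fit large batches in memory. The function name `in_batch_ranking_loss` and the toy similarity matrices are hypothetical. `scale=20.0` mirrors the library's default and is an assumption, since the value used to train this model is not stated.

```python
import math

def in_batch_ranking_loss(similarity, scale=20.0):
    """Mean cross-entropy over rows; the diagonal holds the positive pairs.

    similarity[i][j] is the cosine similarity between title i and text j.
    """
    total = 0.0
    for i, row in enumerate(similarity):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(similarity)

# Hypothetical batches of three (title, text) pairs.
well_separated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
confused = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

loss_separated = in_batch_ranking_loss(well_separated)
loss_confused = in_batch_ranking_loss(confused)
print(loss_separated, loss_confused)
```

The loss is near zero when each title is most similar to its own text and rises to log(batch size) when the model cannot tell the texts apart, which is why larger batches supply more negatives per step.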

Stage 2: Fine-tuning

Loss Functions:
- CachedMultipleNegativesRankingLoss
- CoSENTLoss
Datasets: Both losses are applied across multiple synthetic datasets.
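
For readers unfamiliar with CoSENTLoss, the following is a minimal plain-Python sketch of its ranking objective: for every two sentence pairs where the first has a higher gold similarity label than the second, the loss penalizes the model if its predicted cosine similarity for the first is not also higher. The function name `cosent_loss` and the toy scores are hypothetical, and `scale=20.0` is the library default rather than a value confirmed for this model. How the two Stage 2 losses are combined is not described in the source, so that is not shown.

```python
import math

def cosent_loss(predicted, gold, scale=20.0):
    """log(1 + sum(exp(scale * (s_low - s_high)))) over all ordered label pairs."""
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the logarithm
    for s_high, y_high in zip(predicted, gold):
        for s_low, y_low in zip(predicted, gold):
            if y_high > y_low:
                terms.append(scale * (s_low - s_high))
    peak = max(terms)  # subtract the max for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical gold labels and predicted cosine similarities for two sentence pairs.
gold_labels = [0.9, 0.1]
loss_ranked = cosent_loss([0.8, 0.2], gold_labels)    # predictions agree with the labels
loss_inverted = cosent_loss([0.2, 0.8], gold_labels)  # predictions contradict the labels
print(loss_ranked, loss_inverted)
```

Because only the ordering of the scores matters, this kind of loss suits datasets with graded similarity labels, whereas the ranking loss from Stage 1 only needs positive pairs.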

📚 Documentation

Evaluation

We evaluated our model on the PTEB benchmark. Tooka-SBERT-V2-Large outperforms mE5-Base (multilingual-e5-base) on average across PTEB tasks, with a CrossTasks-Avg of 72.05 versus 70.09.

For Retrieval and Reranking tasks, we follow the same asymmetric structure, prepending:

"سوال: " to queries
"متن: " to documents
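
The prefixing convention above can be sketched as a small helper. The prefixes are taken from the text, while the helper name `add_prefix` and the example query are hypothetical. The documents reuse the Quick Start sentences.

```python
QUERY_PREFIX = "سوال: "
DOCUMENT_PREFIX = "متن: "

def add_prefix(texts, prefix):
    """Prepend the given prefix to every text in the list."""
    return [prefix + text for text in texts]

# Hypothetical query; the documents are the Quick Start sentences.
queries = add_prefix(["درنا چه نوع پرنده‌ای است؟"], QUERY_PREFIX)
documents = add_prefix(
    [
        "درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.",
        "درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.",
        "درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.",
    ],
    DOCUMENT_PREFIX,
)
print(queries[0])
print(documents[0])
```

The prefixed lists would then be passed to `model.encode`, and the query embeddings compared against the document embeddings with `model.similarity`, as in the Quick Start. Whether the prefixes also help on symmetric tasks such as sentence-pair similarity is not stated in the source.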

Property	Details
Model Type	Sentence Transformers model for semantic textual similarity and embedding tasks
Training Data	Pretrained on the Targoman News dataset and fine-tuned on multiple synthetic datasets

Model	#Params	Pair-Classification-Avg	Classification-Avg	Retrieval-Avg	Reranking-Avg	CrossTasks-Avg
Tooka-SBERT-V2-Large	353M	80.24	74.73	59.80	73.44	72.05
Tooka-SBERT-V2-Small	123M	75.69	72.16	61.24	73.40	70.62
jina-embeddings-v3	572M	71.88	79.27	65.18	64.62	70.24
multilingual-e5-base	278M	70.76	69.71	63.90	76.01	70.09
Tooka-SBERT-V1-Large	353M	81.52	71.54	45.61	60.44	64.78
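
The source does not define the CrossTasks-Avg column, but the reported values are consistent with an unweighted mean of the four task averages. The check below is a sketch under that assumption. The numbers are copied from the table, and the variable names are illustrative.

```python
# (Pair-Classification, Classification, Retrieval, Reranking), reported CrossTasks-Avg
table = {
    "Tooka-SBERT-V2-Large": ([80.24, 74.73, 59.80, 73.44], 72.05),
    "Tooka-SBERT-V2-Small": ([75.69, 72.16, 61.24, 73.40], 70.62),
    "jina-embeddings-v3": ([71.88, 79.27, 65.18, 64.62], 70.24),
    "multilingual-e5-base": ([70.76, 69.71, 63.90, 76.01], 70.09),
    "Tooka-SBERT-V1-Large": ([81.52, 71.54, 45.61, 60.44], 64.78),
}

differences = {}
for model_name, (task_averages, reported) in table.items():
    computed = sum(task_averages) / len(task_averages)
    differences[model_name] = abs(computed - reported)
    print(f"{model_name}: computed {computed:.4f}, reported {reported}")
```

Every computed mean lands within 0.01 of the reported figure. The small gaps are what rounding the per-task averages to two decimals would produce.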

Task-Specific Datasets in PTEB

Pair-Classification: FarsTail
Classification: MassiveIntentClassification, MassiveScenarioClassification, MultilingualSentimentClassification, PersianFoodSentimentClassification
Retrieval: MIRACLRetrieval, NeuCLIR2023Retrieval, WikipediaRetrievalMultilingual
Reranking: MIRACLReranking, WikipediaRerankingMultilingual

📄 License

No license information provided in the original document.

📖 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup}, 
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
