Tooka-SBERT Open-source Persian Embedding Model - Free Implementation of Text Semantic Similarity Calculation

Tooka SBERT

Developed by PartAI

This is a Persian sentence embedding model based on TookaBERT-Large, which maps text to a 1024-dimensional vector space for tasks such as semantic similarity calculation.

Text Embedding

Safetensors

Open Source License: Apache-2.0 #Persian sentence similarity #1024-dimensional vector embedding #Semantic search optimization

Downloads 2,847

Release Time: 12/3/2024

Model Overview

This model is a sentence transformer specifically designed for Persian, capable of converting sentences and paragraphs into dense vector representations, suitable for tasks like semantic text similarity, semantic search, text classification, and clustering.

Model Features

Persian Optimization

Specifically optimized for Persian text, accurately capturing Persian semantic features.

Efficient Similarity Calculation

Uses cosine similarity to quickly compute semantic similarity between sentences.
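As a minimal sketch of what that score means: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite) to 1 (same direction). The helper below is illustrative only; the name `cosine_similarity` is an assumption and not part of the model's API. In practice the library's `model.similarity` computes the same quantity for whole batches of embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the score depends only on direction, scaling either vector leaves the result unchanged.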

Large-scale Pretraining

Based on the TookaBERT-Large pretrained model, with strong semantic representation capabilities.

Model Capabilities

Semantic text similarity calculation

Semantic search

Paraphrase mining

Text classification

Text clustering

Use Cases

Information Retrieval
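A retrieval step over embeddings can be sketched as follows: normalize the query and each document vector to unit length, score each document by its dot product with the query (which then equals cosine similarity), and keep the best k. The helper names `l2_normalize` and `top_k` are hypothetical and exist only for this illustration; they are not part of the model or of Sentence Transformers.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def top_k(query_vec, corpus_vecs, k=3):
    """Return (index, score) pairs for the k corpus vectors closest to the query."""
    q = l2_normalize(query_vec)
    scored = []
    for i, vec in enumerate(corpus_vecs):
        d = l2_normalize(vec)
        # Dot product of unit vectors equals their cosine similarity.
        scored.append((i, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With the model loaded, the inputs would be `model.encode(query)` for a single query string and `model.encode(documents)` for a list of documents. For larger corpora, the library's own `sentence_transformers.util.semantic_search` is likely the better choice, since it scores in batches.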

🚀 SentenceTransformer

This is a model based on the sentence-transformers framework. It maps sentences and paragraphs into a 1024-dimensional dense vector space, which can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other tasks.

🚀 Quick Start

This is a sentence-transformers model. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

✨ Features

Maps sentences and paragraphs to a 1024-dimensional dense vector space.
Applicable to multiple natural language processing tasks, such as semantic textual similarity and semantic search.

📦 Installation

First install the Sentence Transformers library:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("PartAI/Tooka-SBERT")
# Run inference
sentences = [
    'درنا از پرندگان مهاجر با پاهای بلند و گردن دراز است.',
    'درناها با قامتی بلند و بال‌های پهن، از زیباترین پرندگان مهاجر به شمار می‌روند.',
    'درناها پرندگانی کوچک با پاهای کوتاه هستند که مهاجرت نمی‌کنند.'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

📚 Documentation

Model Details

Model Description

Property	Details
Model Type	Sentence Transformer
Base model	TookaBERT-Large
Maximum Sequence Length	512 tokens
Output Dimensionality	1024 dimensions
Similarity Function	Cosine Similarity
Language	Persian
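To confirm these values against a loaded model, a small check along the following lines may help. It is a sketch, not part of the model card: the function name `check_model_properties` is hypothetical, and it assumes the object passed in exposes `max_seq_length` and `get_sentence_embedding_dimension()`, as `SentenceTransformer` instances do.

```python
def check_model_properties(model, expected_dim=1024, expected_max_len=512):
    """Compare a loaded model's reported settings with the table's values."""
    actual_dim = model.get_sentence_embedding_dimension()
    actual_max_len = model.max_seq_length
    return {
        "dimension_ok": actual_dim == expected_dim,
        "max_seq_length_ok": actual_max_len == expected_max_len,
        "dimension": actual_dim,
        "max_seq_length": actual_max_len,
    }
```

Calling `check_model_properties(SentenceTransformer("PartAI/Tooka-SBERT"))` should report both checks as true if the table is accurate. Note that inputs longer than the maximum sequence length are truncated before encoding, so long documents may need to be split into passages first.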

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup}, 
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
