# 🚀 SentenceTransformer based on google-bert/bert-base-uncased
This is a sentence-transformers model finetuned from google-bert/bert-base-uncased on the qqp_triplets dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## ✨ Features
- Language: Supports English (`en`).
- Tags: Associated with `sentence-transformers`, `sentence-similarity`, `feature-extraction`, etc.
- Base Model: Built upon `google-bert/bert-base-uncased`.
- Loss Function: Utilizes `TripletLoss` (see the formulation after this list).
- Metrics: Evaluated with `cosine_accuracy`.
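For context, TripletLoss trains the encoder so that an anchor sentence ends up closer to its paraphrase (the positive) than to an unrelated question (the negative) by at least a margin. The generic formulation, leaving the distance function $d$ and the margin to the library's defaults, is:

$$\mathcal{L}(a, p, n) = \max\bigl(d(a, p) - d(a, n) + \text{margin},\ 0\bigr)$$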
## 📦 Installation
First, you need to install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples
### Basic Usage
```python
from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("palusi/sentest")

# Encode a few sentences into 768-dimensional embeddings
sentences = [
    'How can I open my computer if I forget my password?',
    'I forget my PC password what should I do to open it?',
    'I forgot my security code on my Nokia 206 how can I unlock it?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)

# Compute pairwise similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # torch.Size([3, 3])
```
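The embeddings also plug directly into the library's utility functions. As a minimal sketch of semantic search with this model (the corpus and query below are hypothetical examples, not from the model card):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("palusi/sentest")

# Hypothetical corpus and query, for illustration only
corpus = [
    "How do I reset a forgotten Windows password?",
    "What is the best way to learn Python?",
    "How can I unlock a phone without its security code?",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("I forgot my laptop password", convert_to_tensor=True)

# Return the top-2 nearest corpus entries by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{corpus[hit['corpus_id']]} (score: {hit['score']:.4f})")
```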
## 📚 Documentation
### Model Details
#### Model Description
| Property | Details |
|----------|---------|
| Model Type | Sentence Transformer |
| Base Model | google-bert/bert-base-uncased |
| Maximum Sequence Length | 512 tokens |
| Output Dimensionality | 768 dimensions |
| Similarity Function | Cosine Similarity |
| Training Dataset | qqp_triplets |
| Language | en |
#### Model Sources
#### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
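These properties can also be read off a loaded model directly; a quick sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("palusi/sentest")

print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 768
print(model)                                     # prints the module list shown above
```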
### Evaluation
#### Metrics: Triplet
| Metric | Value |
|--------|-------|
| cosine_accuracy | 0.9883 |
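Here, cosine_accuracy is the fraction of evaluation triplets for which the anchor's embedding is closer (by cosine similarity) to the positive than to the negative. A score like this could be reproduced with the library's TripletEvaluator; the triplets below reuse the card's usage example as placeholders for the real evaluation split:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("palusi/sentest")

# Placeholder (anchor, positive, negative) triplets for illustration
evaluator = TripletEvaluator(
    anchors=["How can I open my computer if I forget my password?"],
    positives=["I forget my PC password what should I do to open it?"],
    negatives=["I forgot my security code on my Nokia 206 how can I unlock it?"],
    name="sentest",
)
print(evaluator(model))  # e.g. {'sentest_cosine_accuracy': ...}
```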
### Training Details
#### Training Dataset: qqp_triplets
#### Evaluation Dataset: qqp_triplets
#### Training Hyperparameters
Non-default hyperparameters (a sketch of how these map onto the training API follows this list):

- `eval_strategy`: steps
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `learning_rate`: 2e-05
- `weight_decay`: 0.01
- `num_train_epochs`: 1
- `warmup_ratio`: 0.1
- `fp16`: True
- `load_best_model_at_end`: True
- `push_to_hub`: True
- `hub_model_id`: palusi/sentest
- `batch_sampler`: no_duplicates
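As referenced above, here is a minimal sketch of how these non-default hyperparameters might map onto the Sentence Transformers v3 training API. The tiny in-memory dataset is a stand-in for qqp_triplets (whose exact Hub location and column names are not given in the card), and pushing to the Hub assumes you are logged in:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import TripletLoss
from sentence_transformers.training_args import BatchSamplers

# Start from the base model; mean pooling is added automatically.
model = SentenceTransformer("google-bert/bert-base-uncased")

# Toy stand-in for qqp_triplets with assumed (anchor, positive, negative) columns.
toy = Dataset.from_dict({
    "anchor": ["How do I learn guitar quickly?"],
    "positive": ["What is the fastest way to learn to play guitar?"],
    "negative": ["How do I tune a piano?"],
})

args = SentenceTransformerTrainingArguments(
    output_dir="sentest",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,
    eval_strategy="steps",
    load_best_model_at_end=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    push_to_hub=True,  # requires `huggingface-cli login`
    hub_model_id="palusi/sentest",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=toy,  # replace with the real qqp_triplets train split
    eval_dataset=toy,   # replace with the real qqp_triplets eval split
    loss=TripletLoss(model),
)
trainer.train()
```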
#### Training Logs
| Epoch | Step | Training Loss | Validation Loss | sentest_cosine_accuracy |
|:------:|:----:|:-------------:|:---------------:|:-----------------------:|
| -1 | -1 | - | - | 0.8806 |
| 0.0983 | 500 | 2.5691 | - | - |
| 0.1965 | 1000 | 1.2284 | 0.6712 | 0.9645 |
| 0.2948 | 1500 | 0.8769 | - | - |
| 0.3930 | 2000 | 0.7151 | 0.4490 | 0.9787 |
| 0.4913 | 2500 | 0.6506 | - | - |
| 0.5895 | 3000 | 0.5855 | 0.3519 | 0.9848 |
| 0.6878 | 3500 | 0.5397 | - | - |
| 0.7860 | 4000 | 0.4998 | 0.3079 | 0.9871 |
| 0.8843 | 4500 | 0.4885 | - | - |
| **0.9825** | **5000** | **0.483** | **0.288** | **0.9883** |
- The bold row denotes the saved checkpoint.
#### Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0
## 📄 License
No license information is provided for this model.
## 📖 Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### TripletLoss
```bibtex
@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
```