Open Source sentence-transformers-all-mini-lm-l6-v2 Model - Efficiently Calculate Sentence Similarity

Sentence Transformers All Mini Lm L6 V2

Developed by danielpark

A lightweight sentence embedding model built on the MiniLM architecture and designed for efficient sentence similarity calculation

Text Embedding

Safetensors

Language: English

Open Source License: Apache-2.0

Tags: #English Sentence Similarity #Efficient and Lightweight #Contrastive Learning Optimization

Downloads 78

Release Time : 10/13/2023

Model Overview

This model, fine-tuned through contrastive learning, encodes sentences as vectors so that the semantic similarity between two sentences can be computed from their embeddings. At 80 MB and 14,200 sentences per second, it is smaller and faster than larger models such as all-mpnet-base-v2 (420 MB, 2,800 sentences per second), at a modest cost in average benchmark performance (58.80 vs. 63.30; see the Performance table).

Model Features

Efficient and Lightweight

Only 80MB in size, with an inference speed of 14,200 sentences per second, suitable for deployment in resource-constrained environments

Multi-domain Adaptation

Fine-tuned on more than 1.17 billion training tuples from the sources listed in the Datasets table, spanning academic papers, Q&A communities, and technical documents

Contrastive Learning Optimization

Fine-tuned with in-batch negatives and a contrastive loss computed over cosine similarities

Model Capabilities

Sentence vectorization

Semantic similarity calculation

Semantic search support

Text feature extraction

Use Cases

Information Retrieval

Q&A System Matching

Encode user questions and knowledge base questions to match the most similar results

Performs well on retrieval benchmarks like MS MARCO
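The matching step described above can be sketched in a few lines. This is a minimal illustration, not the project's own code: the function names `cosine_similarity` and `top_matches` are hypothetical, the vectors are assumed to be non-zero embeddings already produced by the model, and only the standard library is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vec, kb_vecs, k=3):
    """Return (index, score) for the k knowledge-base vectors closest to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(kb_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In practice `query_vec` would be the embedding of the user question and `kb_vecs` the embeddings of the knowledge base questions; the returned indices point back into the knowledge base.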

Content Deduplication

Community Q&A Deduplication

Identify duplicate questions on platforms like StackExchange

Training data includes WikiAnswers duplicate question pairs
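One way to flag duplicate questions, sketched under stated assumptions: embeddings are already computed and non-zero, and any pair whose cosine similarity reaches a threshold is reported. The function names and the default threshold of 0.9 are hypothetical; a suitable threshold would need to be tuned on real data.

```python
import math
from itertools import combinations


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def find_duplicate_pairs(vectors, threshold=0.9):
    """Return index pairs (i, j), i < j, whose cosine similarity is >= threshold."""
    # After normalization, the dot product equals the cosine similarity.
    unit = [normalize(v) for v in vectors]
    pairs = []
    for i, j in combinations(range(len(unit)), 2):
        score = sum(a * b for a, b in zip(unit[i], unit[j]))
        if score >= threshold:
            pairs.append((i, j))
    return pairs
```

This compares every pair, so the cost grows quadratically with the number of questions; for large collections an approximate nearest-neighbor index would likely be a better fit.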

🚀 Sentence Transformers

This project forks sentence-transformers/all-MiniLM-L6-v2 because it aligns well with the target dataset and use case. For more details, refer to the pre-trained model weight repository.

🚀 Quick Start

We forked the sentence-transformers/all-MiniLM-L6-v2 model because it closely matches the target dataset and use case. You can find more details in the pre-trained model weight repository:

https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Commit Hash: 7dbbc90392e2f80f3d3c277d6e90027e55de9125
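A minimal loading sketch, assuming the `sentence-transformers` and `huggingface_hub` packages are installed. The repository ID and commit hash are the ones listed above; the function names `load_pinned_model` and `encode_sentences` are hypothetical. Imports are deferred into the function so the snippet can be imported even where the libraries are not available.

```python
REPO_ID = "sentence-transformers/all-MiniLM-L6-v2"
COMMIT_HASH = "7dbbc90392e2f80f3d3c277d6e90027e55de9125"


def load_pinned_model():
    """Download the upstream weights at the pinned commit and load them."""
    from huggingface_hub import snapshot_download
    from sentence_transformers import SentenceTransformer

    # Fetch the repository snapshot at exactly this commit.
    local_dir = snapshot_download(repo_id=REPO_ID, revision=COMMIT_HASH)
    return SentenceTransformer(local_dir)


def encode_sentences(model, sentences):
    """Return one embedding vector per input sentence."""
    return model.encode(sentences)
```

Calling `encode_sentences(load_pinned_model(), ["How do I reset my password?"])` would return one embedding for the sentence; pinning the commit keeps results reproducible if the upstream repository changes.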

✨ Features

Forked from sentence-transformers/all-MiniLM-L6-v2 for better alignment with the target dataset and use case.
Fine-tuned using a contrastive objective for sentence similarity tasks.

🔧 Technical Details

Fine-tuning

The model is fine-tuned using a contrastive objective.
Cosine similarity is computed for each possible sentence pair in the batch.
Cross-entropy loss is then applied, with the true pairs as the target labels.
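The objective above can be illustrated with a small standard-library sketch. It is not the training script: the function names are hypothetical, and the `scale` factor applied to the similarities before the softmax is an assumption (the text does not state one). For each anchor, every positive in the batch acts as a candidate, and the cross-entropy target is the anchor's own pair.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the true pair of anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every candidate in the batch.
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        # Cross-entropy with the true pair (index i) as the target class.
        total += log_sum_exp - logits[i]
    return total / len(anchors)
```

The loss is near zero when each anchor is most similar to its own pair and grows when another sentence in the batch scores higher, which is why larger batches supply more negatives per step.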

Hyperparameters

The model is trained for 100k steps with a batch size of 1024 (128 per TPU core).
A learning-rate warm-up of 500 steps is used.
The sequence length is limited to 128 tokens.
The AdamW optimizer with a learning rate of 2e-5 is employed.
The full training script can be found in the current repository: train_script.py.
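The hyperparameters above imply some simple arithmetic, sketched here. Two points are assumptions rather than statements from the text: the warm-up is treated as 500 steps of linear increase, and the learning rate is assumed to decay linearly to zero afterwards. The constant and function names are hypothetical.

```python
TOTAL_STEPS = 100_000
WARMUP_STEPS = 500      # assumed to be measured in optimizer steps
PEAK_LR = 2e-5
GLOBAL_BATCH = 1024
PER_CORE_BATCH = 128

# 1024 examples per step at 128 per core implies 8 TPU cores.
NUM_CORES = GLOBAL_BATCH // PER_CORE_BATCH

# Training examples processed over the whole run.
EXAMPLES_SEEN = TOTAL_STEPS * GLOBAL_BATCH


def learning_rate(step):
    """Linear warm-up to PEAK_LR, then (assumed) linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)
```

Under these numbers the run processes 102,400,000 examples in total, so each step draws a fresh batch from a much larger pool of training tuples.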

Performance

Model Name	Performance Sentence Embeddings (14 Datasets)	Performance Semantic Search (6 Datasets)	Avg. Performance	Speed (sentences/sec)	Model Size
all-mpnet-base-v2	69.57	57.02	63.30	2800	420 MB
multi-qa-mpnet-base-dot-v1	66.76	57.60	62.18	2800	420 MB
all-distilroberta-v1	68.73	50.94	59.84	4000	290 MB
all-MiniLM-L12-v2	68.70	50.82	59.76	7500	120 MB
multi-qa-distilbert-cos-v1	65.98	52.83	59.41	4000	250 MB
all-MiniLM-L6-v2 (This model)	68.06	49.54	58.80	14200	80 MB
multi-qa-MiniLM-L6-cos-v1	64.33	51.83	58.08	14200	80 MB
paraphrase-multilingual-mpnet-base-v2	65.83	41.68	53.75	2500	970 MB
paraphrase-albert-small-v2	64.46	40.04	52.25	5000	43 MB
paraphrase-multilingual-MiniLM-L12-v2	64.25	39.19	51.72	7500	420 MB
paraphrase-MiniLM-L3-v2	62.29	39.19	50.74	19000	61 MB
distiluse-base-multilingual-cased-v1	61.30	29.87	45.59	4000	480 MB
distiluse-base-multilingual-cased-v2	60.18	27.35	43.77	4000	480 MB
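To make the Speed column concrete, the sketch below estimates how long encoding a corpus would take at the listed throughput. The speeds are copied from the table for three of the models; the names `SPEED` and `estimated_encode_seconds` are hypothetical, and real throughput depends on hardware and sentence length, which the table does not specify.

```python
# Sentences per second, copied from the table above.
SPEED = {
    "all-mpnet-base-v2": 2800,
    "all-MiniLM-L12-v2": 7500,
    "all-MiniLM-L6-v2": 14200,
}


def estimated_encode_seconds(model_name, num_sentences):
    """Rough encoding time in seconds at the table's listed throughput."""
    return num_sentences / SPEED[model_name]
```

At these rates, one million sentences would take roughly 70 seconds with all-MiniLM-L6-v2 and roughly 357 seconds with all-mpnet-base-v2, a factor of about five.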

Datasets

Dataset	Paper	Number of training tuples
[Reddit comments (2015-2018)](https://github.com/PolyAI-LDN/conversational-datasets/tree/master/reddit)	paper	726,484,430
S2ORC Citation pairs (Abstracts)	[paper](https://aclanthology.org/2020.acl-main.447/)	116,288,806
[WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs	paper	77,427,422
PAQ (Question, Answer) pairs	paper	64,371,441
S2ORC Citation pairs (Titles)	[paper](https://aclanthology.org/2020.acl-main.447/)	52,603,982
S2ORC (Title, Abstract)	[paper](https://aclanthology.org/2020.acl-main.447/)	41,769,185
[Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs	-	25,316,456
[Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title+Body, Answer) pairs	-	21,396,559
[Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs	-	21,396,559
MS MARCO triplets	paper	9,144,553
GOOAQ: Open Question Answering with Diverse Answer Types	paper	3,012,496
[Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer)	[paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html)	1,198,260
Code Search	-	1,151,414
COCO Image captions	[paper](https://link.springer.com/chapter/10.1007%2F978-3-319-10602-1_48)	828,395
SPECTER citation triplets	[paper](https://doi.org/10.18653/v1/2020.acl-main.207)	684,100
[Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer)	[paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html)	681,164
[Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question)	[paper](https://proceedings.neurips.cc/paper/2015/hash/250cf8b51c773f3f8dc8b4be867a9a02-Abstract.html)	659,896
SearchQA	paper	582,261
Eli5	[paper](https://doi.org/10.18653/v1/p19-1346)	325,475
Flickr 30k	paper	317,695
[Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles)	-	304,525
AllNLI (SNLI and MultiNLI)	[paper SNLI](https://doi.org/10.18653/v1/d15-1075), [paper MultiNLI](https://doi.org/10.18653/v1/n18-1101)	277,230
[Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (bodies)	-	250,519
[Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions (titles+bodies)	-	250,460
[Sentence Compression](https://github.com/google-research-datasets/sentence-compression)	[paper](https://www.aclweb.org/anthology/D13-1155/)	180,000
Wikihow	paper	128,542
Altlex	[paper](https://aclanthology.org/P16-1135.pdf)	112,696
[Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs)	-	103,663
Simple Wikipedia	[paper](https://www.aclweb.org/anthology/P11-2117/)	102,225
Natural Questions (NQ)	paper	100,231
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/)	[paper](https://aclanthology.org/P18-2124.pdf)	87,599
TriviaQA	-	73,346
Total		1,170,060,424

📄 License

This project is licensed under the Apache 2.0 license.
