All MPNet base model (v2) for Semantic Search
This model maps sentences & paragraphs to a 768-dimensional dense vector space, facilitating tasks like clustering or semantic search.
Quick Start
This section provides two ways to use the model: with sentence-transformers and with HuggingFace Transformers.
Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Suitable for tasks such as clustering and semantic search.
- Can be used as a sentence and short paragraph encoder.
Installation
To use this model, you need to install sentence-transformers if you haven't already:

```bash
pip install -U sentence-transformers
```
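The Advanced Usage example further below uses HuggingFace Transformers and PyTorch directly instead of sentence-transformers; if you plan to follow that route, install those packages as well (a minimal, unpinned install command):

```bash
pip install -U transformers torch
```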
Usage Examples
Basic Usage
If you have sentence-transformers installed, using this model is straightforward:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
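Since the model targets semantic search, a minimal retrieval sketch may help. The corpus and query below are made-up examples; sentence_transformers.util.cos_sim is used to rank corpus sentences by cosine similarity to the query.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Made-up corpus and query for illustration
corpus = [
    "A man is eating food.",
    "A cheetah is running behind its prey.",
    "The new movie is awesome.",
]
query = "A fast animal chases another animal"

# Encode into 768-dimensional embeddings
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]

# Corpus sentences ranked by similarity to the query
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")
```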
Advanced Usage
Without sentence-transformers, you can use the model as follows. First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling, then normalize
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
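Because the embeddings are L2-normalized at the end of the snippet above, cosine similarity between any two sentences reduces to a plain dot product. A short follow-on sketch, reusing sentence_embeddings from the example:

```python
# Continues the snippet above: sentence_embeddings is already L2-normalized,
# so this matrix product equals the pairwise cosine-similarity matrix.
similarity = sentence_embeddings @ sentence_embeddings.T
print(similarity)
```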
Documentation
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Background
The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. The pretrained microsoft/mpnet-base model was used and fine-tuned on a dataset of 1B sentence pairs. The model was developed during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs.
Intended uses
This model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks. By default, input text longer than 384 word pieces is truncated.
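To see (or adjust) that limit in code, sentence-transformers exposes it on the loaded model as max_seq_length. A small sketch, assuming the library's standard attribute:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Inputs longer than this many word pieces are truncated before encoding;
# for this model it is expected to be 384.
print(model.max_seq_length)
```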
Training procedure
Pre-training
The pretrained microsoft/mpnet-base model is used. Refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
The model is fine-tuned using a contrastive objective. Formally, the cosine similarity is computed for each possible sentence pair from the batch, and then the cross-entropy loss is applied by comparing with the true pairs.
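A minimal sketch of this objective (illustrative only, not the original training code): cosine similarities between every anchor and every candidate in the batch form a score matrix, and cross-entropy treats the true pair on the diagonal as the correct class. The scale factor below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    """Cross-entropy over in-batch cosine similarities.

    Row i of anchor_emb and positive_emb form the true pair; every other row
    in the batch acts as a negative. `scale` is an assumed temperature-like
    factor, not a value taken from the actual training run.
    """
    a = F.normalize(anchor_emb, p=2, dim=1)
    b = F.normalize(positive_emb, p=2, dim=1)
    scores = a @ b.T * scale                                       # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)    # true pair sits on the diagonal
    return F.cross_entropy(scores, labels)

# Toy usage with random embeddings
loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss)
```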
Hyperparameters
The model was trained on a TPU v3-8 for 100k steps with a batch size of 1024 (128 per TPU core). A learning-rate warm-up of 500 steps was used, the sequence length was limited to 128 tokens, and the AdamW optimizer with a 2e-5 learning rate was employed.
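As a rough illustration of those settings (not the original training script; the stand-in module and the linear-decay shape after warm-up are assumptions):

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in module for illustration; the real run fine-tuned microsoft/mpnet-base
model = torch.nn.Linear(768, 768)

# AdamW with a 2e-5 learning rate, as stated above
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# 500 warm-up steps and 100k total steps; linear decay after warm-up is assumed
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=100_000
)
```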
Training data
Multiple datasets were concatenated to fine-tune the model, with a total of over 1 billion sentence pairs. Each dataset was sampled with a weighted probability; the configuration is detailed in the data_config.json file.
| Property | Details |
|----------|---------|
| Model Type | Sentence embedding model |
| Training Data | Concatenation of multiple datasets with a total of 1,170,060,424 sentence pairs |
| Dataset | Paper | Number of training tuples |
|---------|-------|---------------------------|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |
License
This project is licensed under the MIT license.