all-mpnet-base-v2
This is a sentence-transformers model that maps sentences & paragraphs to a 768-dimensional dense vector space, useful for tasks like clustering or semantic search.
Quick Start
This model is easiest to use with the sentence-transformers library. First, install it:
```bash
pip install -U sentence-transformers
```
Then, you can use the model as follows:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
⨠Features
- Maps sentences and paragraphs to a 768-dimensional dense vector space.
- Can be used for clustering, semantic search, and other related tasks.
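For example, the embeddings can be fed directly into a standard clustering algorithm. The sketch below is illustrative only: it assumes scikit-learn is installed, and the sentences and cluster count are made up for the example rather than taken from this card.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans  # assumption: scikit-learn is available

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
sentences = [
    "The new movie is awesome",
    "The latest film was great",
    "Python is my favourite programming language",
    "I love coding in Python",
]

# Encode the sentences into 768-dimensional vectors
embeddings = model.encode(sentences)

# Group semantically similar sentences into two clusters
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # e.g. [0 0 1 1]; cluster ids may be permuted
```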
Installation
To use this model, you need to install the sentence-transformers library:
```bash
pip install -U sentence-transformers
```
Usage Examples
Basic Usage
If you have sentence-transformers installed, using the model is straightforward:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
embeddings = model.encode(sentences)
print(embeddings)
```
Advanced Usage
Without sentence-transformers, you can use the model by passing the input through the transformer model and applying the right pooling operation on the contextualized word embeddings:
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average token embeddings, using the attention mask so that
# padding tokens do not contribute to the sentence embedding.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling, then normalize the embeddings to unit length
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)
```
Documentation
Evaluation Results
For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
Background
The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective. The pretrained microsoft/mpnet-base model was fine-tuned on a dataset of 1B sentence pairs with a contrastive objective: given a sentence from a pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in the dataset.
This model was developed during the Community week using JAX/Flax for NLP & CV organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. The run benefited from efficient hardware infrastructure (7 TPU v3-8 instances) and from guidance by Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.
Intended uses
This model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector can be used for information retrieval, clustering, or sentence similarity tasks. By default, input text longer than 384 word pieces is truncated.
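As a minimal sketch of the sentence-similarity use case, the example below scores a made-up query against a small made-up corpus with cosine similarity via sentence_transformers.util; the sentences are illustrative and not part of this card.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

corpus = [
    "A man is eating food.",
    "A monkey is playing drums.",
    "A cheetah chases its prey across a field.",
]
query = "Someone is having a meal."

# Encode corpus and query; tensors are convenient for util.cos_sim
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```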
Training procedure
Pre-training
The pretrained microsoft/mpnet-base model was used. Refer to its model card for more detailed information about the pre-training procedure.
Fine-tuning
The model was fine-tuned using a contrastive objective. Formally, the cosine similarity was computed for each possible sentence pair from the batch, and then the cross-entropy loss was applied by comparing with the true pairs.
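The sketch below is a simplified, hypothetical rendering of that objective (an in-batch softmax over cosine similarities); the function name and scale factor are assumptions for illustration, and the actual implementation lives in train_script.py.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    """Cross-entropy over cosine similarities of all in-batch sentence pairs.

    anchor_emb, positive_emb: (batch_size, dim) embeddings of paired sentences.
    The scale factor is an assumption, not a value taken from this card.
    """
    anchor = F.normalize(anchor_emb, p=2, dim=1)
    positive = F.normalize(positive_emb, p=2, dim=1)
    # (batch_size, batch_size) matrix of cosine similarities for every possible pair
    logits = anchor @ positive.T * scale
    # the true partner of row i sits in column i, so targets are 0..batch_size-1
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```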
Hyper parameters
The model was trained on a TPU v3-8 for 100k steps with a batch size of 1024 (128 per TPU core). A learning rate warm-up of 500 steps was used, and the sequence length was limited to 128 tokens. The AdamW optimizer with a 2e-5 learning rate was used. The full training script is available in the current repository: train_script.py.
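Purely as an illustration of how those hyper parameters fit together (not the actual training code, which is train_script.py), the optimizer and warm-up schedule could be set up as follows; the linear decay schedule is an assumption.

```python
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained('microsoft/mpnet-base')

# AdamW with the reported learning rate of 2e-5
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# 500 warm-up steps out of 100k total training steps, as reported above
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=100_000,
)
```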
Training data
Multiple datasets were concatenated to fine-tune the model. The total number of sentence pairs is over 1 billion. Each dataset was sampled with a weighted probability, and the configuration is detailed in the data_config.json file.
| Property | Details |
|----------|---------|
| Model Type | Sentence embedding model |
| Training Data | Multiple datasets including Reddit comments, S2ORC, WikiAnswers, etc., with over 1 billion sentence pairs |
| Dataset | Paper | Number of training tuples |
|---------|-------|---------------------------|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |
License
This project is licensed under the Apache-2.0 license.