# multi-qa-mpnet-base-dot-v1
This is a sentence-transformers model designed for semantic search, mapping sentences and paragraphs to a 768-dimensional dense vector space. It's trained on 215M (question, answer) pairs from various sources.
## 🚀 Quick Start
This model can be used in two ways: with or without the `sentence-transformers` library.
## ✨ Features
- Maps sentences & paragraphs to a 768-dimensional dense vector space.
- Designed for semantic search.
- Trained on 215M (question, answer) pairs from diverse sources.
## 📦 Installation
If you want to use the `sentence-transformers` approach, first install the library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples
### Basic Usage (Sentence-Transformers)
```python
from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model and encode the query and documents
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Score each document against the query with dot-product and sort by score, highest first
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)

for doc, score in doc_score_pairs:
    print(score, doc)
```
### Advanced Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel
import torch

# CLS pooling: use the embedding of the first ([CLS]) token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode a string or a list of strings into embeddings
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    # Perform CLS pooling
    return cls_pooling(model_output)

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")

# Encode query and documents
query_emb = encode(query)
doc_emb = encode(docs)

# Score each document against the query with dot-product and sort by score, highest first
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)

for doc, score in doc_score_pairs:
    print(score, doc)
```
## 🔧 Technical Details
| Property | Details |
|---|---|
| Dimensions | 768 |
| Produces normalized embeddings | No |
| Pooling method | CLS pooling |
| Suitable score functions | dot-product (e.g. `util.dot_score`) |
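Since the embeddings are not normalized, dot-product is the intended score function; cosine similarity length-normalizes the vectors first and can change both the scores and the ranking. A minimal sketch of the difference, reusing the checkpoint from the usage examples above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

query_emb = model.encode("How many people live in London?", convert_to_tensor=True)
doc_emb = model.encode(
    ["Around 9 Million people live in London", "London is known for its financial district"],
    convert_to_tensor=True,
)

# Recommended: dot-product on the raw (unnormalized) embeddings
print(util.dot_score(query_emb, doc_emb))

# Cosine similarity normalizes the vectors first, so the scores differ
print(util.cos_sim(query_emb, doc_emb))
```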
## 📚 Documentation
### Background
The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the dataset.
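Concretely, this kind of objective scores each sentence against every candidate in a batch and applies cross-entropy so that the true pair receives the highest score. The following is a minimal, hypothetical PyTorch sketch of such an in-batch-negatives objective; the function and variable names are illustrative and not taken from the actual training code:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(question_emb: torch.Tensor, answer_emb: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """question_emb, answer_emb: (batch_size, dim); row i of each tensor forms a true pair."""
    # Dot-product similarity of every question against every answer in the batch
    scores = question_emb @ answer_emb.T * scale  # (batch_size, batch_size)
    # For question i the correct answer is answer i; all other answers act as negatives
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```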
This model was developed during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. The project benefited from efficient hardware infrastructure (7 TPU v3-8s) and from guidance by Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.
### Intended uses
Our model is intended to be used for semantic search: it encodes queries / questions and text paragraphs in a dense vector space and finds the documents relevant to a given query.
#### ⚠️ Important Note
Input is limited to 512 word pieces; anything longer is truncated. Also, the model was trained only on input text of up to 250 word pieces, so it might not work well for longer text.
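To check how much of a long input will actually be encoded, you can tokenize it yourself and compare the length against the model's limit, which sentence-transformers exposes as `model.max_seq_length`. A small sketch; the `long_passage` string is only a placeholder:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
print(model.max_seq_length)  # maximum number of word pieces that will be encoded

long_passage = "..."  # placeholder for a long document
n_tokens = len(model.tokenizer(long_passage)["input_ids"])
if n_tokens > model.max_seq_length:
    print(f"{n_tokens} word pieces; everything beyond {model.max_seq_length} will be truncated")
```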
### Training procedure
The full training script is available in this repository: `train_script.py`.
#### Pre-training
We use the pretrained `mpnet-base` model. Please refer to its model card for more detailed information about the pre-training procedure.
#### Training
We fine-tune the model on a concatenation of multiple datasets, about 215M (question, answer) pairs in total. Each dataset was sampled with a weighted probability; the configuration is detailed in the `data_config.json` file.
The model was trained with MultipleNegativesRankingLoss using CLS-pooling, dot-product as the similarity function, and a scale of 1.
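For illustration, a hedged sketch of how a comparable setup could be expressed with the sentence-transformers training API; the actual training used the custom `train_script.py` on TPUs, and the toy examples below are placeholders rather than the real data pipeline:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# MPNet-based bi-encoder with CLS pooling (assumed to mirror the setup described above)
word_embedding_model = models.Transformer('microsoft/mpnet-base', max_seq_length=250)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Toy (question, answer) pairs standing in for the 215M real training pairs
train_examples = [
    InputExample(texts=["How many people live in London?", "Around 9 Million people live in London"]),
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives loss with dot-product similarity and scale 1, as described above
train_loss = losses.MultipleNegativesRankingLoss(model, scale=1, similarity_fct=util.dot_score)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```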
| Dataset | Number of training tuples |
|---|---|
| WikiAnswers: Duplicate question pairs from WikiAnswers | 77,427,422 |
| PAQ: Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| Stack Exchange: (Title, Body) pairs from all StackExchanges | 25,316,456 |
| Stack Exchange: (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| MS MARCO: Triplets (query, answer, hard_negative) for 500k queries from the Bing search engine | 17,579,773 |
| GOOAQ (Open Question Answering with Diverse Answer Types): (query, answer) pairs for 3M Google queries and Google featured snippets | 3,012,496 |
| Amazon-QA: (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| Yahoo Answers: (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| Yahoo Answers: (Question, Answer) pairs from Yahoo Answers | 681,164 |
| Yahoo Answers: (Title, Question) pairs from Yahoo Answers | 659,896 |
| SearchQA: (Question, Answer) pairs for 140k questions, each with the Top5 Google snippets for that question | 582,261 |
| ELI5: (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| Stack Exchange: Duplicate question pairs (titles) | 304,525 |
| Quora Question Triplets: (Question, Duplicate_Question, Hard_Negative) triplets for the Quora Question Pairs dataset | 103,663 |
| Natural Questions (NQ): (Question, Paragraph) pairs for 100k real Google queries with a relevant Wikipedia paragraph | 100,231 |
| SQuAD2.0: (Question, Paragraph) pairs from the SQuAD2.0 dataset | 87,599 |
| TriviaQA: (Question, Evidence) pairs | 73,346 |
| **Total** | **214,988,242** |