# multi-qa-mpnet-base-dot-v1
This is a sentence-transformers model designed for semantic search, mapping sentences and paragraphs to a 768-dimensional dense vector space. It's trained on 215M (question, answer) pairs from various sources.
## 🚀 Quick Start
This model can be used in two ways: with or without the `sentence-transformers` library.
## ✨ Features
- Maps sentences & paragraphs to a 768-dimensional dense vector space.
- Designed for semantic search.
- Trained on 215M (question, answer) pairs from diverse sources.
## 📦 Installation
If you want to use the `sentence-transformers` approach, first install the library:

```bash
pip install -U sentence-transformers
```
## 💻 Usage Examples
### Basic Usage (Sentence-Transformers)
```python
from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load the model and encode the query and documents
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Score each document against the query with dot-product and sort by score, highest first
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)

for doc, score in doc_score_pairs:
    print(score, doc)
```
### Advanced Usage (HuggingFace Transformers)
```python
from transformers import AutoTokenizer, AutoModel
import torch

# CLS pooling: use the embedding of the first ([CLS]) token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode a string or a list of strings into embeddings
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    # Perform CLS pooling
    return cls_pooling(model_output)

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

# Load model and tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")
model = AutoModel.from_pretrained("sentence-transformers/multi-qa-mpnet-base-dot-v1")

# Encode query and documents
query_emb = encode(query)
doc_emb = encode(docs)

# Score each document against the query with dot-product and sort by score, highest first
scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
doc_score_pairs = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)

for doc, score in doc_score_pairs:
    print(score, doc)
```
## 🔧 Technical Details
| Property | Details |
|---|---|
| Dimensions | 768 |
| Produces normalized embeddings | No |
| Pooling method | CLS pooling |
| Suitable score functions | dot-product (e.g. `util.dot_score`) |
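Since the embeddings are not normalized, dot-product is the intended score function; cosine similarity length-normalizes the vectors first and can change both the scores and the ranking. A minimal sketch of the difference, reusing the checkpoint from the usage examples above:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

query_emb = model.encode("How many people live in London?", convert_to_tensor=True)
doc_emb = model.encode(
    ["Around 9 Million people live in London", "London is known for its financial district"],
    convert_to_tensor=True,
)

# Recommended: dot-product on the raw (unnormalized) embeddings
print(util.dot_score(query_emb, doc_emb))

# Cosine similarity normalizes the vectors first, so the scores differ
print(util.cos_sim(query_emb, doc_emb))
```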
## 📚 Documentation
### Background
The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the dataset.
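Concretely, this kind of objective scores each sentence against every candidate in a batch and applies cross-entropy so that the true pair receives the highest score. The following is a minimal, hypothetical PyTorch sketch of such an in-batch-negatives objective; the function and variable names are illustrative and not taken from the actual training code:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(question_emb: torch.Tensor, answer_emb: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """question_emb, answer_emb: (batch_size, dim); row i of each tensor forms a true pair."""
    # Dot-product similarity of every question against every answer in the batch
    scores = question_emb @ answer_emb.T * scale  # (batch_size, batch_size)
    # For question i the correct answer is answer i; all other answers act as negatives
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```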
This model was developed during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face, as part of the project Train the Best Sentence Embedding Model Ever with 1B Training Pairs. The project benefited from efficient hardware infrastructure (7 TPU v3-8s) and from guidance by Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.
### Intended uses
Our model is intended to be used for semantic search: it encodes queries / questions and text paragraphs in a dense vector space and finds the documents relevant to a given query.
#### ⚠️ Important Note
Input is limited to 512 word pieces; anything longer is truncated. Also, the model was trained only on input text of up to 250 word pieces, so it might not work well for longer text.
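To check how much of a long input will actually be encoded, you can tokenize it yourself and compare the length against the model's limit, which sentence-transformers exposes as `model.max_seq_length`. A small sketch; the `long_passage` string is only a placeholder:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')
print(model.max_seq_length)  # maximum number of word pieces that will be encoded

long_passage = "..."  # placeholder for a long document
n_tokens = len(model.tokenizer(long_passage)["input_ids"])
if n_tokens > model.max_seq_length:
    print(f"{n_tokens} word pieces; everything beyond {model.max_seq_length} will be truncated")
```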
### Training procedure
The full training script is available in this repository: `train_script.py`.
#### Pre-training
We use the pretrained `mpnet-base` model. Please refer to its model card for more detailed information about the pre-training procedure.
#### Training
We fine-tune the model on a concatenation of multiple datasets, about 215M (question, answer) pairs in total. Each dataset was sampled with a weighted probability; the configuration is detailed in the `data_config.json` file.
The model was trained with MultipleNegativesRankingLoss using CLS-pooling, dot-product as the similarity function, and a scale of 1.
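For illustration, a hedged sketch of how a comparable setup could be expressed with the sentence-transformers training API; the actual training used the custom `train_script.py` on TPUs, and the toy examples below are placeholders rather than the real data pipeline:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# MPNet-based bi-encoder with CLS pooling (assumed to mirror the setup described above)
word_embedding_model = models.Transformer('microsoft/mpnet-base', max_seq_length=250)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Toy (question, answer) pairs standing in for the 215M real training pairs
train_examples = [
    InputExample(texts=["How many people live in London?", "Around 9 Million people live in London"]),
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives loss with dot-product similarity and scale 1, as described above
train_loss = losses.MultipleNegativesRankingLoss(model, scale=1, similarity_fct=util.dot_score)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```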
| Dataset | Number of training tuples |
|---|---|
| WikiAnswers: Duplicate question pairs from WikiAnswers | 77,427,422 |
| PAQ: Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| Stack Exchange: (Title, Body) pairs from all StackExchanges | 25,316,456 |
| Stack Exchange: (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| MS MARCO: Triplets (query, answer, hard_negative) for 500k queries from the Bing search engine | 17,579,773 |
| GOOAQ (Open Question Answering with Diverse Answer Types): (query, answer) pairs for 3M Google queries and Google featured snippets | 3,012,496 |
| Amazon-QA: (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| Yahoo Answers: (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| Yahoo Answers: (Question, Answer) pairs from Yahoo Answers | 681,164 |
| Yahoo Answers: (Title, Question) pairs from Yahoo Answers | 659,896 |
| SearchQA: (Question, Answer) pairs for 140k questions, each with the Top5 Google snippets for that question | 582,261 |
| ELI5: (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| Stack Exchange: Duplicate question pairs (titles) | 304,525 |
| Quora Question Triplets: (Question, Duplicate_Question, Hard_Negative) triplets for the Quora Question Pairs dataset | 103,663 |
| Natural Questions (NQ): (Question, Paragraph) pairs for 100k real Google queries with a relevant Wikipedia paragraph | 100,231 |
| SQuAD2.0: (Question, Paragraph) pairs from the SQuAD2.0 dataset | 87,599 |
| TriviaQA: (Question, Evidence) pairs | 73,346 |
| **Total** | **214,988,242** |