# 🚀 SPAR Lexical Model (Λ)
This model is the context encoder of the Wiki BM25 Lexical Model (Λ) from the SPAR paper, a dense retriever trained to imitate the behavior of BM25.
## 🚀 Quick Start
The Wiki BM25 Lexical Model (Λ) was introduced in the SPAR paper:
Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?
Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta and Wen-tau Yih
Meta AI
The associated GitHub repo is available here: https://github.com/facebookresearch/dpr-scale/tree/main/spar
## ✨ Features
This model is a BERT-base-sized dense retriever trained on Wikipedia articles to imitate the behavior of BM25. The following models are also available:
| Pretrained Model | Corpus | Teacher | Architecture | Query Encoder Path | Context Encoder Path |
| --- | --- | --- | --- | --- | --- |
| Wiki BM25 Λ | Wikipedia | BM25 | BERT-base | facebook/spar-wiki-bm25-lexmodel-query-encoder | facebook/spar-wiki-bm25-lexmodel-context-encoder |
| PAQ BM25 Λ | PAQ | BM25 | BERT-base | facebook/spar-paq-bm25-lexmodel-query-encoder | facebook/spar-paq-bm25-lexmodel-context-encoder |
| MARCO BM25 Λ | MS MARCO | BM25 | BERT-base | facebook/spar-marco-bm25-lexmodel-query-encoder | facebook/spar-marco-bm25-lexmodel-context-encoder |
| MARCO UniCOIL Λ | MS MARCO | UniCOIL | BERT-base | facebook/spar-marco-unicoil-lexmodel-query-encoder | facebook/spar-marco-unicoil-lexmodel-context-encoder |
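All four Λ variants expose the same query/context encoder interface, so any row of the table loads the same way. A minimal sketch (the `variant` prefix variable is just for illustration; the full paths are as listed above):

```python
from transformers import AutoTokenizer, AutoModel

# Pick any row of the table above; here, the MARCO UniCOIL Λ.
variant = "facebook/spar-marco-unicoil-lexmodel"
tokenizer = AutoTokenizer.from_pretrained(f"{variant}-query-encoder")
query_encoder = AutoModel.from_pretrained(f"{variant}-query-encoder")
context_encoder = AutoModel.from_pretrained(f"{variant}-context-encoder")
```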
## 💻 Usage Examples
### Basic Usage
This model should be used together with the associated query encoder, similar to the DPR model.
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')

query = "Where was Marie Curie born?"
contexts = [
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Tokenize the query and the contexts
query_input = tokenizer(query, return_tensors='pt')
ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')

# Use the [CLS] token embedding as the representation
query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :]
ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :]

# Relevance scores are dot products between query and context embeddings
score1 = query_emb @ ctx_emb[0]
score2 = query_emb @ ctx_emb[1]
```
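With more than a couple of passages, the per-context dot products above can be replaced by a single matrix product. A minimal sketch continuing from the code above:

```python
# Score all contexts at once: (1, 768) @ (768, num_ctx) -> (1, num_ctx)
scores = query_emb @ ctx_emb.T
best_idx = scores.argmax(dim=-1).item()
print(f"Best match: {contexts[best_idx]}")
```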
### Advanced Usage
As Λ learns lexical matching from a sparse teacher retriever, it can be used in combination with a standard dense retriever (e.g. DPR, Contriever) to build a dense retriever that excels at both lexical and semantic matching.
In the following example, we show how to build the SPAR-Wiki model for Open-Domain Question Answering by concatenating the embeddings of DPR and the Wiki BM25 Λ.
```python
import torch
from transformers import AutoTokenizer, AutoModel
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# DPR encoders
dpr_ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
dpr_query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")

# Wiki BM25 Λ encoders
lexmodel_tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')

query = "Where was Marie Curie born?"
contexts = [
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Compute DPR embeddings
dpr_query_input = dpr_query_tokenizer(query, return_tensors='pt')['input_ids']
dpr_query_emb = dpr_query_encoder(dpr_query_input).pooler_output

dpr_ctx_input = dpr_ctx_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
dpr_ctx_emb = dpr_ctx_encoder(**dpr_ctx_input).pooler_output

# Compute Λ embeddings ([CLS] token)
lexmodel_query_input = lexmodel_tokenizer(query, return_tensors='pt')
lexmodel_query_emb = lexmodel_query_encoder(**lexmodel_query_input).last_hidden_state[:, 0, :]

lexmodel_ctx_input = lexmodel_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
lexmodel_ctx_emb = lexmodel_context_encoder(**lexmodel_ctx_input).last_hidden_state[:, 0, :]

# Form SPAR embeddings via concatenation. The concatenation weight is
# applied only to the query embedding: since the score is a dot product,
# scaling one side is enough to control the contribution of Λ relative to DPR.
concat_weight = 0.7

spar_query_emb = torch.cat(
    [dpr_query_emb, concat_weight * lexmodel_query_emb],
    dim=-1,
)
spar_ctx_emb = torch.cat(
    [dpr_ctx_emb, lexmodel_ctx_emb],
    dim=-1,
)

# Relevance scores are dot products over the concatenated embeddings
score1 = spar_query_emb @ spar_ctx_emb[0]
score2 = spar_query_emb @ spar_ctx_emb[1]
```
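To turn these pairwise scores into a ranking over an arbitrary number of contexts, the same dot products can be computed in one matrix product and sorted. A minimal sketch continuing from the code above (the ranking loop is illustrative, not part of the released code):

```python
# Rank all contexts by SPAR score, highest first.
scores = (spar_query_emb @ spar_ctx_emb.T).squeeze(0)  # shape: (num_ctx,)
for rank, idx in enumerate(scores.argsort(descending=True).tolist()):
    print(f"{rank + 1}. score={scores[idx]:.2f}  {contexts[idx]}")
```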