DistilBERT for Dense Passage Retrieval trained with Balanced Topic Aware Sampling (TAS-B)
This project provides a retrieval-trained, DistilBERT-based model (we call this dual-encoder, dot-product scoring architecture BERT_Dot), trained with Balanced Topic Aware Sampling on MSMARCO-Passage. The model can be used to re-rank a candidate set or for direct dense retrieval over a vector index.
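A minimal usage sketch, assuming the checkpoint is available through the Hugging Face transformers library; the model identifier, sequence lengths, and example texts below are illustrative assumptions, not part of this card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id)  # the same DistilBERT encoder serves queries and passages

def encode(texts, max_length=200):
    # Tokenize, run the shared encoder, and pool the CLS vector as the representation.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0, :]  # CLS pooling

query_vec = encode(["what is dense passage retrieval"])
passage_vecs = encode([
    "Dense passage retrieval encodes queries and passages into single vectors ...",
    "BM25 is a classic sparse bag-of-words retrieval baseline ...",
])

scores = query_vec @ passage_vecs.T  # dot-product relevance scores, higher is better
print(scores)
```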
Features
- Training Configuration: Trained with a batch size of 256.
- Model Architecture: A 6-layer DistilBERT without any architectural additions or modifications (only the weights are changed during training). The CLS vector is pooled to obtain query/passage representations. The same BERT layers are used for both query and passage encoding, which yields better results and reduces memory requirements.
- Efficient Training: The batch composition procedure and dual supervision for dense retrieval training are efficient and can be completed on a single consumer GPU in 48 hours.
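Since each passage is represented by a single vector, direct dense retrieval reduces to nearest-neighbor search over pre-computed passage vectors. Below is a small sketch under the assumption of FAISS as the index library, reusing the `encode` helper from the snippet above; corpus and query texts are placeholders.

```python
import faiss  # assumed index-library choice; any flat or ANN inner-product index works
import numpy as np

passages = ["first passage text ...", "second passage text ...", "third passage text ..."]
passage_vecs = encode(passages).numpy().astype(np.float32)

index = faiss.IndexFlatIP(passage_vecs.shape[1])  # inner product == dot-product scoring
index.add(passage_vecs)

query_vec = encode(["example query"]).numpy().astype(np.float32)
scores, ids = index.search(query_vec, 2)  # top-2 passages by dot product
for score, i in zip(scores[0], ids[0]):
    print(round(float(score), 3), passages[i])
```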
Documentation
Effectiveness on MSMARCO Passage & TREC-DL'19
The model is trained on the standard MSMARCO ("small", ~400K query) training triples, re-sampled with the TAS-B method. Training uses dual supervision: BERT_CAT pairwise scores serve as the teacher for the pairwise signal, and a ColBERT model provides the teacher signals for the in-batch negatives.
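For intuition, the pairwise part of this dual supervision follows the Margin-MSE idea: the student's score margin between a positive and a negative passage is regressed onto the teacher's margin. A hedged sketch (function and variable names are illustrative, not the paper's code):

```python
import torch

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    # Match the student's (query, pos) - (query, neg) score margin to the teacher's margin.
    return torch.nn.functional.mse_loss(student_pos - student_neg,
                                        teacher_pos - teacher_neg)
```

In the paper, an analogous distillation signal over the in-batch negatives (with ColBERT as the teacher) is added to this pairwise term.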
MSMARCO-DEV (7K)
|                            | MRR@10 | NDCG@10 | Recall@1K |
|----------------------------|--------|---------|-----------|
| BM25                       | .194   | .241    | .857      |
| TAS-B BERT_Dot (Retrieval) | .347   | .410    | .978      |
TREC-DL'19
For MRR and Recall, the recommended binarization point of graded relevance ≥ 2 is used; results may therefore not be directly comparable to evaluations that binarize at a different point.
|                            | MRR@10 | NDCG@10 | Recall@1K |
|----------------------------|--------|---------|-----------|
| BM25                       | .689   | .501    | .739      |
| TAS-B BERT_Dot (Retrieval) | .883   | .717    | .843      |
TREC-DL'20
Similarly, for MRR and Recall, the recommended binarization point of graded relevance ≥ 2 is used.
|                            | MRR@10 | NDCG@10 | Recall@1K |
|----------------------------|--------|---------|-----------|
| BM25                       | .649   | .475    | .806      |
| TAS-B BERT_Dot (Retrieval) | .843   | .686    | .875      |
For more baselines, information, and analysis, please refer to the paper: Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
Limitations & Bias
- Social Biases: The model inherits social biases from both DistilBERT and MSMARCO.
- Text Length Limitation: The model was trained only on the relatively short passages of MSMARCO (about 60 words on average), so it may have difficulties with longer text.
Citation
If you use our model checkpoint, please cite our work as follows:
@inproceedings{Hofstaetter2021_tasb_dense_retrieval,
author = {Sebastian Hofst{\"a}tter and Sheng-Chieh Lin and Jheng-Hong Yang and Jimmy Lin and Allan Hanbury},
title = {{Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling}},
booktitle = {Proc. of SIGIR},
year = {2021},
}
Quick Start
For more information and a minimal usage example, please visit: tas-balanced-dense-retrieval
Usage Tip
If you want to know more about our efficient batch composition procedure and dual supervision for dense retrieval training, check out our paper: Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling