Open-source DPR Question Encoder Model - For Open-domain Question Answering, Trained on NQ Dataset

Dpr Question Encoder Single Nq Base

Developed by Facebook

DPR (Dense Passage Retrieval) is a set of tools and models for open-domain question answering research. This model is a BERT-based question encoder trained on the Natural Questions (NQ) dataset.

Question Answering System

Transformers

English #Open-domain QA #Dense passage retrieval #BERT encoder

Downloads 32.90k

Release Time: 3/2/2022

Model Overview

This model is the question encoder in the DPR series, primarily used to encode natural language questions into vector representations for retrieving relevant passages in open-domain question answering systems.

Model Features

Efficient Retrieval

Encodes questions into low-dimensional vectors to support fast retrieval of relevant passages

Open-domain QA

Optimized for open-domain question answering tasks, capable of handling a wide range of natural language questions

BERT-based Architecture

Based on the proven BERT-base architecture with strong language understanding capabilities

Model Capabilities

Question vectorization

Semantic similarity calculation

Open-domain QA support

Use Cases

QA Systems

Open-domain QA

Building intelligent QA systems capable of answering questions across broad domains

Achieves 78.4% Top-20 accuracy on the NQ dataset

Information Retrieval

Semantic Retrieval

Document retrieval systems based on semantic rather than keyword matching

🚀 `dpr-question_encoder-single-nq-base`

dpr-question_encoder-single-nq-base is the question encoder of the Dense Passage Retrieval (DPR) framework. It was trained on the Natural Questions (NQ) dataset and can be used for open-domain question answering tasks.

🚀 Quick Start

Use the code below to get started with the model.

from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

# Load the tokenizer and the question encoder from the Hugging Face Hub.
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# Tokenize one question and encode it; pooler_output holds the question embedding.
input_ids = tokenizer("Hello, is my dog cute ?", return_tensors="pt")["input_ids"]
embeddings = model(input_ids).pooler_output

✨ Features

Model Details

Model Description: Dense Passage Retrieval (DPR) is a set of tools and models for state-of-the-art open-domain Q&A research. dpr-question_encoder-single-nq-base is the question encoder trained using the Natural Questions (NQ) dataset (Lee et al., 2019; Kwiatkowski et al., 2019).
Developed by: See GitHub repo for model developers.
Model Type: BERT-based encoder
Language(s): English
License: CC-BY-NC-4.0, also see Code of Conduct
Related Models:
Resources for more information:

Uses

Direct Use

dpr-question_encoder-single-nq-base, dpr-ctx_encoder-single-nq-base, and dpr-reader-single-nq-base can be used for the task of open-domain question answering.

Misuse and Out-of-scope Use

The model should not be used to intentionally create hostile or alienating environments for people. In addition, the set of DPR models was not trained to be factual or true representations of people or events, and therefore using the models to generate such content is out-of-scope for the abilities of this model.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware this section may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al., 2021 and Bender et al., 2021). Predictions generated by the model can include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

📚 Documentation

Training

Training Data

This model was trained using the Natural Questions (NQ) dataset (Lee et al., 2019; Kwiatkowski et al., 2019). The model authors write that:

[The dataset] was designed for end-to-end question answering. The questions were mined from real Google search queries and the answers were spans in Wikipedia articles identified by annotators.

Training Procedure

The training procedure is described in the associated paper:

Given a collection of M text passages, the goal of our dense passage retriever (DPR) is to index all the passages in a low-dimensional and continuous space, such that it can retrieve efficiently the top k passages relevant to the input question for the reader at run-time.

Our dense passage retriever (DPR) uses a dense encoder E_P(·) which maps any text passage to a d-dimensional real-valued vectors and builds an index for all the M passages that we will use for retrieval. At run-time, DPR applies a different encoder E_Q(·) that maps the input question to a d-dimensional vector, and retrieves k passages of which vectors are the closest to the question vector.
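The retrieval step quoted above can be illustrated with a minimal, dependency-free sketch. It assumes that "closest" means the highest inner product between the question vector and each passage vector, which is the similarity the DPR paper uses. The function names `dot` and `top_k_passages` and the toy 3-dimensional vectors are hypothetical and stand in for real encoder outputs. A production system would use a FAISS index rather than the exhaustive scan shown here.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def top_k_passages(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages."""
    scores = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    # Sort by score, highest first; ties keep the lower passage index first.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Toy vectors (hypothetical data, not real DPR embeddings).
passages = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.4, 0.4, 0.2],
]
question = [1.0, 0.0, 0.0]

ranked = top_k_passages(question, passages, k=2)
print([index for index, _ in ranked])  # passage indices, best match first
```

The exhaustive scan costs one inner product per passage, which is fine for a toy corpus but is exactly what an approximate or optimized index is meant to avoid at the scale of millions of passages.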

The authors report that, for the encoders, they used two independent BERT (Devlin et al., 2019) networks (base, uncased) and used FAISS (Johnson et al., 2017) at inference time to encode and index passages. See the paper for further details on training, including encoders, inference, positive and negative passages, and in-batch negatives.
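As a rough illustration of the in-batch negatives idea mentioned above, the sketch below computes a negative log-likelihood over a square score matrix. It assumes that `sim[i][j]` is the similarity of question i with the positive passage of question j, so the diagonal holds the positive pairs and every other entry in a row acts as a negative. The function name `in_batch_nll`, the averaging over the batch, and the toy scores are assumptions made for this example. The paper remains the authoritative description of the training objective.

```python
import math
from typing import Sequence


def in_batch_nll(sim: Sequence[Sequence[float]]) -> float:
    """Mean negative log-likelihood of the diagonal (positive) entries.

    sim[i][j] is the score of question i against passage j; passage i is
    treated as the positive for question i, all other passages as negatives.
    """
    n = len(sim)
    if n == 0:
        raise ValueError("score matrix must not be empty")
    total = 0.0
    for i, row in enumerate(sim):
        if len(row) != n:
            raise ValueError("score matrix must be square")
        # Log-sum-exp with the row maximum subtracted for numerical stability.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_z - row[i]
    return total / n


# Toy similarity scores for a batch of three questions (hypothetical data).
scores = [
    [5.0, 1.0, 0.0],
    [0.5, 4.0, 1.5],
    [0.0, 2.0, 6.0],
]
loss = in_batch_nll(scores)
print(f"mean in-batch NLL: {loss:.4f}")
```

The loss shrinks as the diagonal scores grow relative to the rest of each row, which is the behaviour a retriever trained this way is pushed toward.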

Evaluation

Testing Data, Factors and Metrics

The model developers report the performance of the model on five QA datasets, using the top-k accuracy (k ∈ {20, 100}). The datasets were NQ, TriviaQA, WebQuestions (WQ), CuratedTREC (TREC), and SQuAD v1.1.
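For readers unfamiliar with the metric, here is a small sketch of how top-k retrieval accuracy can be computed. It assumes the usual definition: the fraction of questions for which at least one of the k highest-ranked passages contains the answer. The function name `top_k_accuracy` and the toy hit lists are hypothetical, and deciding whether a passage contains the answer is left outside the sketch.

```python
from typing import Sequence


def top_k_accuracy(hits_per_question: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of questions with an answer-bearing passage among the top k.

    hits_per_question[i][r] is True when the passage at rank r (0-based,
    best first) retrieved for question i contains the answer.
    """
    if not hits_per_question:
        raise ValueError("need at least one question")
    correct = sum(1 for hits in hits_per_question if any(hits[:k]))
    return correct / len(hits_per_question)


# Toy rankings for three questions (hypothetical data).
hits = [
    [False, True, False],   # answer first appears at rank 2
    [False, False, True],   # answer first appears at rank 3
    [True, False, False],   # answer appears at rank 1
]

for k in (1, 2, 3):
    print(k, top_k_accuracy(hits, k))
```

Because the top-k set grows with k, the metric can only stay flat or increase as k grows, which matches the pattern of Top-100 scores exceeding Top-20 scores.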

Results

Dataset	Top-20	Top-100
NQ	78.4	85.4
TriviaQA	79.4	85.0
WQ	73.2	81.4
TREC	79.8	89.1
SQuAD	63.2	77.2

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). We present the hardware type based on the associated paper.

Property	Details
Hardware Type	8 32GB GPUs
Hours used	Unknown
Cloud Provider	Unknown
Compute Region	Unknown
Carbon Emitted	Unknown

Technical Specifications

See the associated paper for details on the modeling architecture, objective, compute infrastructure, and training details.

Citation Information

  @inproceedings{karpukhin-etal-2020-dense,
    title = "Dense Passage Retrieval for Open-Domain Question Answering",
    author = "Karpukhin, Vladimir and Oguz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.550",
    doi = "10.18653/v1/2020.emnlp-main.550",
    pages = "6769--6781",
}

Model Card Authors

This model card was written by the team at Hugging Face.
