Question Encoder Model for Long-Form QA
A question encoder model based on the DPR architecture for long-form question answering, leveraging transformer pooler outputs for question representation.
Quick Start
The question encoder model is based on the DPRQuestionEncoder architecture and uses the transformer's pooler outputs as question representations. For more details, refer to this blog post.
Features
- Two-Stage Training: The model vblagoje/dpr-question_encoder-single-lfqa-wiki is trained in two stages using FAIR's dpr-scale.
- Custom Training Data Creation: In each stage, custom DPR-formatted training sets are created with positive, negative, and hard negative samples.
- Performance Metrics: The model is evaluated on the KILT benchmark for long-form question answering (see Performance below).
Installation
No specific installation steps are provided in the original README.
Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, DPRQuestionEncoder

device = "cuda" if torch.cuda.is_available() else "cpu"

model = DPRQuestionEncoder.from_pretrained("vblagoje/dpr-question_encoder-single-lfqa-wiki").to(device)
tokenizer = AutoTokenizer.from_pretrained("vblagoje/dpr-question_encoder-single-lfqa-wiki")

input_ids = tokenizer("Why do airplanes leave contrails in the sky?", return_tensors="pt")["input_ids"].to(device)
embeddings = model(input_ids).pooler_output
```
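At retrieval time, DPR scores passages by the dot product between the question embedding and passage embeddings produced by the paired context encoder (vblagoje/dpr-ctx_encoder-single-lfqa-wiki). A minimal ranking helper over precomputed embeddings might look like this (the function name is illustrative, not part of the model's API):

```python
import torch

def rank_passages(question_emb: torch.Tensor, passage_embs: torch.Tensor, top_k: int = 5):
    """Rank passages by dot-product similarity against one question embedding.

    question_emb: (hidden,) pooler output from the question encoder.
    passage_embs: (n_passages, hidden) pooler outputs from the context encoder.
    Returns the top_k (scores, indices), highest score first.
    """
    scores = passage_embs @ question_emb          # (n_passages,)
    top = torch.topk(scores, k=min(top_k, passage_embs.size(0)))
    return top.values, top.indices
```

In practice the passage embeddings would come from a Faiss index rather than an in-memory tensor, but the scoring function is the same inner product.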
Documentation
Training
We trained vblagoje/dpr-question_encoder-single-lfqa-wiki using FAIR's dpr-scale in two stages.
- First Stage: We started from a PAQ-based pretrained checkpoint and fine-tuned the retriever on question-answer pairs from the LFQA dataset. Since dpr-scale requires a DPR-formatted training set with positive, negative, and hard negative samples, we created a training file in which the answer was the positive sample, negatives were answers unrelated to the question, and hard negatives were drawn from answers to questions with a cosine similarity between 0.55 and 0.65.
- Second Stage: We created a new DPR training set using positives, negatives, and hard negatives drawn from the Wikipedia/Faiss index built in the first stage, rather than from LFQA dataset answers. More precisely, for each dataset question we queried the first-stage Wikipedia Faiss index and then used an SBERT cross-encoder to score the top-k = 50 question/passage pairs. The cross-encoder selected the passage with the highest score as the positive, while the seven lowest-scoring passages were kept as hard negatives. Negative samples were again answers unrelated to the given question. After creating a DPR-formatted training file with these Wikipedia-sourced positive, negative, and hard negative passages, we trained the DPR question and passage encoders using dpr-scale.
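The second-stage selection rule described above (highest cross-encoder score becomes the positive, the seven lowest of the top 50 become hard negatives) can be sketched as follows; the helper name is ours, not part of dpr-scale:

```python
def select_contexts(passages, scores, n_hard_negatives=7):
    """From cross-encoder scores over the top-k retrieved passages,
    pick the highest-scored passage as the positive and the
    n_hard_negatives lowest-scored passages as hard negatives."""
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    positive = ranked[0][0]
    hard_negatives = [p for p, _ in ranked[-n_hard_negatives:]]
    return positive, hard_negatives
```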
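The resulting training file follows the DPR training-data schema (a list of JSON records with positive, negative, and hard negative contexts per question). A single record looks roughly like this; the passage texts here are illustrative placeholders:

```python
import json

# One DPR-formatted training record. Field names follow the schema
# used by the original DPR repository; the content is made up.
record = {
    "question": "Why do airplanes leave contrails in the sky?",
    "answers": [],
    "positive_ctxs": [
        {"title": "Contrail", "text": "Contrails form when hot, humid engine exhaust ..."}
    ],
    "negative_ctxs": [
        {"title": "Sourdough", "text": "A passage unrelated to the question ..."}
    ],
    "hard_negative_ctxs": [
        {"title": "Cloud", "text": "A high-scoring but incorrect passage ..."}
    ],
}
print(json.dumps(record, indent=2))
```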
Performance
The LFQA DPR-based retriever (vblagoje/dpr-question_encoder-single-lfqa-wiki and vblagoje/dpr-ctx_encoder-single-lfqa-wiki) slightly underperforms the state-of-the-art REALM-based retriever of Krishna et al., "Hurdles to Progress in Long-form Question Answering", scoring 11.2 R-precision and 19.5 Recall@5 on the KILT benchmark.
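For reference, the two retrieval metrics reported above can be computed as follows. This is a minimal sketch over item IDs; KILT's official evaluation scores provenance sets rather than bare document lists:

```python
def r_precision(retrieved, relevant):
    """R-precision: precision within the first R results,
    where R is the number of relevant items."""
    r = len(relevant)
    return len(set(retrieved[:r]) & set(relevant)) / r

def recall_at_k(retrieved, relevant, k=5):
    """Recall@k: fraction of relevant items found in the top k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)
```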
Technical Details
The model uses the DPRQuestionEncoder architecture and takes advantage of the transformer's pooler outputs for representing questions. The training process involves complex data preparation with positive, negative, and hard negative samples, and uses FAIR's dpr-scale for training.
License
This project is licensed under the MIT license.