doc2query/msmarco-t5-base-v1
This is a doc2query model based on T5 (also known as docT5query). It can be used for document expansion and for generating domain-specific training data.
Quick Start
This model can be used in the following two main scenarios:
Features
Document Expansion
You can generate 20-40 queries for your paragraphs and index both the paragraphs and the generated queries in a standard BM25 index such as Elasticsearch, OpenSearch, or Lucene. The generated queries help to bridge the lexical gap in lexical search, as they contain synonyms. Moreover, the expansion re-weights words, giving important words a higher weight even if they appear rarely in a paragraph. In our BEIR paper, we demonstrated that BM25 + docT5query is a powerful search engine. In the BEIR repository, there is an example of how to use docT5query with Pyserini.
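As a rough sketch of this workflow (the passages list and the concatenation-based expansion below are illustrative assumptions; the actual indexing step depends on your search engine):

from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/msmarco-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

def expand_passage(passage, num_queries=20):
    # Generate queries for the passage and append them to its text before BM25 indexing
    input_ids = tokenizer.encode(passage, max_length=320, truncation=True, return_tensors='pt')
    outputs = model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=num_queries)
    queries = [tokenizer.decode(out, skip_special_tokens=True) for out in outputs]
    return passage + ' ' + ' '.join(queries)

# Hypothetical collection of passages; index the expanded strings with Elasticsearch, OpenSearch, or Lucene
passages = ["Python is an interpreted, high-level and general-purpose programming language."]
expanded_docs = [expand_passage(p) for p in passages]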
Domain Specific Training Data Generation
It can be used to generate training data for learning an embedding model. On SBERT.net, we have an example of how to use the model to generate (query, text) pairs for a given collection of unlabeled texts. These pairs can then be used to train powerful dense embedding models.
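A minimal sketch of that idea, assuming a hypothetical list of unlabeled texts and a JSONL output file (the full pipeline is described on SBERT.net):

import json
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/msmarco-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Hypothetical collection of unlabeled texts from your domain
unlabeled_texts = ["Python is an interpreted, high-level and general-purpose programming language."]

# Write (query, text) pairs that can later be used to train a dense embedding model
with open('generated_pairs.jsonl', 'w') as fOut:
    for text in unlabeled_texts:
        input_ids = tokenizer.encode(text, max_length=320, truncation=True, return_tensors='pt')
        outputs = model.generate(input_ids=input_ids, max_length=64, do_sample=True,
                                 top_p=0.95, num_return_sequences=3)
        for out in outputs:
            query = tokenizer.decode(out, skip_special_tokens=True)
            fOut.write(json.dumps({'query': query, 'text': text}) + '\n')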
Usage Examples
Basic Usage
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/msmarco-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

# Encode the passage; inputs are truncated to 320 word pieces (as in training)
input_ids = tokenizer.encode(text, max_length=320, truncation=True, return_tensors='pt')

# Sample 5 queries of up to 64 word pieces each
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)

print("Text:")
print(text)

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')
Important Note
model.generate() is non-deterministic: it produces different queries each time you run it.
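If you need reproducible output, one option (continuing from the example above; this is not part of the original snippet) is to seed the random number generator before sampling, or to switch to deterministic beam search:

import torch

# Option 1: fix the seed so repeated runs sample the same queries
torch.manual_seed(42)

# Option 2: disable sampling and use beam search, which is deterministic
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    num_beams=5,
    num_return_sequences=5,
    do_sample=False)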
Technical Details
Training
This model fine-tuned [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) for 31k training steps (about 4 epochs on the 500k training pairs from MS MARCO). For the training script, see the train_script.py in this repository.
The input text was truncated to 320 word pieces. Output text was generated up to 64 word pieces.
This model was trained on (query, passage) pairs from the [MS MARCO Passage-Ranking dataset](https://github.com/microsoft/MSMARCO-Passage-Ranking).
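For intuition only, a minimal sketch of how a (query, passage) pair might be tokenized for seq2seq fine-tuning with the limits above (an illustrative assumption, not the actual train_script.py):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('google/t5-v1_1-base')

# Hypothetical training pair; in practice these come from MS MARCO
passage = "Python is an interpreted, high-level and general-purpose programming language."
query = "what kind of language is python"

# The passage is the model input (truncated to 320 word pieces);
# the query is the target the model learns to generate (up to 64 word pieces)
inputs = tokenizer(passage, max_length=320, truncation=True, return_tensors='pt')
labels = tokenizer(query, max_length=64, truncation=True, return_tensors='pt').input_ids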
License
This model is licensed under the apache-2.0 license.
| Property | Details |
|----------|---------|
| Datasets | sentence-transformers/embedding-training-data |
| License  | apache-2.0 |