🚀 doc2query/msmarco-t5-small-v1
This is a doc2query model based on T5 (also known as docT5query). It can be used for document expansion and generating domain-specific training data, enhancing search performance and facilitating the training of embedding models.
✨ Features
- Document expansion: Generate 20-40 queries for your paragraphs and index the paragraphs together with the generated queries in a standard BM25 index such as Elasticsearch, OpenSearch, or Lucene (see the sketch after this list). The generated queries help close the lexical gap of lexical search, as they contain synonyms and re-weight important words. As shown in the BEIR paper, BM25 + docT5query is a powerful search engine. An example of using docT5query with Pyserini can be found in the BEIR repository.
- Domain-specific training data generation: Generate training data for learning an embedding model. On SBERT.net there is an example of using this model to generate (query, text) pairs for an unlabeled collection of texts, which can then be used to train powerful dense embedding models (see the pair-generation sketch below).
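Below is a minimal sketch of the document-expansion workflow, assuming the `tokenizer` and `model` are loaded as in the Usage Examples section further down. The `generate_queries` helper and the way the expanded text is prepared are illustrative assumptions; in practice the expanded field would be indexed with Elasticsearch, OpenSearch, or Lucene (see the Pyserini example in the BEIR repository).

```python
# Sketch: expand each passage with generated queries before BM25 indexing.
# Assumes `tokenizer` and `model` are already loaded (see Usage Examples below).

def generate_queries(text, num_queries=5):
    # Hypothetical helper: sample `num_queries` queries for one passage.
    input_ids = tokenizer.encode(text, max_length=320, truncation=True, return_tensors='pt')
    outputs = model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=num_queries)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

passages = ["Python is an interpreted, high-level and general-purpose programming language."]

for doc_id, passage in enumerate(passages):
    # Append the generated queries to the passage; this expanded text is what
    # would be stored in the BM25-indexed field instead of the passage alone.
    expanded = passage + " " + " ".join(generate_queries(passage))
    print(doc_id, expanded)
```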
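In the same spirit, here is a rough sketch of generating (query, passage) training pairs from an unlabeled corpus, reusing the hypothetical `generate_queries` helper above. The tab-separated output format is an assumption for illustration; the SBERT.net example shows the full pipeline for training a dense embedding model on such pairs.

```python
# Sketch: write (generated query, passage) pairs to a TSV file that can serve
# as training data for an embedding model. Uses generate_queries() from above.
import csv

corpus = [
    "Python is an interpreted, high-level and general-purpose programming language.",
    "Elasticsearch is a distributed, open search and analytics engine.",
]

with open("generated_pairs.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for passage in corpus:
        for query in generate_queries(passage, num_queries=3):
            writer.writerow([query, passage])  # one (query, passage) pair per line
```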
📦 Installation
No specific installation steps are provided in the original README.
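As a rough guideline (an assumption, since the original README lists no steps), the usage example below only requires the Hugging Face transformers library, SentencePiece (needed by T5Tokenizer), and PyTorch:

```
pip install transformers sentencepiece torch
```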
💻 Usage Examples
Basic Usage
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/msmarco-t5-small-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

# Encode the passage (truncated to 320 word pieces, matching training)
input_ids = tokenizer.encode(text, max_length=320, truncation=True, return_tensors='pt')

# Sample 5 queries of up to 64 word pieces each
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)

print("Text:")
print(text)

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')
```
Note: `model.generate()` is non-deterministic, so it produces different queries each time you run it.
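If more repeatable queries are preferred, one option (not covered in the original README) is to replace the sampling arguments with beam search; the parameter values below are illustrative assumptions:

```python
# Deterministic alternative: beam search instead of top-p sampling
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    do_sample=False)
```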
📚 Documentation
This model was created by fine-tuning [google/t5-v1_1-small](https://huggingface.co/google/t5-v1_1-small) for 31k training steps (about 4 epochs on the 500k training pairs from MS MARCO). For the training script, see train_script.py in this repository.
The input text was truncated to 320 word pieces, and the output text was generated with up to 64 word pieces.
This model was trained on (query, passage) pairs from the [MS MARCO Passage-Ranking dataset](https://github.com/microsoft/MSMARCO-Passage-Ranking).
📄 License
This project is licensed under the Apache 2.0 license.