All-with_prefix-t5-base-v1 Open-source Model - Freely Achieve Document Expansion and Training Data Generation

All With Prefix T5 Base V1

Developed by doc2query

T5-based doc2query model for document expansion and training data generation

Language: English

Open Source License: Apache-2.0

Tags: #Document Expansion Query Generation #Domain Training Data Generation #Multi-prefix Controlled Output

Release Time : 3/2/2022

Model Overview

This model, based on the T5 architecture, can generate relevant queries for documents to enhance search engine effectiveness or create training data.

Model Features

Multi-prefix Support

Supports multiple prefix inputs, enabling generation of different types of output text based on different prefixes

Document Expansion

Can generate 20-40 relevant queries for a document to help bridge the vocabulary gap in lexical search

Training Data Generation

Can be used to generate (query, text) pairs for unlabeled text to train embedding models

Model Capabilities

Text Generation

Query Generation

Document Expansion

Training Data Generation

Use Cases

Information Retrieval

Search Engine Enhancement

Index generated queries alongside original documents to improve BM25 search engine performance

Demonstrated significant improvement in search effectiveness in BEIR evaluations

Machine Learning

Embedding Model Training

Generate (query, text) pairs for training dense embedding models

🚀 doc2query/all-with_prefix-t5-base-v1

A doc2query model based on T5, also known as docT5query, with diverse applications in document expansion and training data generation.

🚀 Quick Start

This model can serve two main purposes:

Document expansion: Generate 20-40 queries for each paragraph and index both the paragraph and its generated queries in a standard BM25 index such as Elasticsearch, OpenSearch, or Lucene. The generated queries help close the lexical gap in lexical search because they contain synonyms. They also re-weight words, giving important words a higher weight even if they appear rarely in the paragraph. In our BEIR paper, we showed that BM25 + docT5query is a powerful search engine. An example of using docT5query with Pyserini can be found in the BEIR repository.
Domain-specific training data generation: Generate training data for learning an embedding model. On SBERT.net, there is an example of using the model to generate (query, text) pairs for a given collection of unlabeled texts. These pairs can then be used to train powerful dense embedding models.
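Both purposes come down to simple post-processing of the generated queries. The following is a minimal sketch, not the method used in the BEIR or SBERT.net examples: the helper names `expand_document` and `to_training_pairs` are hypothetical, and the queries are assumed to have been produced already by the model.

```python
from typing import List, Tuple


def expand_document(paragraph: str, queries: List[str]) -> str:
    """Return the text to index: the paragraph followed by its generated queries.

    Repeated queries are kept on purpose, so terms the model generates often
    occur more often in the indexed text (assumption: the index scores by
    term frequency, as BM25 does).
    """
    return " ".join([paragraph, *queries])


def to_training_pairs(paragraph: str, queries: List[str]) -> List[Tuple[str, str]]:
    """Build de-duplicated (query, text) pairs for embedding-model training."""
    seen = set()
    pairs = []
    for query in queries:
        query = query.strip()
        # Skip empty generations and exact duplicates
        if query and query not in seen:
            seen.add(query)
            pairs.append((query, paragraph))
    return pairs


if __name__ == "__main__":
    paragraph = "Python is an interpreted, high-level programming language."
    queries = ["what is python?", "is python a programming language?", "what is python?"]
    print(expand_document(paragraph, queries))
    print(to_training_pairs(paragraph, queries))
```

The expanded string would be what gets indexed in the BM25 engine in place of the bare paragraph, while the pairs would feed a training loop for a dense embedding model.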

✨ Features

Versatile Applications: Suitable for both document expansion and domain-specific training data generation.
Lexical Gap Bridging: Helps improve search performance by generating queries with synonyms.
Prefix-Based Output: Allows for different types of output based on specific prefixes.

💻 Usage Examples

Basic Usage

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the tokenizer and the fine-tuned doc2query model
model_name = 'doc2query/all-with_prefix-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# The prefix selects the type of output text
prefix = "answer2question"
text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

# The model input has the form "<prefix>: <text>"
text = prefix+": "+text

# Truncate the input to 384 word pieces, as during training
input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')

# Sample 5 outputs of up to 64 word pieces each with nucleus (top-p) sampling
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)

print("Text:")
print(text)

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')

⚠️ Important Note

With sampling enabled (do_sample=True, as in the example above), model.generate() is non-deterministic. It produces different queries each time you run it.
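If repeatable output is needed, two common options are to fix the random seed before sampling or to switch to beam search. The sketch below is an illustration under stated assumptions rather than a recommendation from the model authors: `generation_kwargs` and `generate_with_seed` are hypothetical helper names, and the beam-search settings are not taken from the original README. `set_seed` is the seeding utility from the transformers library.

```python
def generation_kwargs(reproducible: bool, num_queries: int = 5, max_length: int = 64) -> dict:
    """Return keyword arguments for model.generate().

    reproducible=False mirrors the sampling settings of the usage example.
    reproducible=True uses beam search, which involves no random sampling
    (assumption: the less diverse queries this yields are acceptable).
    """
    if reproducible:
        return dict(
            max_length=max_length,
            do_sample=False,
            num_beams=num_queries,
            num_return_sequences=num_queries,
        )
    return dict(
        max_length=max_length,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=num_queries,
    )


def generate_with_seed(model, input_ids, seed: int, **kwargs):
    """Seed the random number generators, then generate.

    Imported lazily so the helper above stays usable without transformers.
    """
    from transformers import set_seed

    set_seed(seed)
    return model.generate(input_ids=input_ids, **kwargs)
```

With the model and input_ids from the usage example, `generate_with_seed(model, input_ids, seed=42, **generation_kwargs(reproducible=False))` should return the same samples on repeated runs in the same environment; results may still differ across hardware or library versions.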

📚 Documentation

Training

This model was fine-tuned from [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) for 575k training steps. For the training script, see train_script.py in this repository.

The input text was truncated to 384 word pieces. Output text was generated up to 64 word pieces.

This model was trained on a large collection of datasets. For the exact dataset names and weights, see data_config.json in this repository. Most of the datasets are available at [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers).

The datasets include:

(title, body) pairs from [Reddit](https://huggingface.co/datasets/sentence-transformers/reddit-title-body)
(title, body) pairs and (title, answer) pairs from StackExchange and Yahoo Answers!
(title, review) pairs from Amazon reviews
(query, paragraph) pairs from MS MARCO, NQ, and GooAQ
(question, duplicate_question) pairs from Quora and WikiAnswers
(title, abstract) pairs from S2ORC

Prefix

This model was trained with a prefix: you start the input text with a specific prefix that defines the type of output text you would like to receive. The output differs depending on the prefix.

E.g. the above text about Python produces the following output:

| Prefix | Output |
| --- | --- |
| answer2question | Why should I use python in my business? ; What is the difference between Python and .NET? ; what is the python design philosophy? |
| review2title | Python a powerful and useful language ; A new and improved programming language ; Object-oriented, practical and accessible |
| abstract2title | Python: A Software Development Platform ; A Research Guide for Python X: Conceptual Approach to Programming ; Python : Language and Approach |
| text2query | is python a low level language? ; what is the primary idea of python? ; is python a programming language? |

These are all the available prefixes:

text2reddit
question2title
answer2question
abstract2title
review2title
news2title
text2query
question2question

For the datasets and weights for the different prefixes, see data_config.json in this repository.
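To compare what the different prefixes produce for the same text, the prefixed inputs can be built in one pass. This is a minimal sketch: the prefix names and the "prefix: text" input format come from this README, while the helper names `with_prefix` and `build_prefixed_inputs` are hypothetical.

```python
from typing import Dict

# Prefix names as listed in this README
AVAILABLE_PREFIXES = [
    "text2reddit",
    "question2title",
    "answer2question",
    "abstract2title",
    "review2title",
    "news2title",
    "text2query",
    "question2question",
]


def with_prefix(prefix: str, text: str) -> str:
    """Format one model input as "<prefix>: <text>", rejecting unknown prefixes."""
    if prefix not in AVAILABLE_PREFIXES:
        raise ValueError(f"Unknown prefix: {prefix!r}")
    return prefix + ": " + text


def build_prefixed_inputs(text: str) -> Dict[str, str]:
    """Map every available prefix to the corresponding model input for `text`."""
    return {prefix: with_prefix(prefix, text) for prefix in AVAILABLE_PREFIXES}


if __name__ == "__main__":
    for prefix, model_input in build_prefixed_inputs("Python is a programming language.").items():
        print(f"{prefix} -> {model_input}")
```

Each resulting string can then be tokenized and passed to model.generate() in the same way as in the usage example.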

📄 License

This project is licensed under the Apache-2.0 license.
