doc2query/all-t5-base-v1
This is a doc2query model based on T5 (also known as docT5query), which can be used for document expansion and domain-specific training data generation.
Quick Start
This model can be used for the following purposes:
- Document expansion: Generate 20-40 queries for your paragraphs and index both the paragraphs and the generated queries in a standard BM25 index like Elasticsearch, OpenSearch, or Lucene. The generated queries help to close the lexical gap of lexical search, as they contain synonyms. Additionally, the model re-weights words, giving important words a higher weight even if they appear seldom in a paragraph. In our BEIR paper, we showed that BM25 + docT5query is a powerful search engine. The BEIR repository contains an example of how to use docT5query with Pyserini.
- Domain-specific training data generation: The model can generate training data for learning an embedding model. On SBERT.net, there is an example of using the model to generate (query, text) pairs for a given collection of unlabeled texts. These pairs can then be used to train powerful dense embedding models. A sketch of both workflows follows this list.
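As a minimal sketch of both workflows, the snippet below generates queries for a passage and then (a) appends them to the passage text for BM25 indexing and (b) pairs each query with the passage as (query, text) training examples. The passage string and the number of queries are illustrative choices, not values prescribed by this model card.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/all-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

passage = "Python is an interpreted, high-level and general-purpose programming language."

# Generate queries for the passage (sampling, so results vary between runs)
input_ids = tokenizer.encode(passage, max_length=384, truncation=True, return_tensors='pt')
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=10)
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# (a) Document expansion: index the passage together with its generated queries in BM25
expanded_document = passage + " " + " ".join(queries)

# (b) Training data generation: build (query, text) pairs for training an embedding model
training_pairs = [(q, passage) for q in queries]
```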
Features
- Document Expansion: Helps improve search performance by generating relevant queries.
- Training Data Generation: Enables the creation of training data for embedding models.
Installation
N/A
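As an assumption not stated in the model card, the usage example below needs the Hugging Face transformers library, sentencepiece (required by T5Tokenizer), and PyTorch:

```bash
pip install transformers sentencepiece torch
```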
Usage Examples
Basic Usage
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/all-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

# Tokenize the paragraph, truncating it to the model's input limit of 384 word pieces
input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')

# Sample 5 queries of up to 64 word pieces each
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)

print("Text:")
print(text)

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')
```
Note: model.generate() is non-deterministic; it produces different queries each time you run it.
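If reproducible queries are needed, one option (not part of the original example) is to seed PyTorch's random number generator before sampling. The snippet below reuses the model and input_ids from the example above:

```python
import torch

# Fixing the seed makes the sampled queries reproducible across runs
torch.manual_seed(42)
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)
```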
Documentation
Training
This model was fine-tuned from [google/t5-v1_1-base](https://huggingface.co/google/t5-v1_1-base) for 570k training steps. For the training script, see train_script.py in this repository.
The input text was truncated to 384 word pieces. Output text was generated up to 64 word pieces.
This model was trained on a large collection of datasets. For the exact dataset names and weights, see data_config.json in this repository. Most of the datasets are available at [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers).
The datasets include:
- (title, body) pairs from [Reddit](https://huggingface.co/datasets/sentence-transformers/reddit-title-body)
- (title, body) pairs and (title, answer) pairs from StackExchange and Yahoo Answers!
- (title, review) pairs from Amazon reviews
- (query, paragraph) pairs from MS MARCO, NQ, and GooAQ
- (question, duplicate_question) pairs from Quora and WikiAnswers
- (title, abstract) pairs from S2ORC
Prefix
This model was trained without a prefix. In contrast to [doc2query/all-with_prefix-t5-base-v1](https://huggingface.co/doc2query/all-with_prefix-t5-base-v1), you cannot specify the type of transformation (answer2question, review2title, etc.) you want, which can lead to a mixture of output values.
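To illustrate the difference: this model takes the raw paragraph as input, while the with_prefix variant expects a task prefix prepended to the input text. The exact prefix format below ("answer2question: " followed by the paragraph) is an assumption for illustration; consult the with_prefix model card for the authoritative format.

```python
# This model (no prefix): pass the paragraph as-is; the output type is not controllable
input_ids = tokenizer.encode(text, max_length=384, truncation=True, return_tensors='pt')

# doc2query/all-with_prefix-t5-base-v1 (assumed input format): prepend the desired
# transformation, e.g. "answer2question" or "review2title", followed by ": "
prefixed_text = "answer2question: " + text
```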
Technical Details
- Model Architecture: Based on the T5 architecture (fine-tuned from google/t5-v1_1-base).
- Training Steps: 570k steps.
- Input/Output Truncation: Input text truncated to 384 word pieces, output text generated up to 64 word pieces.
License
This project is licensed under the Apache-2.0 license.