msmarco-t5-base-v1 open-source model: free document expansion and training data generation

Msmarco T5 Base V1

Developed by doc2query

A T5-based doc2query model for document expansion and training data generation

Text generation

Transformers

Language: English

License: Apache-2.0

Tags: #document-expansion #query-generation #semantic-retrieval-enhancement

Downloads: 112

Released: 3/2/2022

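Model overview

The model is based on the T5 architecture and is mainly used for document expansion and domain-specific training data generation. It generates multiple relevant queries for an input text, which can improve the performance of a retrieval system.

Model features

Document expansion

Generates 20 to 40 queries per passage; indexing the passage together with its generated queries improves retrieval.

Training data generation

Can produce training data for embedding models by generating (query, text) pairs from unlabeled text.

Bridging the semantic gap

The generated queries contain synonyms of the passage's terms, which helps close the vocabulary gap in lexical retrieval.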

Model capabilities

Text generation

Query generation

Document expansion

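Use cases

Information retrieval

Search engine enhancement

Index the generated queries together with the original document to improve BM25 retrieval.

Evaluation on the BEIR benchmark showed that BM25 combined with docT5query is a strong search engine.

Machine learning

Training data generation

Generate (query, text) pairs from unlabeled text and use them to train dense embedding models.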
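
The pairing step described above can be sketched as follows. This is a minimal illustration, not the SBERT.net example: `build_training_pairs` and `fake_generator` are hypothetical names introduced here, and the stand-in generator only exists so the sketch runs without downloading a model. In practice the generator would wrap a call to this model.

```python
from typing import Callable, Iterable, List, Tuple


def build_training_pairs(
    texts: Iterable[str],
    generate_queries: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Return (query, text) pairs for every non-empty text."""
    pairs = []
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip empty or whitespace-only entries
        for query in generate_queries(text):
            query = query.strip()
            if query:
                pairs.append((query, text))
    return pairs


# Hypothetical stand-in for a real query generator (assumption for this sketch).
def fake_generator(text: str) -> List[str]:
    first_word = text.split()[0].lower()
    return [f"what is {first_word}", f"{first_word} definition"]


corpus = ["Python is a programming language.", "  ", "BM25 is a ranking function."]
pairs = build_training_pairs(corpus, fake_generator)
for query, text in pairs:
    print(f"{query!r} -> {text!r}")
```

Passing the generator in as a callable keeps the pairing logic independent of how the queries are produced, so the same function could be used with any doc2query model.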

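🚀 doc2query/msmarco-t5-base-v1

This is a doc2query model based on T5 (also known as docT5query). It can be used to address the lexical gap in document search and to generate domain-specific training data for training strong dense embedding models.

🚀 Quick start

The model can be used in two main scenarios:

Document expansion: Generate 20 to 40 queries for each passage, then index the passage together with its generated queries in a standard BM25 index such as Elasticsearch, OpenSearch, or Lucene. Because the generated queries contain synonyms, they help close the lexical gap in keyword search. They also re-weight words: an important word receives a higher weight even if it appears only rarely in the passage. In our BEIR paper, we showed that BM25 + docT5query is a strong search engine. The BEIR repository contains an example of how to use docT5query with Pyserini.
Domain-specific training data generation: The model can generate training data for learning an embedding model. SBERT.net has an example of how to use the model to generate (query, text) pairs for a given collection of unlabeled texts. These pairs can then be used to train strong dense embedding models.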
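
To illustrate the document expansion scenario, here is a small self-contained sketch. It assumes a toy in-memory BM25 scorer written for this example (it is not Elasticsearch, Lucene, or Pyserini), and the "generated" queries are hand-written placeholders rather than real model output. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_passage(passage, queries):
    """Append the generated queries to the passage text before indexing."""
    return passage + " " + " ".join(queries)


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with plain BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


passage = "Python is an interpreted, high-level and general-purpose programming language."
other = "The python is a large non-venomous snake found in Africa, Asia and Australia."
# Placeholder queries written by hand for this illustration (assumption).
generated = ["what is python coding", "what kind of language is python"]

query = "python coding"
plain = bm25_scores(query, [passage, other])
expanded = bm25_scores(query, [expand_passage(passage, generated), other])
print("plain   :", plain)
print("expanded:", expanded)
```

The word "coding" never appears in the original passage, so only the expanded version can match it. The repeated "python" in the appended queries also raises that term's frequency, which is the re-weighting effect described above.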

💻 Usage example

Basic usage

from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = 'doc2query/msmarco-t5-base-v1'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

text = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

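# Tokenize the passage, truncating it to at most 320 tokens.
input_ids = tokenizer.encode(text, max_length=320, truncation=True, return_tensors='pt')

# Sample 5 queries with nucleus (top-p) sampling.
# Because do_sample=True, the output differs from run to run.
outputs = model.generate(
    input_ids=input_ids,
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=5)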

print("Text:")
print(text)

print("\nGenerated Queries:")
for i, output in enumerate(outputs, start=1):
    query = tokenizer.decode(output, skip_special_tokens=True)
    print(f'{i}: {query}')
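
Because the queries are sampled, the same query can come back more than once. A possible post-processing step, sketched here under that assumption, is to drop repeats before indexing or pairing. `dedupe_queries` is a hypothetical helper, not part of the model or of Transformers.

```python
def dedupe_queries(queries):
    """Drop repeated queries, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for query in queries:
        # Normalize only for comparison; keep the first spelling encountered.
        key = " ".join(query.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(query.strip())
    return unique


sampled = ["what is python", "What is  Python ", "python language definition", ""]
print(dedupe_queries(sampled))
```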