🚀 T5-Based Document Query Expansion Model
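This is a Doc2Query model based on T5-base, trained on the MS MARCO dataset. This version is the checkpoint released by the original authors, converted to PyTorch format, and can be used directly with pyterrier_doc2query.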
🚀 Quick Start
Creating the transformer
import pyterrier as pt
pt.init()
from pyterrier_doc2query import Doc2Query
doc2query = Doc2Query('macavaney/doc2query-t5-base-msmarco')
Transforming documents
import pandas as pd
doc2query(pd.DataFrame([
    {'docno': '0', 'text': 'Hello Terrier!'},
    {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
]))
Indexing the transformed documents
doc2query.append = True  # append the generated queries to the 'text' field so they are indexed
indexer = pt.IterDictIndexer('./my_index', fields=['text'])
pipeline = doc2query >> indexer
pipeline.index([
    {'docno': '0', 'text': 'Hello Terrier!'},
    {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
])
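To make the effect of the append setting concrete, here is a minimal pure-Python sketch of what the indexer receives. It is an illustration, not the pyterrier_doc2query implementation. It assumes that generated queries are newline-separated, and that without append they are placed in a separate `querygen` field. The `expand_docs` helper and the queries in `generated` are hypothetical placeholders written by hand, not real model output.

```python
from typing import Dict, Iterable, Iterator, List

Doc = Dict[str, str]


def expand_docs(docs: Iterable[Doc],
                generated: Dict[str, List[str]],
                append: bool = False) -> Iterator[Doc]:
    """Attach generated queries to each document (illustrative only)."""
    for doc in docs:
        queries = "\n".join(generated.get(doc["docno"], []))
        out = dict(doc)  # copy so the input document is left unmodified
        if append:
            # Assumed append mode: queries become part of the indexed text.
            out["text"] = f'{doc["text"]}\n{queries}' if queries else doc["text"]
        else:
            # Assumed default mode: queries are kept in their own field.
            out["querygen"] = queries
        yield out


docs = [
    {"docno": "0", "text": "Hello Terrier!"},
    {"docno": "1", "text": "Doc2Query expands queries with potentially relevant queries."},
]

# Hand-written placeholder queries; a real model would generate these.
generated = {"0": ["what is terrier", "terrier greeting"]}

separate = list(expand_docs(docs, generated))
appended = list(expand_docs(docs, generated, append=True))
```

In the appended form, a term-based index over `text` can match query-like vocabulary that the original document never contained, which is the point of document expansion.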
Expanding and indexing a dataset
dataset = pt.get_dataset('irds:vaswani')
pipeline.index(dataset.get_corpus_iter())
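Expanding a full corpus can take a long time, so it may be worth trying the pipeline on a small sample first. The sketch below shows one way to cap a corpus iterator using the standard library. The `take` helper and the `fake_corpus` generator are hypothetical names introduced here; `fake_corpus` merely stands in for an iterator of `docno`/`text` dictionaries such as the one returned by `dataset.get_corpus_iter()`.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def take(corpus_iter: Iterable[Dict[str, str]], n: int) -> Iterator[Dict[str, str]]:
    """Return an iterator over at most the first n documents."""
    return islice(corpus_iter, n)


def fake_corpus() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a lazily generated document corpus."""
    for i in range(1000):
        yield {"docno": str(i), "text": f"document {i}"}


# Only the first three documents are ever generated and consumed.
sample = list(take(fake_corpus(), 3))
```

Under the same assumption, a trial run could pass `take(dataset.get_corpus_iter(), 100)` to `pipeline.index`, preferably with an indexer that writes to a separate index path.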
📚 Detailed Documentation
Model information
| Property | Details |
| --- | --- |
| Model type | Doc2Query model based on t5-base |
| Training data | MS MARCO |
| Library name | transformers |
Example data
- msmarco-passage: "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."
- msmarco-passage-v2: "0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews."
- antique: "A small group of politicians believed strongly that the fact that Saddam Hussien remained in power after the first Gulf War was a signal of weakness to the rest of the world, one that invited attacks and terrorism. Shortly after taking power with George Bush in 2000 and after the attack on 9/11, they were able to use the terrorist attacks to justify war with Iraq on this basis and exaggerated threats of the development of weapons of mass destruction. The military strength of the U.S. and the brutality of Saddam's regime led them to imagine that the military and political victory would be relatively easy."
📖 References