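🚀 T5-based Doc2Query model for document expansion
This is a Doc2Query model based on T5-base, trained on the MS MARCO dataset. This version is the checkpoint released by the original authors, converted to PyTorch format, and can be used directly with pyterrier_doc2query.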
🚀 Quick start
Create the transformer
import pyterrier as pt
pt.init()
from pyterrier_doc2query import Doc2Query
doc2query = Doc2Query('macavaney/doc2query-t5-base-msmarco')
Transform documents
import pandas as pd
doc2query(pd.DataFrame([
{'docno': '0', 'text': 'Hello Terrier!'},
{'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
]))
Index the transformed documents
doc2query.append = True
indexer = pt.IterDictIndexer('./my_index', fields=['text'])
pipeline = doc2query >> indexer
pipeline.index([
{'docno': '0', 'text': 'Hello Terrier!'},
{'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
])
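Setting `append = True` means the generated queries are added to each document's `text` field before it reaches the indexer, so the indexed text carries both the original passage and its expansions. The sketch below mimics that effect in plain Python without calling pyterrier_doc2query. The helper `append_generated_queries`, the example queries, and the newline separator are all assumptions made for illustration, not the library's actual implementation.

```python
# Conceptual sketch only: mimics the effect of append=True in plain Python.
# It does not call pyterrier_doc2query; the helper name, the example queries,
# and the newline separator are assumptions for illustration.

def append_generated_queries(doc, generated_queries):
    """Return a copy of `doc` whose text is followed by the generated queries."""
    expanded = dict(doc)  # copy so the input record is left untouched
    expanded["text"] = "\n".join([doc["text"], *generated_queries])
    return expanded


doc = {"docno": "0", "text": "Hello Terrier!"}
# Hypothetical queries; a real model samples its own.
queries = ["what is terrier", "terrier greeting"]

expanded = append_generated_queries(doc, queries)
print(expanded["text"])
```

Because the expansions become part of the indexed text, a query whose terms appear only in the generated queries can still match the document at retrieval time.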
Expand and index a dataset
dataset = pt.get_dataset('irds:vaswani')
pipeline.index(dataset.get_corpus_iter())
📚 Detailed documentation
Model information
| Property | Details |
|---|---|
| Model type | Doc2Query model based on t5-base |
| Training data | MS MARCO |
| Library | transformers |
Example inputs
- msmarco-passage: "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."
- msmarco-passage-v2: "0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews."
- antique: "A small group of politicians believed strongly that the fact that Saddam Hussien remained in power after the first Gulf War was a signal of weakness to the rest of the world, one that invited attacks and terrorism. Shortly after taking power with George Bush in 2000 and after the attack on 9/11, they were able to use the terrorist attacks to justify war with Iraq on this basis and exaggerated threats of the development of weapons of mass destruction. The military strength of the U.S. and the brutality of Saddam's regime led them to imagine that the military and political victory would be relatively easy."
📖 References