MultiTabQA Base
Model Overview
MultiTabQA is built on the TAPEX (BART) architecture, with a bidirectional encoder and an autoregressive decoder, and is designed for natural language questions that involve multiple tables.
Model Features
Multi-table operation support
Handles complex operations over multiple tables, such as UNION, INTERSECT, EXCEPT, and JOINs.
Table generation
Answers are not limited to short strings: the model generates structured tables as output.
TAPEX-based architecture
Adopts the TAPEX (BART) architecture, combining the strengths of bidirectional encoding and autoregressive decoding.
Model Capabilities
Multi-table question answering
Table generation
SQL query execution
Use Cases
Database queries
Cross-table aggregate queries
Answering aggregate questions that span multiple tables, such as "How many departments are led by heads who are not mentioned?"
Generating a table that contains the aggregate result
Complex relational queries
Handling complex queries over relationships between multiple tables
Generating query results that reflect those table relationships
🚀 MultiTabQA: Multi-Table Question Answering (base-sized model)
MultiTabQA is a multi-table question answering model that generates a tabular answer from multiple input tables. It addresses real-world queries that require operations across several tables, making it a more capable solution for table question answering.
🚀 Quick Start
MultiTabQA was proposed by Vaishali Pal, Andrew Yates, Evangelos Kanoulas, and Maarten de Rijke in the paper MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering. The original code repository can be found here.
✨ Key Features
- MultiTabQA is a table question answering (tableQA) model that generates a tabular answer from multiple input tables.
- It handles multi-table operators such as UNION, INTERSECT, EXCEPT, and JOINs (illustrated in the pandas sketch below).
- It is based on the TAPEX (BART) architecture, which combines a bidirectional (BERT-like) encoder with an autoregressive (GPT-like) decoder.
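For intuition, the multi-table operators listed above correspond to standard relational operations. The following minimal pandas sketch (illustrative only, not part of the model or the original card) shows what UNION, INTERSECT, EXCEPT, and JOIN compute over two toy tables:

import pandas as pd

a = pd.DataFrame({"id": [1, 2, 3]})
b = pd.DataFrame({"id": [2, 3, 4]})

union = pd.concat([a, b]).drop_duplicates()   # UNION: rows in a or b
intersect = a.merge(b)                        # INTERSECT: rows present in both
except_ = a[~a["id"].isin(b["id"])]           # EXCEPT: rows in a but not in b
join = a.merge(b, on="id", how="inner")       # JOIN on the shared key

print(union["id"].tolist())      # [1, 2, 3, 4]
print(intersect["id"].tolist())  # [2, 3]
print(except_["id"].tolist())    # [1]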
📦 Installation
The original documentation does not cover installation, so no official steps are shown.
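That said, the usage example below only depends on the transformers, pandas, and torch packages; installing them with pip (an assumption based on the example's imports, not on any official instructions) would look like:

pip install transformers pandas torch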
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd
import json
tokenizer = AutoTokenizer.from_pretrained("vaishali/multitabqa-base")
model = AutoModelForSeq2SeqLM.from_pretrained("vaishali/multitabqa-base")
question = "How many departments are led by heads who are not mentioned?"
table_names = ['department', 'management']
tables=[{"columns":["Department_ID","Name","Creation","Ranking","Budget_in_Billions","Num_Employees"],
"index":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14],
"data":[
[1,"State","1789",1,9.96,30266.0],
[2,"Treasury","1789",2,11.1,115897.0],
[3,"Defense","1947",3,439.3,3000000.0],
[4,"Justice","1870",4,23.4,112557.0],
[5,"Interior","1849",5,10.7,71436.0],
[6,"Agriculture","1889",6,77.6,109832.0],
[7,"Commerce","1903",7,6.2,36000.0],
[8,"Labor","1913",8,59.7,17347.0],
[9,"Health and Human Services","1953",9,543.2,67000.0],
[10,"Housing and Urban Development","1965",10,46.2,10600.0],
[11,"Transportation","1966",11,58.0,58622.0],
[12,"Energy","1977",12,21.5,116100.0],
[13,"Education","1979",13,62.8,4487.0],
[14,"Veterans Affairs","1989",14,73.2,235000.0],
[15,"Homeland Security","2002",15,44.6,208000.0]
]
},
{"columns":["department_ID","head_ID","temporary_acting"],
"index":[0,1,2,3,4],
"data":[
[2,5,"Yes"],
[15,4,"Yes"],
[2,6,"Yes"],
[7,3,"No"],
[11,10,"No"]
]
}]
# the tables above are Python dicts, so serialize them to JSON strings first
input_tables = [pd.read_json(json.dumps(table), orient="split") for table in tables]
# flatten the model inputs in the format: query + " " + <table_name> : table_name1 + flattened_table1 + <table_name> : table_name2 + flattened_table2 + ...
# flattened_input = question + " " + " ".join(f"<table_name> : {name} {linearize_table(tbl)}" for name, tbl in zip(table_names, input_tables))
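# A hypothetical helper (not from the original card) that reproduces the
# flattened format of the hardcoded string below. The separators are inferred
# from that string; numeric formatting (e.g. 30266 vs 30266.0) may differ
# slightly from the reference linearization.
def linearize_table(table):
    lines = ["col : " + " | ".join(str(c) for c in table.columns)]
    for i, (_, row) in enumerate(table.iterrows(), start=1):
        lines.append(f"row {i} : " + " | ".join(str(v) for v in row))
    return " ".join(lines)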
model_input_string = """How many departments are led by heads who are not mentioned? <table_name> : department col : Department_ID | Name | Creation | Ranking | Budget_in_Billions | Num_Employees row 1 : 1 | State | 1789 | 1 | 9.96 | 30266 row 2 : 2 | Treasury | 1789 | 2 | 11.1 | 115897 row 3 : 3 | Defense | 1947 | 3 | 439.3 | 3000000 row 4 : 4 | Justice | 1870 | 4 | 23.4 | 112557 row 5 : 5 | Interior | 1849 | 5 | 10.7 | 71436 row 6 : 6 | Agriculture | 1889 | 6 | 77.6 | 109832 row 7 : 7 | Commerce | 1903 | 7 | 6.2 | 36000 row 8 : 8 | Labor | 1913 | 8 | 59.7 | 17347 row 9 : 9 | Health and Human Services | 1953 | 9 | 543.2 | 67000 row 10 : 10 | Housing and Urban Development | 1965 | 10 | 46.2 | 10600 row 11 : 11 | Transportation | 1966 | 11 | 58.0 | 58622 row 12 : 12 | Energy | 1977 | 12 | 21.5 | 116100 row 13 : 13 | Education | 1979 | 13 | 62.8 | 4487 row 14 : 14 | Veterans Affairs | 1989 | 14 | 73.2 | 235000 row 15 : 15 | Homeland Security | 2002 | 15 | 44.6 | 208000 <table_name> : management col : department_ID | head_ID | temporary_acting row 1 : 2 | 5 | Yes row 2 : 15 | 4 | Yes row 3 : 2 | 6 | Yes row 4 : 7 | 3 | No row 5 : 11 | 10 | No"""
inputs = tokenizer(model_input_string, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# 'col : count(*) row 1 : 11'
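The decoded output is itself a flattened table ("col : ... row 1 : ..."). A small illustrative parser (an assumption based only on the example output above, not an official utility) can turn it back into a DataFrame:

import re
import pandas as pd

def parse_answer_table(flat):
    # Split "col : c1 | c2 row 1 : v1 | v2 row 2 : ..." into header and rows.
    header, *rows = re.split(r"\s*row \d+ :\s*", flat)
    columns = [c.strip() for c in header.replace("col :", "", 1).split("|")]
    data = [[v.strip() for v in r.split("|")] for r in rows]
    return pd.DataFrame(data, columns=columns)

print(parse_answer_table("col : count(*) row 1 : 11"))
#   count(*)
# 0       11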
Advanced Usage
The original documentation does not provide an advanced usage example, so none is shown.
Fine-tuning
The fine-tuning script can be found here.
📚 Documentation
Model Description
MultiTabQA is a table question answering model that generates a tabular answer from multiple input tables. It handles multi-table operators such as UNION, INTERSECT, EXCEPT, and JOINs.
MultiTabQA is based on the TAPEX (BART) architecture, which consists of a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.
Intended Uses
You can use the raw model to execute SQL queries over multiple input tables. This checkpoint was fine-tuned on the Spider dataset to answer natural language questions over multiple input tables.
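Since the raw model is trained on SQL execution, an SQL query can in principle take the place of the natural-language question in the same flattened input format. Below is a minimal sketch reusing the variables from the usage example above; the exact pre-training prompt format is an assumption here, not something the original card specifies:

# Hypothetical: swap the natural-language question for its SQL equivalent,
# keeping the same "<table_name> : ..." flattened-table suffix.
flattened_tables = model_input_string[len(question):]
sql = "SELECT count(*) FROM department WHERE Department_ID NOT IN (SELECT department_ID FROM management)"
inputs = tokenizer(sql + flattened_tables, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))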
Model Information
Attribute | Details
---|---
Model type | Multi-table question answering model based on the TAPEX (BART) architecture
Training data | vaishali/spider-tableQA dataset
Citation
@inproceedings{pal-etal-2023-multitabqa,
title = "{M}ulti{T}ab{QA}: Generating Tabular Answers for Multi-Table Question Answering",
author = "Pal, Vaishali and
Yates, Andrew and
Kanoulas, Evangelos and
de Rijke, Maarten",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.348",
doi = "10.18653/v1/2023.acl-long.348",
pages = "6322--6334",
abstract = "Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often over multiple tables in a relational database or web page. Single table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising of 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery.",
}
📄 License
This project is released under the MIT license.