MultiTabQA-base-geoquery: an open-source table question answering model that generates answer tables using multi-table operations

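Multitabqa Base Geoquery

Developed by vaishali

MultiTabQA is a table question answering model that generates an answer table from multiple input tables. It supports multi-table operations such as UNION, INTERSECT, EXCEPT, and JOIN.

Question Answering

Transformers

English License: MIT #multi-table QA #table generation #SQL operations

Downloads: 14

Release date: 7/18/2023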

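Model overview

MultiTabQA is based on the TAPEX (BART) architecture, with a bidirectional encoder and an autoregressive decoder, and is designed for natural language questions that involve multiple tables.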

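Model features

Multi-table operation support

Handles multi-table operations such as UNION, INTERSECT, EXCEPT, and JOIN

Table generation

Answers questions and can also return the answer in table form

Pre-training dataset

Pre-trained on a dataset of 132,645 SQL queries paired with tabular answers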

Model capabilities

Multi-table question answering

Table generation

SQL execution

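Use cases

Database queries

Department head analysis

Determine which departments are led by heads who are not mentioned

Can generate a table containing the aggregated result

Business intelligence

Cross-table data analysis

Extract and combine information from multiple related tables

Generates a table of the combined data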

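🚀 MultiTabQA (base-sized model)

MultiTabQA is a model for multi-table question answering: it generates an answer table from multiple input tables. This addresses a limitation of single-table QA models, which cannot handle queries that span several tables.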

🚀 Quick start

Model description

MultiTabQA was proposed in MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering by Vaishali Pal, Andrew Yates, Evangelos Kanoulas, and Maarten de Rijke. The original code repository can be found here.

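MultiTabQA is a table question answering (tableQA) model that generates an answer table from multiple input tables. It can handle multi-table operators such as UNION, INTERSECT, EXCEPT, and JOIN. The model is based on the TAPEX (BART) architecture, with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.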
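To make the multi-table operators concrete, the sketch below runs each of them with Python's built-in `sqlite3` module on two toy tables. The table names `left_t` and `right_t` and their contents are hypothetical and chosen only for illustration; the point is that every operator returns a table (column names plus rows), which is the kind of output the model is meant to generate.

```python
import sqlite3

# Two hypothetical toy tables that share some rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_t  (id INTEGER, name TEXT);
    CREATE TABLE right_t (id INTEGER, name TEXT);
    INSERT INTO left_t  VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO right_t VALUES (2, 'y'), (3, 'z'), (4, 'w');
""")


def run(sql):
    """Execute a query and return the answer table as (column names, rows)."""
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


# Set operators combine the rows of two queries.
union = run("SELECT id FROM left_t UNION SELECT id FROM right_t ORDER BY id")
intersect = run("SELECT id FROM left_t INTERSECT SELECT id FROM right_t ORDER BY id")
except_ = run("SELECT id FROM left_t EXCEPT SELECT id FROM right_t ORDER BY id")

# A join combines the columns of matching rows.
join = run(
    "SELECT l.id, l.name, r.name FROM left_t AS l "
    "JOIN right_t AS r ON l.id = r.id ORDER BY l.id"
)

for label, (columns, rows) in [("UNION", union), ("INTERSECT", intersect),
                               ("EXCEPT", except_), ("JOIN", join)]:
    print(label, columns, rows)
```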

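Intended uses

You can use the raw model to execute SQL queries over multiple input tables. The model has been fine-tuned on the GeoQuery dataset and can answer natural language questions over multiple input tables.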
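As a point of reference, a model answer can be compared with what a database engine returns for the same question. The sketch below does this with `sqlite3` for the department question used in the usage example of this card. The table and column names and the ID values are taken from that example; the SQL query is an assumed reading of the question, not something stated in the card.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (Department_ID INTEGER)")
conn.execute("CREATE TABLE management (department_ID INTEGER, head_ID INTEGER)")

# Department IDs 1..15 and the five management rows from the usage example.
conn.executemany("INSERT INTO department VALUES (?)", [(i,) for i in range(1, 16)])
conn.executemany(
    "INSERT INTO management VALUES (?, ?)",
    [(2, 5), (15, 4), (2, 6), (7, 3), (11, 10)],
)

# Assumed SQL reading of:
# "How many departments are led by heads who are not mentioned?"
sql = """
    SELECT count(*) FROM department
    WHERE Department_ID NOT IN (SELECT department_ID FROM management)
"""
rows = conn.execute(sql).fetchall()
reference_count = rows[0][0]
print(reference_count)  # 11: fifteen departments minus the four distinct IDs in management
```

Under this reading, the reference answer agrees with the value in the output string shown at the end of the usage example.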

How to use

Here is how to use this model with the transformers library:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd

tokenizer = AutoTokenizer.from_pretrained("vaishali/multitabqa-base-geoquery")
model = AutoModelForSeq2SeqLM.from_pretrained("vaishali/multitabqa-base-geoquery")

question = "How many departments are led by heads who are not mentioned?"
table_names = ['department', 'management']
tables=[{"columns":["Department_ID","Name","Creation","Ranking","Budget_in_Billions","Num_Employees"],
                  "index":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14],
                  "data":[
                          [1,"State","1789",1,9.96,30266.0],
                          [2,"Treasury","1789",2,11.1,115897.0],
                          [3,"Defense","1947",3,439.3,3000000.0],
                          [4,"Justice","1870",4,23.4,112557.0],
                          [5,"Interior","1849",5,10.7,71436.0],
                          [6,"Agriculture","1889",6,77.6,109832.0],
                          [7,"Commerce","1903",7,6.2,36000.0],
                          [8,"Labor","1913",8,59.7,17347.0],
                          [9,"Health and Human Services","1953",9,543.2,67000.0],
                          [10,"Housing and Urban Development","1965",10,46.2,10600.0],
                          [11,"Transportation","1966",11,58.0,58622.0],
                          [12,"Energy","1977",12,21.5,116100.0],
                          [13,"Education","1979",13,62.8,4487.0],
                          [14,"Veterans Affairs","1989",14,73.2,235000.0],
                          [15,"Homeland Security","2002",15,44.6,208000.0]
                        ]
                  },
                  {"columns":["department_ID","head_ID","temporary_acting"],
                    "index":[0,1,2,3,4],
                    "data":[
                            [2,5,"Yes"],
                            [15,4,"Yes"],
                            [2,6,"Yes"],
                            [7,3,"No"],
                            [11,10,"No"]
                          ]
                  }]

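# Each table is a dict in pandas "split" layout, so build the DataFrame directly
# (pd.read_json expects a JSON string or a file, not a dict).
input_tables = [pd.DataFrame(t["data"], index=t["index"], columns=t["columns"]) for t in tables]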

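# Flatten the model input in the format:
#   question + " " + "<table_name> : " + table_name1 + " " + flattened_table1 + " <table_name> : " + table_name2 + " " + flattened_table2 + ...
# where each flattened table reads "col : c1 | c2 | ... row 1 : v1 | v2 | ... row 2 : ...".
# linearize_table is not defined in this snippet; in pseudocode:
# flattened_input = question + " " + " ".join(f"<table_name> : {name} {linearize_table(table)}" for name, table in zip(table_names, input_tables))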
model_input_string = """How many departments are led by heads who are not mentioned? <table_name> : department col : Department_ID | Name | Creation | Ranking | Budget_in_Billions | Num_Employees row 1 : 1 | State | 1789 | 1 | 9.96 | 30266 row 2 : 2 | Treasury | 1789 | 2 | 11.1 | 115897 row 3 : 3 | Defense | 1947 | 3 | 439.3 | 3000000 row 4 : 4 | Justice | 1870 | 4 | 23.4 | 112557 row 5 : 5 | Interior | 1849 | 5 | 10.7 | 71436 row 6 : 6 | Agriculture | 1889 | 6 | 77.6 | 109832 row 7 : 7 | Commerce | 1903 | 7 | 6.2 | 36000 row 8 : 8 | Labor | 1913 | 8 | 59.7 | 17347 row 9 : 9 | Health and Human Services | 1953 | 9 | 543.2 | 67000 row 10 : 10 | Housing and Urban Development | 1965 | 10 | 46.2 | 10600 row 11 : 11 | Transportation | 1966 | 11 | 58.0 | 58622 row 12 : 12 | Energy | 1977 | 12 | 21.5 | 116100 row 13 : 13 | Education | 1979 | 13 | 62.8 | 4487 row 14 : 14 | Veterans Affairs | 1989 | 14 | 73.2 | 235000 row 15 : 15 | Homeland Security | 2002 | 15 | 44.6 | 208000 <table_name> : management col : department_ID | head_ID | temporary_acting row 1 : 2 | 5 | Yes row 2 : 15 | 4 | Yes row 3 : 2 | 6 | Yes row 4 : 7 | 3 | No row 5 : 11 | 10 | No"""
inputs = tokenizer(model_input_string, return_tensors="pt")

outputs = model.generate(**inputs)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# 'col : count(*) row 1 : 11'
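The flattening described in the comments above can be sketched with plain Python over the `columns` and `data` fields of the table dicts. This is a minimal sketch and not the authors' implementation: the helper names `linearize_table`, `build_model_input`, and `parse_answer_table` are hypothetical, and values are rendered with `str()`, so a float such as `30266.0` is not shortened to `30266` as in the hand-written string above. The parser assumes cell values contain neither `|` nor text of the form `row N : `.

```python
import re


def linearize_table(name, columns, rows):
    """Flatten one table as '<table_name> : name col : c1 | c2 row 1 : v1 | v2 ...'."""
    parts = [f"<table_name> : {name}", "col : " + " | ".join(str(c) for c in columns)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"row {i} : " + " | ".join(str(v) for v in row))
    return " ".join(parts)


def build_model_input(question, table_names, tables):
    """Join the question and every flattened table with single spaces."""
    flattened = [
        linearize_table(name, table["columns"], table["data"])
        for name, table in zip(table_names, tables)
    ]
    return " ".join([question] + flattened)


def parse_answer_table(text):
    """Split a generated 'col : ... row 1 : ... row 2 : ...' string into (columns, rows)."""
    header, *row_chunks = re.split(r"\s*row \d+ : ", text.strip())
    if header.startswith("col : "):
        header = header[len("col : "):]
    columns = [c.strip() for c in header.split("|")]
    rows = [[v.strip() for v in chunk.split("|")] for chunk in row_chunks]
    return columns, rows


# Small illustrative tables in the same dict layout as the example above.
demo_tables = [
    {"columns": ["Department_ID", "Name"], "data": [[1, "State"], [2, "Treasury"]]},
    {"columns": ["department_ID", "head_ID"], "data": [[2, 5]]},
]
demo_input = build_model_input(
    "How many departments are there?", ["department", "management"], demo_tables
)
print(demo_input)

# Parse the output string shown in the example above into columns and rows.
answer_columns, answer_rows = parse_answer_table("col : count(*) row 1 : 11")
print(answer_columns, answer_rows)
```

Parsed cells are returned as strings; convert them to numbers where a column is known to be numeric.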

How to fine-tune

The fine-tuning scripts can be found here.

BibTeX entry and citation info

@inproceedings{pal-etal-2023-multitabqa,
    title = "{M}ulti{T}ab{QA}: Generating Tabular Answers for Multi-Table Question Answering",
    author = "Pal, Vaishali  and
      Yates, Andrew  and
      Kanoulas, Evangelos  and
      de Rijke, Maarten",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.348",
    doi = "10.18653/v1/2023.acl-long.348",
    pages = "6322--6334",
    abstract = "Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often over multiple tables in a relational database or web page. Single table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising of 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery.",
}