MultiTabQA Base
Model Overview
MultiTabQA is built on the TAPEX (BART) architecture, with a bidirectional encoder and an autoregressive decoder, and is designed for natural language questions that span multiple tables.
Model Features
Multi-table operation support
Handles complex operations over multiple tables, such as UNION, INTERSECT, EXCEPT, and JOINs.
Table generation
Beyond answering questions, it generates structured tables as answer output.
TAPEX-based architecture
Adopts the TAPEX (BART) architecture, combining the strengths of bidirectional encoding and autoregressive decoding.
Model Capabilities
Multi-table question answering
Table generation
SQL query execution
Use Cases
Database querying
Cross-table statistical queries
Answer statistical questions involving multiple tables, such as "How many departments are led by heads who are not mentioned?"
Generate tables containing the statistical results
Complex relational queries
Handle complex queries involving relationships across multiple tables
Generate query results that reflect the table relationships
🚀 MultiTabQA (base-sized model)
MultiTabQA is a multi-table question answering model that generates an answer table from multiple input tables. It addresses real-world queries that require operations over several tables, providing a stronger solution for the table QA field.
🚀 Quick Start
MultiTabQA was proposed in the paper MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering by Vaishali Pal, Andrew Yates, Evangelos Kanoulas, and Maarten de Rijke. The original code repository can be found here.
✨ Key Features
- MultiTabQA is a table question answering (tableQA) model that generates an answer table from multiple input tables.
- It can handle multi-table operators such as UNION, INTERSECT, EXCEPT, and JOINs.
- It is based on the TAPEX (BART) architecture, which has a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd
tokenizer = AutoTokenizer.from_pretrained("vaishali/multitabqa-base")
model = AutoModelForSeq2SeqLM.from_pretrained("vaishali/multitabqa-base")
question = "How many departments are led by heads who are not mentioned?"
table_names = ['department', 'management']
tables = [{"columns":["Department_ID","Name","Creation","Ranking","Budget_in_Billions","Num_Employees"],
"index":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14],
"data":[
[1,"State","1789",1,9.96,30266.0],
[2,"Treasury","1789",2,11.1,115897.0],
[3,"Defense","1947",3,439.3,3000000.0],
[4,"Justice","1870",4,23.4,112557.0],
[5,"Interior","1849",5,10.7,71436.0],
[6,"Agriculture","1889",6,77.6,109832.0],
[7,"Commerce","1903",7,6.2,36000.0],
[8,"Labor","1913",8,59.7,17347.0],
[9,"Health and Human Services","1953",9,543.2,67000.0],
[10,"Housing and Urban Development","1965",10,46.2,10600.0],
[11,"Transportation","1966",11,58.0,58622.0],
[12,"Energy","1977",12,21.5,116100.0],
[13,"Education","1979",13,62.8,4487.0],
[14,"Veterans Affairs","1989",14,73.2,235000.0],
[15,"Homeland Security","2002",15,44.6,208000.0]
]
},
{"columns":["department_ID","head_ID","temporary_acting"],
"index":[0,1,2,3,4],
"data":[
[2,5,"Yes"],
[15,4,"Yes"],
[2,6,"Yes"],
[7,3,"No"],
[11,10,"No"]
]
}]
# build DataFrames from the dicts above (orient="split" layout: columns / index / data);
# pd.read_json expects a JSON string or buffer, not a dict, so construct directly
input_tables = [pd.DataFrame(data=t["data"], columns=t["columns"], index=t["index"]) for t in tables]
# flatten the model inputs in the format:
# question + " <table_name> : " + table_name1 + flattened_table1 + " <table_name> : " + table_name2 + ...
model_input_string = """How many departments are led by heads who are not mentioned? <table_name> : department col : Department_ID | Name | Creation | Ranking | Budget_in_Billions | Num_Employees row 1 : 1 | State | 1789 | 1 | 9.96 | 30266 row 2 : 2 | Treasury | 1789 | 2 | 11.1 | 115897 row 3 : 3 | Defense | 1947 | 3 | 439.3 | 3000000 row 4 : 4 | Justice | 1870 | 4 | 23.4 | 112557 row 5 : 5 | Interior | 1849 | 5 | 10.7 | 71436 row 6 : 6 | Agriculture | 1889 | 6 | 77.6 | 109832 row 7 : 7 | Commerce | 1903 | 7 | 6.2 | 36000 row 8 : 8 | Labor | 1913 | 8 | 59.7 | 17347 row 9 : 9 | Health and Human Services | 1953 | 9 | 543.2 | 67000 row 10 : 10 | Housing and Urban Development | 1965 | 10 | 46.2 | 10600 row 11 : 11 | Transportation | 1966 | 11 | 58.0 | 58622 row 12 : 12 | Energy | 1977 | 12 | 21.5 | 116100 row 13 : 13 | Education | 1979 | 13 | 62.8 | 4487 row 14 : 14 | Veterans Affairs | 1989 | 14 | 73.2 | 235000 row 15 : 15 | Homeland Security | 2002 | 15 | 44.6 | 208000 <table_name> : management col : department_ID | head_ID | temporary_acting row 1 : 2 | 5 | Yes row 2 : 15 | 4 | Yes row 3 : 2 | 6 | Yes row 4 : 7 | 3 | No row 5 : 11 | 10 | No"""
inputs = tokenizer(model_input_string, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# 'col : count(*) row 1 : 11'
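For longer tables, writing the flattened input string by hand is error-prone. The helper below is a sketch of the same linearization format; the names `linearize_table` and `build_model_input` are illustrative and not part of the original repository, and numeric formatting (e.g. trailing `.0` on floats) may differ slightly from the hand-written string above.

```python
import pandas as pd

def linearize_table(df: pd.DataFrame) -> str:
    # produce "col : c1 | c2 ... row 1 : v11 | v12 ... row 2 : ..."
    parts = ["col : " + " | ".join(str(c) for c in df.columns)]
    for i, (_, row) in enumerate(df.iterrows(), start=1):
        parts.append(f"row {i} : " + " | ".join(str(v) for v in row))
    return " ".join(parts)

def build_model_input(question: str, table_names, dfs) -> str:
    # question + " <table_name> : " + name1 + flattened_table1 + ...
    chunks = [question]
    for name, df in zip(table_names, dfs):
        chunks.append(f"<table_name> : {name} {linearize_table(df)}")
    return " ".join(chunks)

df = pd.DataFrame({"department_ID": [2, 15], "head_ID": [5, 4]})
print(linearize_table(df))
# col : department_ID | head_ID row 1 : 2 | 5 row 2 : 15 | 4
```

The resulting string can be passed to the tokenizer exactly like `model_input_string` above.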
Fine-tuning
The fine-tuning scripts can be found here.
📚 Documentation
Model Description
MultiTabQA is a table question answering model that generates an answer table from multiple input tables. It can handle multi-table operators such as UNION, INTERSECT, EXCEPT, and JOINs.
MultiTabQA is based on the TAPEX (BART) architecture, which has a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.
Intended Uses
You can use the raw model to execute SQL queries over multiple input tables. The model has been fine-tuned on the Spider dataset and can answer natural language questions over multiple input tables.
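As a sanity check, the expected answer to the Quick Start question can be reproduced with plain pandas, independently of the model. This is a sketch using the two tables from the usage example; the SQL in the comment is one reasonable reading of the question, not taken from the paper.

```python
import pandas as pd

# the department table has Department_ID 1..15; the management table
# mentions department_IDs {2, 15, 2, 7, 11}
department = pd.DataFrame({"Department_ID": range(1, 16)})
management = pd.DataFrame({"department_ID": [2, 15, 2, 7, 11]})

# roughly: SELECT count(*) FROM department
#          WHERE Department_ID NOT IN (SELECT department_ID FROM management)
count = int((~department["Department_ID"].isin(management["department_ID"])).sum())
print(count)  # 11
```

This matches the model output `col : count(*) row 1 : 11` shown above.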
Model Information
| Property | Details |
|---|---|
| Model type | Multi-table question answering model based on the TAPEX (BART) architecture |
| Training data | vaishali/spider-tableQA dataset |
Citation
@inproceedings{pal-etal-2023-multitabqa,
title = "{M}ulti{T}ab{QA}: Generating Tabular Answers for Multi-Table Question Answering",
author = "Pal, Vaishali and
Yates, Andrew and
Kanoulas, Evangelos and
de Rijke, Maarten",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.348",
doi = "10.18653/v1/2023.acl-long.348",
pages = "6322--6334",
abstract = "Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often over multiple tables in a relational database or web page. Single table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising of 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery.",
}
📄 License
This project is released under the MIT License.