MultiTabQA Base GeoQuery
Model Overview
MultiTabQA is built on the TAPEX (BART) architecture, with a bidirectional encoder and an autoregressive decoder, and is designed for natural-language questions that involve multiple tables.
Model Features
Multi-table operation support
Handles multi-table operations such as UNION, INTERSECT, EXCEPT, and JOINs
Table generation
Not only answers questions, but also generates answers in tabular form
Pre-training dataset
Pre-trained on a dataset of 132,645 SQL queries paired with tabular answers
Model Capabilities
Multi-table question answering
Table generation
SQL operation execution
Use Cases
Database querying
Department head analysis
Determine which departments have heads that are not mentioned (an equivalent SQL query is sketched after this list)
Can produce a table containing the aggregated result
Business intelligence
Cross-table data analysis
Extract and combine information from multiple related tables
Produce the combined data as a table
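For intuition, the multi-table operations listed above correspond to SQL executed over several tables. The following is a minimal, self-contained sketch of the kind of query MultiTabQA answers from natural language, run here with Python's sqlite3 on toy department/management tables (the toy data and the query are illustrative assumptions; the model itself consumes a natural-language question and linearized tables, not SQL):

```python
import sqlite3

# Toy tables mirroring the department / management schema used in the usage example below.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (Department_ID INTEGER, Name TEXT);
    CREATE TABLE management (department_ID INTEGER, head_ID INTEGER);
    INSERT INTO department VALUES (1, 'State'), (2, 'Treasury'), (3, 'Defense');
    INSERT INTO management VALUES (2, 5);
""")

# "How many departments are led by heads who are not mentioned?"
# expressed as a multi-table SQL query (a nested NOT IN over two tables).
count = conn.execute("""
    SELECT count(*) FROM department
    WHERE Department_ID NOT IN (SELECT department_ID FROM management)
""").fetchone()[0]
print(count)  # 2 -> 'State' and 'Defense' have no head listed in management
```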
🚀 MultiTabQA (base-sized model)
MultiTabQA is a multi-table question answering model that generates an answer table from multiple input tables, overcoming the limitation of conventional single-table QA models on complex multi-table queries and providing an effective way to query complex tabular data.
🚀 Quick Start
Model Description
MultiTabQA was proposed by Vaishali Pal, Andrew Yates, Evangelos Kanoulas, and Maarten de Rijke in the paper MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering. The original repository can be found here.
MultiTabQA is a table question answering (tableQA) model that generates answer tables from multiple input tables. It can handle multi-table operators such as UNION, INTERSECT, EXCEPT, and JOINs. The model is based on the TAPEX (BART) architecture, with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder.
Intended Uses
You can use the raw model to execute SQL queries over multiple input tables. The model has been fine-tuned on the GeoQuery dataset and can answer natural-language questions over multiple input tables.
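The fine-tuning data is listed at the bottom of this card as vaishali/geoQuery-tableQA. A minimal sketch for inspecting it, assuming the dataset is hosted on the Hugging Face Hub under that ID and loads with the standard datasets API (split and column names are not documented here and may differ):

```python
from datasets import load_dataset

# Assumption: the ID below matches the training-data entry listed at the end of this card.
dataset = load_dataset("vaishali/geoQuery-tableQA")
print(dataset)                         # available splits and columns
first_split = next(iter(dataset.values()))
print(first_split[0])                  # one example record
```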
Usage
Below is example code for using the model with the transformers library:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd
tokenizer = AutoTokenizer.from_pretrained("vaishali/multitabqa-base-geoquery")
model = AutoModelForSeq2SeqLM.from_pretrained("vaishali/multitabqa-base-geoquery")
question = "How many departments are led by heads who are not mentioned?"
table_names = ['department', 'management']
tables=[{"columns":["Department_ID","Name","Creation","Ranking","Budget_in_Billions","Num_Employees"],
"index":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14],
"data":[
[1,"State","1789",1,9.96,30266.0],
[2,"Treasury","1789",2,11.1,115897.0],
[3,"Defense","1947",3,439.3,3000000.0],
[4,"Justice","1870",4,23.4,112557.0],
[5,"Interior","1849",5,10.7,71436.0],
[6,"Agriculture","1889",6,77.6,109832.0],
[7,"Commerce","1903",7,6.2,36000.0],
[8,"Labor","1913",8,59.7,17347.0],
[9,"Health and Human Services","1953",9,543.2,67000.0],
[10,"Housing and Urban Development","1965",10,46.2,10600.0],
[11,"Transportation","1966",11,58.0,58622.0],
[12,"Energy","1977",12,21.5,116100.0],
[13,"Education","1979",13,62.8,4487.0],
[14,"Veterans Affairs","1989",14,73.2,235000.0],
[15,"Homeland Security","2002",15,44.6,208000.0]
]
},
{"columns":["department_ID","head_ID","temporary_acting"],
"index":[0,1,2,3,4],
"data":[
[2,5,"Yes"],
[15,4,"Yes"],
[2,6,"Yes"],
[7,3,"No"],
[11,10,"No"]
]
}]
# Build pandas DataFrames from the "split"-oriented table dictionaries above
input_tables = [pd.DataFrame(data=t["data"], index=t["index"], columns=t["columns"]) for t in tables]
# Flatten the model input in the format:
# question + " " + "<table_name> : " + table_name1 + " " + flattened_table1 + " <table_name> : " + table_name2 + " " + flattened_table2 + ...
model_input_string = """How many departments are led by heads who are not mentioned? <table_name> : department col : Department_ID | Name | Creation | Ranking | Budget_in_Billions | Num_Employees row 1 : 1 | State | 1789 | 1 | 9.96 | 30266 row 2 : 2 | Treasury | 1789 | 2 | 11.1 | 115897 row 3 : 3 | Defense | 1947 | 3 | 439.3 | 3000000 row 4 : 4 | Justice | 1870 | 4 | 23.4 | 112557 row 5 : 5 | Interior | 1849 | 5 | 10.7 | 71436 row 6 : 6 | Agriculture | 1889 | 6 | 77.6 | 109832 row 7 : 7 | Commerce | 1903 | 7 | 6.2 | 36000 row 8 : 8 | Labor | 1913 | 8 | 59.7 | 17347 row 9 : 9 | Health and Human Services | 1953 | 9 | 543.2 | 67000 row 10 : 10 | Housing and Urban Development | 1965 | 10 | 46.2 | 10600 row 11 : 11 | Transportation | 1966 | 11 | 58.0 | 58622 row 12 : 12 | Energy | 1977 | 12 | 21.5 | 116100 row 13 : 13 | Education | 1979 | 13 | 62.8 | 4487 row 14 : 14 | Veterans Affairs | 1989 | 14 | 73.2 | 235000 row 15 : 15 | Homeland Security | 2002 | 15 | 44.6 | 208000 <table_name> : management col : department_ID | head_ID | temporary_acting row 1 : 2 | 5 | Yes row 2 : 15 | 4 | Yes row 3 : 2 | 6 | Yes row 4 : 7 | 3 | No row 5 : 11 | 10 | No"""
inputs = tokenizer(model_input_string, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# 'col : count(*) row 1 : 11'
```
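The model_input_string above is written by hand in the linearization format described in the comment: each table is prefixed with <table_name> : and its name, followed by a col : header and numbered row entries. Below is a minimal sketch of helpers that build such a string from the pandas DataFrames and parse the generated answer back into a DataFrame; the function names and the exact value formatting (e.g. 30266 vs. 30266.0) are assumptions for illustration, not the authors' implementation:

```python
import re
import pandas as pd

def linearize_table(table: pd.DataFrame) -> str:
    # "col : c1 | c2 | ..." followed by "row 1 : v1 | v2 | ..." for each row
    parts = ["col : " + " | ".join(str(c) for c in table.columns)]
    for i, (_, row) in enumerate(table.iterrows(), start=1):
        parts.append(f"row {i} : " + " | ".join(str(v) for v in row.tolist()))
    return " ".join(parts)

def build_model_input(question: str, table_names, tables) -> str:
    # question + " <table_name> : name1 " + flattened_table1 + " <table_name> : name2 " + ...
    flattened = " ".join(
        f"<table_name> : {name} {linearize_table(tab)}"
        for name, tab in zip(table_names, tables)
    )
    return f"{question} {flattened}"

def parse_answer_table(answer: str) -> pd.DataFrame:
    # Inverse of the linearization for a generated answer,
    # e.g. "col : count(*) row 1 : 11" -> a one-column, one-row DataFrame
    header, *rows = re.split(r"\s*row \d+ : ", answer)
    columns = [c.strip() for c in header.replace("col :", "", 1).split("|")]
    data = [[v.strip() for v in r.split("|")] for r in rows]
    return pd.DataFrame(data, columns=columns)

print(parse_answer_table("col : count(*) row 1 : 11"))
#   count(*)
# 0       11
```

Values in the parsed table come back as strings; casting to numeric types, if needed, is left to the caller.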
Fine-tuning
The fine-tuning scripts can be found here.
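The linked scripts are the authoritative reference. As a rough orientation only, a generic sequence-to-sequence fine-tuning loop over (linearized question + tables, linearized answer table) string pairs could be set up with the standard Seq2SeqTrainer API as sketched below; the toy training pair and all hyperparameters are placeholder assumptions, not the authors' settings:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vaishali/multitabqa-base-geoquery")
model = AutoModelForSeq2SeqLM.from_pretrained("vaishali/multitabqa-base-geoquery")

# Placeholder training pair: linearized question + tables as source, linearized answer table as target.
pairs = [{
    "source": "How many departments are there? <table_name> : department "
              "col : Department_ID | Name row 1 : 1 | State row 2 : 2 | Treasury",
    "target": "col : count(*) row 1 : 2",
}]
train_ds = Dataset.from_list(pairs)

def preprocess(example):
    enc = tokenizer(example["source"], truncation=True, max_length=1024)
    enc["labels"] = tokenizer(text_target=example["target"], truncation=True, max_length=256)["input_ids"]
    return enc

train_ds = train_ds.map(preprocess, remove_columns=["source", "target"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="multitabqa-geoquery-finetuned",
                                  per_device_train_batch_size=2,
                                  num_train_epochs=1),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```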
BibTeX entry and citation info
@inproceedings{pal-etal-2023-multitabqa,
title = "{M}ulti{T}ab{QA}: Generating Tabular Answers for Multi-Table Question Answering",
author = "Pal, Vaishali and
Yates, Andrew and
Kanoulas, Evangelos and
de Rijke, Maarten",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.348",
doi = "10.18653/v1/2023.acl-long.348",
pages = "6322--6334",
abstract = "Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often over multiple tables in a relational database or web page. Single table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities of tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising of 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery.",
}
📄 License
This project is released under the MIT License.

| Attribute | Details |
|---|---|
| Model type | Multi-table question answering model |
| Training data | vaishali/geoQuery-tableQA |