🚀 TAPAS medium model
This model has two versions that can be used. The default version corresponds to the tapas_inter_masklm_medium_reset checkpoint of the original GitHub repository. It was pre-trained on MLM and an additional step which the authors call intermediate pre-training, and it uses relative position embeddings by default (i.e. resetting the position index at every cell of the table).
The other (non-default) version uses absolute position embeddings and can be loaded with revision="no_reset"; it corresponds to tapas_inter_masklm_medium.
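As a minimal sketch, either version can be selected via the revision argument of from_pretrained; the Hugging Face Hub identifier google/tapas-medium used below is an assumption for illustration, not confirmed by this card:

```python
from transformers import TapasModel

# Default version: relative position embeddings (tapas_inter_masklm_medium_reset).
# The hub identifier "google/tapas-medium" is assumed here.
model = TapasModel.from_pretrained("google/tapas-medium")

# Non-default version: absolute position embeddings (tapas_inter_masklm_medium)
model_no_reset = TapasModel.from_pretrained("google/tapas-medium", revision="no_reset")
```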
Disclaimer: The team releasing TAPAS didn't write a model card for this model. This model card was written by the Hugging Face team and contributors.
✨ Features
- Two-version availability: Offers both relative and absolute position embedding versions.
- Self-supervised pre-training: Pretrained on a large English Wikipedia corpus in a self-supervised way.
- Dual pre-training objectives: Trained on Masked Language Modeling (MLM) and intermediate pre-training for numerical reasoning.
🚀 Quick Start
This model can be used in its raw form to obtain hidden representations of table-question pairs. However, it is mainly designed to be fine-tuned on downstream tasks such as question answering or sequence classification. You can search the model hub for fine-tuned versions.
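A minimal sketch of extracting hidden states with 🤗 Transformers, assuming the hub identifier google/tapas-medium and that pandas is installed:

```python
import pandas as pd
from transformers import TapasTokenizer, TapasModel

model_name = "google/tapas-medium"  # assumed hub identifier
tokenizer = TapasTokenizer.from_pretrained(model_name)
model = TapasModel.from_pretrained(model_name)

# A toy table-question pair; TAPAS expects table cells as strings
table = pd.DataFrame({"Actors": ["Brad Pitt", "Leonardo Di Caprio"],
                      "Number of movies": ["87", "53"]})
queries = ["How many movies does Leonardo Di Caprio have?"]

inputs = tokenizer(table=table, queries=queries, padding="max_length", return_tensors="pt")
outputs = model(**inputs)
hidden_states = outputs.last_hidden_state  # shape: (batch_size, sequence_length, hidden_size)
```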
📚 Documentation
Model description
TAPAS is a BERT-like transformer model. It was pretrained on a large corpus of English Wikipedia data in a self-supervised manner, using only raw tables and associated texts without human labeling. It has two pre-training objectives:
- Masked language modeling (MLM): Given a (flattened) table and associated context, the model randomly masks 15% of the input words and then predicts them. This allows the model to learn a bidirectional representation of tables and associated texts (a minimal sketch of this objective follows the list).
- Intermediate pre-training: To promote numerical reasoning on tables, the model was further pre-trained on a balanced dataset of millions of syntactically created training examples. The model must predict whether a sentence is supported or refuted by the table contents.
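As a sketch of the MLM objective with this checkpoint (the hub identifier google/tapas-medium and the masked sentence below are illustrative assumptions):

```python
import pandas as pd
import torch
from transformers import TapasTokenizer, TapasForMaskedLM

tokenizer = TapasTokenizer.from_pretrained("google/tapas-medium")  # assumed identifier
model = TapasForMaskedLM.from_pretrained("google/tapas-medium")

table = pd.DataFrame({"Actors": ["Brad Pitt", "Leonardo Di Caprio"],
                      "Number of movies": ["87", "53"]})
inputs = tokenizer(table=table,
                   queries=["Brad Pitt appeared in 87 [MASK]."],
                   return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Predict the token behind [MASK]
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_ids = logits[mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```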
Intended uses & limitations
The raw model can be used to obtain hidden representations of table-question pairs, but it is mostly intended to be fine-tuned on a downstream task. Check the model hub for fine-tuned versions on the task that interests you.
Training procedure
Preprocessing
The texts are lowercased and tokenized using WordPiece with a vocabulary size of 30,000. The model inputs are in the form:
[CLS] Sentence [SEP] Flattened table [SEP]
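To see this layout concretely, one can decode a tokenized example; the hub identifier google/tapas-medium below is an assumption:

```python
import pandas as pd
from transformers import TapasTokenizer

tokenizer = TapasTokenizer.from_pretrained("google/tapas-medium")  # assumed identifier

table = pd.DataFrame({"City": ["Paris", "Berlin"], "Population": ["2,100,000", "3,600,000"]})
inputs = tokenizer(table=table, queries=["Which city has more inhabitants?"], return_tensors="pt")

# The decoded ids show the lowercased question followed by the flattened table,
# with [CLS]/[SEP] inserted as described above.
print(tokenizer.decode(inputs["input_ids"][0]))
```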
Pre-training
The model was pre-trained on 32 Cloud TPU v3 cores for 1,000,000 steps, with a maximum sequence length of 512 and a batch size of 512. MLM pre-training alone takes about 3 days. It was also pre-trained on a second task (table entailment). For more details, refer to the original TAPAS paper and the follow-up paper. The optimizer used is Adam with a learning rate of 5e-5 and a warmup ratio of 0.01.
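For reference, a sketch of an optimizer and schedule matching these hyperparameters (Adam, learning rate 5e-5, 1% warmup over 1,000,000 steps); the linear decay, plain PyTorch Adam, and the google/tapas-medium identifier are assumptions, not the authors' original training code:

```python
import torch
from transformers import TapasModel, get_linear_schedule_with_warmup

model = TapasModel.from_pretrained("google/tapas-medium")  # assumed identifier

num_training_steps = 1_000_000
num_warmup_steps = int(0.01 * num_training_steps)  # warmup ratio of 0.01

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)
```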
BibTeX entry and citation info
@misc{herzig2020tapas,
  title={TAPAS: Weakly Supervised Table Parsing via Pre-training},
  author={Jonathan Herzig and Paweł Krzysztof Nowak and Thomas Müller and Francesco Piccinno and Julian Martin Eisenschlos},
  year={2020},
  eprint={2004.02349},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
@misc{eisenschlos2020understanding,
  title={Understanding tables with intermediate pre-training},
  author={Julian Martin Eisenschlos and Syrine Krichene and Thomas Müller},
  year={2020},
  eprint={2010.00571},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
📄 License
This model is licensed under the Apache 2.0 (apache-2.0) license.
| Property | Details |
|----------|---------|
| Model Type | TAPAS medium model with two versions (relative and absolute position embeddings) |
| Training Data | A large corpus of English data from Wikipedia |