# Table LLaVA Model Card
Table LLaVA 7B is an open-source multimodal chatbot. It can understand different table images and fulfill various table-related requests, such as question answering, table cell description, and structure understanding.
See the ACL 2024 paper, *Multimodal Table Understanding*, for more details.
## Documentation
⨠Features
- Datasets:
  - SpursgoZmy/MMTab
  - liuhaotian/LLaVA-Instruct-150K
  - liuhaotian/LLaVA-Pretrain
- Language: en
- Metrics:
- Pipeline Tag: image-text-to-text
## Model Details
- Model Type: Table LLaVA 7B strictly follows the LLaVA-v1.5 model architecture and training pipeline. It uses [CLIP-ViT-L-336px](https://huggingface.co/openai/clip-vit-large-patch14-336) as the visual encoder (336×336 image resolution), [Vicuna-v1.5-7B](https://huggingface.co/lmsys/vicuna-7b-v1.5) as the base LLM, and a two-layer MLP as the vision-language connector.
- Training Pipeline:
  - Pre-training: Train the vision-language connector with image-caption data and table recognition data.
  - Instruction tuning: Train the vision-language connector and the base LLM with multimodal instruction-following data of tabular and non-tabular tasks.
- Code Base: We use the official code of [LLaVA-v1.5](https://github.com/haotian-liu/LLaVA) for model training and inference. The saved model checkpoint is uploaded to this repository, so Table LLaVA can be used in the same way as the standard LLaVA v1.5 model with its original code; a minimal usage sketch is given after this list.
- Model Date: Table LLaVA 7B was trained in January 2024.
- Contact: Send questions or comments about the model here: https://github.com/SpursGoZmy/Table-LLaVA/issues
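Since the checkpoint is saved in the original LLaVA format, inference works the same way as for LLaVA-v1.5. Below is a minimal usage sketch adapted from the LLaVA quickstart, assuming the `llava` package from the official repository is installed; the model id, table image path, and query are illustrative placeholders rather than values from this card.

```python
# Minimal inference sketch with the official LLaVA codebase (https://github.com/haotian-liu/LLaVA).
# The model id, image file, and query below are placeholders, not fixed values from this card.
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model

model_path = "SpursgoZmy/table-llava-v1.5-7b"  # assumed Hugging Face id of this checkpoint

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(model_path),
    "query": "How many rows does this table contain?",  # any table-related request
    "conv_mode": None,
    "image_file": "my_table.png",                        # hypothetical local table image
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)  # prints the model's answer for the table image
```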
## Training dataset
The training data consists of the original LLaVA-1.5 data and specially constructed multimodal instruction-following data from the MMTab dataset, which is a large-scale dataset covering a wide range of table images and table-related tasks.

| Training Stage | Data Description | Data Size | Hugging Face Dataset |
| --- | --- | --- | --- |
| Pre-training | 558K original LLaVA-1.5 pre-training data | 558K | [blip_laion_cc_sbu_558k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) |
| | 150K table recognition data | 150K | MMTab-pre_pretrain_data_llava_format_150K.json |
| Instruction Fine-tuning | 665K original LLaVA-1.5 fine-tuning data | 665K | [llava_v1_5_mix665k.json](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) |
| | 232K multimodal instruction tuning data of 14 tabular tasks | 232K | MMTab-instruct_sft_data_llava_format_232K.json |

We also offer the merged pre-training and instruction fine-tuning data in the MMTab dataset, namely enhanced_llava_pretrain_data_708K.json and enhanced_llava_sft_data_898K.json, which were used to train Table LLaVA.
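The `_llava_format` files above are expected to follow the standard LLaVA instruction-data layout, i.e. each record holds an `id`, an `image` path, and a list of `conversations` turns (this layout is an assumption based on the file naming; check the MMTab dataset card for the authoritative schema). A short inspection sketch:

```python
import json

# Hypothetical local copy of the merged instruction-tuning file downloaded from the
# SpursgoZmy/MMTab dataset; adjust the path to wherever the file is stored.
with open("enhanced_llava_sft_data_898K.json", "r", encoding="utf-8") as f:
    sft_data = json.load(f)

print(f"{len(sft_data)} training samples")

sample = sft_data[0]
# Assumed LLaVA-style record: "id", "image", and alternating "human"/"gpt" turns,
# with an "<image>" token in the first human turn.
print(sample["id"], sample["image"])
for turn in sample["conversations"]:
    print(turn["from"], ":", turn["value"][:80])
```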
## Evaluation dataset
Table LLaVA is evaluated on a collection of 17 held-in and 7 held-out tabular benchmarks, covering 15 table-related tasks such as table question answering and table-to-text generation. We also evaluate Table LLaVA on two non-tabular benchmarks: TextVQA and [llava-bench-in-the-wild](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild).
## License
Table LLaVA is based on LLaVA-1.5 and thus follows its license. Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.
## Intended use
- Primary intended uses: The main use of Table LLaVA is research on large multimodal models and chatbots, especially for multimodal table understanding.
- Primary intended users: The main users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
## Limitations

Table LLaVA takes a single table image as model input; supporting multiple table images would enable more application scenarios. Although Table LLaVA shows strong performance on a wide range of table-based tasks, the resolution of input images (336×336) is relatively low and may limit its capacity. Fortunately, with the emergence of MLLMs that support higher input image resolutions (e.g., Monkey (Li et al., 2023d), LLaVA-Next (Liu et al., 2024)), researchers can use MMTab to develop more powerful tabular MLLMs in future research.
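As a concrete illustration of the resolution constraint, the sketch below runs a table screenshot through the default preprocessing of the CLIP-ViT-L-336px checkpoint; whatever the original size, the image ends up as a 336×336 tensor. The file name is hypothetical, and LLaVA's own pipeline may additionally pad the image to a square before resizing.

```python
from PIL import Image
from transformers import CLIPImageProcessor

# Default preprocessing of the CLIP-ViT-L-336px visual encoder: resize + center crop to 336x336.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

image = Image.open("wide_table.png")  # hypothetical large, text-dense table screenshot
pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]

# Fine print in a large table is downscaled heavily, which is the capacity limit noted above.
print(image.size, "->", tuple(pixel_values.shape))  # e.g. (1920, 1080) -> (1, 3, 336, 336)
```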
Table LLaVA takes one table image as the model input. Processing multiple table images would be beneficial to support more application scenarios. Although Table - LLaVA shows great performance on a wide range of table - based tasks, the resolution of input images (336*336) is relatively low and may limit its capacity. Fortunately, with the emergence of MLLMs with higher input image resolution (e.g., Monkey (Li et al., 2023d), LLaVA - Next (Liu et al., 2024)), researchers can use MMTab to develop more powerful tabular MLLM in future research.