# ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
ChartGemma is a novel chart understanding and reasoning model. It addresses the drawbacks of existing methods by training on instruction-tuning data generated directly from chart images, achieving state-of-the-art results across multiple benchmarks.
## Documentation
The abstract of the paper states:
Given the ubiquity of charts as a data analysis, visualization, and decision-making tool across industries and sciences, there has been a growing interest in developing pre-trained foundation models as well as general purpose instruction-tuned models for chart understanding and reasoning. However, existing methods suffer crucial drawbacks across two critical axes affecting the performance of chart representation models: they are trained on data generated from underlying data tables of the charts, ignoring the visual trends and patterns in chart images, and use weakly aligned vision-language backbone models for domain-specific training, limiting their generalizability when encountering charts in the wild. We address these important drawbacks and introduce ChartGemma, a novel chart understanding and reasoning model developed over PaliGemma. Rather than relying on underlying data tables, ChartGemma is trained on instruction-tuning data generated directly from chart images, thus capturing both high-level trends and low-level visual information from a diverse set of charts. Our simple approach achieves state-of-the-art results across 5 benchmarks spanning chart summarization, question answering, and fact-checking, and our elaborate qualitative studies on real-world charts show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries.
[Paper Link](https://arxiv.org/abs/2407.04172)
## Quick Start
### Web Demo
If you wish to quickly try our model, you can use our public web demo hosted on Hugging Face Spaces, which provides a user-friendly interface!
[ChartGemma Web Demo](https://huggingface.co/spaces/ahmed-masry/ChartGemma)
### Inference
You can easily use our model for inference with the Hugging Face `transformers` library!
You just need to do the following:
- Change `image_path` to the path of your chart image on your system
- Write your query in `input_text`

We recommend using beam search with a beam size of 4, but if your machine has limited memory, you can remove the `num_beams` argument from the `generate` call (a greedy-decoding variant is sketched after the example below).
```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
import torch

# Download an example chart image from the ChartQA dataset
torch.hub.download_url_to_file(
    'https://raw.githubusercontent.com/vis-nlp/ChartQA/main/ChartQA%20Dataset/val/png/multi_col_1229.png',
    'chart_example_1.png'
)

image_path = "chart_example_1.png"
input_text = "program of thought: what is the sum of Facebook Messenger and WhatsApp values in the 18-29 age group?"

# Load the model and processor
model = PaliGemmaForConditionalGeneration.from_pretrained("ahmed-masry/chartgemma", torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained("ahmed-masry/chartgemma")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Process the inputs
image = Image.open(image_path).convert('RGB')
inputs = processor(text=input_text, images=image, return_tensors="pt")
prompt_length = inputs['input_ids'].shape[1]
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate with beam search and decode only the newly generated tokens
generate_ids = model.generate(**inputs, num_beams=4, max_new_tokens=512)
output_text = processor.batch_decode(generate_ids[:, prompt_length:], skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output_text)
```
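As noted above, if memory is limited you can drop `num_beams` so that generation falls back to greedy decoding. A minimal sketch of that variant, reusing the `model`, `processor`, `inputs`, and `prompt_length` objects from the example above:

```python
# Low-memory variant: omit num_beams so generate() falls back to greedy decoding.
# Reuses model, processor, inputs, and prompt_length from the example above.
generate_ids = model.generate(**inputs, max_new_tokens=512)
output_text = processor.batch_decode(generate_ids[:, prompt_length:], skip_special_tokens=True)[0]
print(output_text)
```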
## License
This project is licensed under the MIT license.
## Contact
If you have any questions about this work, please contact Ahmed Masry at amasry17@ku.edu.tr or ahmed.elmasry24653@gmail.com.
## Reference
Please cite our paper if you use our model in your research.
```bibtex
@misc{masry2024chartgemmavisualinstructiontuningchart,
      title={ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild},
      author={Ahmed Masry and Megh Thakkar and Aayush Bajaj and Aaryaman Kartha and Enamul Hoque and Shafiq Joty},
      year={2024},
      eprint={2407.04172},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.04172},
}
```