VisualHeist - figure, scheme and table segmentation from PDFs (with captions, headers & footnotes)
VisualHeist is an object detection model designed to extract tables and figures from PDFs, along with their captions, headers, and footnotes. It offers a practical solution for efficiently processing PDF documents and retrieving valuable visual information.
Quick Start
Refer to our GitHub repository for detailed instructions on how to run the model.
⨠Features
- Two Model Versions: VisualHeist comes in two versions, `visualheist-base` (0.23B) and `visualheist-large` (0.77B). The base model is recommended for low-RAM systems.
- Fine-Tuned from Strong Checkpoints: The models are fine-tuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints, leveraging pre-trained knowledge.
- Inspired by Existing Work: The models are inspired by and adapted from [yifeihu/TF-ID](https://huggingface.co/yifeihu/TF-ID-large).
- Manually Annotated Data: The models were fine-tuned using 3435 figures and 1716 tables from 110 PDF articles. All bounding boxes were manually annotated using [COCO Annotator](https://github.com/jsbroks/coco-annotator).
- Input and Output: The VisualHeist models take an image of a single paper page as input and return image files for all figures, schemes, and tables on that page.
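As a rough illustration of the input/output contract described in the last bullet, the sketch below turns a detection result for one page into integer crop boxes. It assumes the result follows the Florence-2 object-detection layout (a dict with `bboxes` given as `[x1, y1, x2, y2]` pixel coordinates and a parallel `labels` list). The function name `to_crop_boxes`, the `padding` parameter, and the label strings are hypothetical and are not taken from the VisualHeist code.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]


def to_crop_boxes(
    detections: Dict[str, list],
    page_size: Tuple[int, int],
    padding: int = 0,
) -> List[Tuple[str, Box]]:
    """Convert float bounding boxes into integer crop boxes clamped to the page.

    Assumes `detections` looks like
    {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}.
    """
    width, height = page_size
    crops = []
    for (x1, y1, x2, y2), label in zip(detections["bboxes"], detections["labels"]):
        # Round outwards so the crop never cuts into the detected region.
        left = max(0, math.floor(x1) - padding)
        top = max(0, math.floor(y1) - padding)
        right = min(width, math.ceil(x2) + padding)
        bottom = min(height, math.ceil(y2) + padding)
        # Skip degenerate boxes with no area.
        if right > left and bottom > top:
            crops.append((label, (left, top, right, bottom)))
    return crops


# Hypothetical detections for a 600 x 850 pixel page.
example = {
    "bboxes": [[50.4, 100.2, 550.7, 420.9], [-3.0, 700.0, 610.0, 900.5]],
    "labels": ["figure", "table"],
}
for label, box in to_crop_boxes(example, page_size=(600, 850), padding=5):
    print(label, box)
```

Each resulting box is a `(left, upper, right, lower)` tuple, which is the form Pillow's `Image.crop` accepts, so saving the crops as image files is a short step from here.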
Documentation
Model Summary
VisualHeist is an object detection model fine-tuned to extract tables and figures from PDFs. It has two versions:
- `visualheist-base` [[HF]](https://huggingface.co/shixuanleong/visualheist-base) (0.23B)
- `visualheist-large` [[HF]](https://huggingface.co/shixuanleong/visualheist-large) (0.77B)

The base model is recommended if you are running it on low-RAM systems.
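To make the low-RAM recommendation concrete, the sketch below estimates the memory needed for the weights alone from the stated parameter counts (0.23B and 0.77B) and picks a model accordingly. The helper names and the `headroom` factor of 2 (a crude allowance for activations and runtime overhead) are assumptions for illustration, not measured requirements.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Parameter counts as stated in this README.
MODELS = {
    "visualheist-base": 0.23e9,
    "visualheist-large": 0.77e9,
}


def weight_memory_gb(num_params: float, dtype: str = "float32") -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def pick_model(available_gb: float, dtype: str = "float32", headroom: float = 2.0) -> str:
    """Pick the largest model whose weights, times a headroom factor, fit in memory."""
    for name in ("visualheist-large", "visualheist-base"):
        if weight_memory_gb(MODELS[name], dtype) * headroom <= available_gb:
            return name
    return "visualheist-base"  # fall back to the smaller model


for name, params in MODELS.items():
    print(name, round(weight_memory_gb(params), 2), "GB in float32")
print(pick_model(available_gb=4.0))
```

In float32 the weights come to roughly 0.92 GB for the base model and 3.08 GB for the large one, so the base model is the safer choice on a machine with only a few gigabytes free.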
The models are fine-tuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints and are inspired by and adapted from [yifeihu/TF-ID](https://huggingface.co/yifeihu/TF-ID-large).
- The models were fine-tuned with 3435 figures and 1716 tables from 110 PDF articles across various publishers. All bounding boxes were manually annotated using [COCO Annotator](https://github.com/jsbroks/coco-annotator).
- The VisualHeist models take an image of a single paper page as input and return image files for all figures, schemes, and tables on that page.
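Because the models are fine-tuned from Florence-2, they can presumably be called through the same `transformers` interface that the Florence-2 model card documents. The sketch below assumes that interface carries over unchanged, including the `<OD>` object-detection prompt; the project's GitHub repository remains the authoritative reference for running the model. The function name `detect_regions` is hypothetical, and `transformers`, `torch`, and Pillow must be installed before the function can be called.

```python
MODEL_ID = "shixuanleong/visualheist-base"  # model ID from the Hugging Face link above
TASK_PROMPT = "<OD>"  # Florence-2 object-detection prompt (assumed to apply here)


def detect_regions(page_image, model_id: str = MODEL_ID) -> dict:
    """Run detection on one page image (a PIL.Image) and return boxes and labels.

    Heavy dependencies are imported lazily so this module can be imported
    without them installed.
    """
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Florence-2 checkpoints ship custom code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    inputs = processor(text=TASK_PROMPT, images=page_image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        text,
        task=TASK_PROMPT,
        image_size=(page_image.width, page_image.height),
    )
    # Florence-2 returns {"<OD>": {"bboxes": [...], "labels": [...]}}.
    return parsed[TASK_PROMPT]
```

A call would look like `detect_regions(Image.open("page_1.png").convert("RGB"))`, where `page_1.png` is a placeholder for a rendered PDF page. Loading the model inside the function keeps the sketch short; for many pages, load it once and reuse it.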
Training Code and Dataset
- Dataset: Zenodo repository
- Code: [github.com/aspuru-guzik-group/MERMaid](https://github.com/aspuru-guzik-group/MERMaid)
Benchmarks
We manually curated a diverse evaluation dataset consisting of 121 literature articles covering a range of topics, including organic and inorganic chemistry, atmospheric science, batteries, materials science, metal-organic frameworks (MOFs), biology, and science education. These PDFs, published between 1949 and 2025, include both main articles and supplementary materials.
We additionally curated a collection of 98 literature articles (MERMaid-100) reporting novel reaction methodologies that span three distinct chemical domains: organic electrosynthesis, photocatalysis, and organic synthesis.
Additional performance discussion can be found in our preprint article.
The full DOI lists can be downloaded from our Zenodo repository.
The evaluation results for `visualheist-large` are:

| Evaluation set | Total images | F1 score |
|---|---|---|
| All | 1935 | 93% |
| Main | 423 | 96% |
| pre-2000 | 260 | 93% |
| Supplementary Materials | 1252 | 92% |
| MERMaid-100 | 100 | 99% |
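
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from true-positive, false-positive, and false-negative counts; the counts in the first example are invented for illustration and are not from the evaluation. It then runs a rough consistency check on the reported numbers: the Main, pre-2000, and Supplementary Materials image counts sum to the 1935 images reported for All, and their image-weighted mean F1 lands close to the reported 93%. The weighting is an assumption made for this check only and is not necessarily how the overall score was computed.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts, for illustration only.
print(round(f1_score(tp=90, fp=5, fn=10), 3))

# (number of images, reported F1) for three of the evaluation subsets.
subsets = {
    "Main": (423, 0.96),
    "pre-2000": (260, 0.93),
    "Supplementary Materials": (1252, 0.92),
}
total_images = sum(n for n, _ in subsets.values())
weighted_f1 = sum(n * f1 for n, f1 in subsets.values()) / total_images
print(total_images, round(weighted_f1, 3))
```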
Technical Details
The models are fine-tuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints. The training data consists of 3435 figures and 1716 tables from 110 PDF articles across various publishers, with all bounding boxes manually annotated using [COCO Annotator](https://github.com/jsbroks/coco-annotator).
License
This project is licensed under the [MIT License](https://huggingface.co/microsoft/Florence-2-base-ft/resolve/main/LICENSE).
BibTeX and citation info
<To be updated with our archive citation>