VisualHeist-large Open-source Object Detection Model - Freely Extract Charts, Tables, Headers, Footers, etc. from PDFs

Visualheist Large

Developed by shixuanleong

VisualHeist is an object detection model designed to extract figures, schemes, and tables from PDF files, along with their captions, headers, and footnotes.

Object Detection

PyTorch

Open Source License: MIT #PDF chart extraction #Scientific literature processing #Multi-version adaptation

Downloads 1,693

Release Time: 10/28/2024

Model Overview

VisualHeist is a fine-tuned object detection model that identifies and segments charts and tables in PDF documents, so that extracting visual elements from documents can be automated.

Model Features

Multiple version options

Two model sizes are provided, base (0.23B parameters) and large (0.77B parameters), to suit different hardware configurations.

High-quality training data

The models are fine-tuned on 3435 figures and 1716 tables from 110 PDF articles, and all bounding boxes are manually annotated.

Wide applicability

It performs well on literature in various disciplinary fields, including chemistry, materials science, biology, etc.

Model Capabilities

PDF document parsing

Chart detection

Table detection

Schematic detection

Academic literature processing

Use Cases

Academic research

Literature data extraction

Automatically extract chart and table data from scientific research papers

The F1 score reaches 93% (overall)

Document processing

PDF content structuring

Automatically classify and extract visual elements in PDF documents

The F1 score reaches 92% on supplementary materials

🚀 VisualHeist - figure, scheme and table segmentation from PDFs (with captions, headers & footnotes)

VisualHeist is an object detection model designed to extract tables and figures from PDFs, along with their captions, headers, and footnotes. It offers a practical solution for efficiently processing PDF documents and retrieving valuable visual information.

🚀 Quick Start

Refer to our GitHub repository for detailed instructions on how to run the model.

✨ Features

Two Model Versions: VisualHeist comes in two versions, visualheist-base (0.23B) and visualheist-large (0.77B). The base model is recommended for low-RAM systems.
Fine-Tuned from Strong Checkpoints: The models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints, leveraging pre-trained knowledge.
Inspired by Existing Work: Adapted from [yifeihu/TF-ID](https://huggingface.co/yifeihu/TF-ID-large), it benefits from previous research.
Manually Annotated Data: The models were finetuned using 3435 figures and 1716 tables from 110 PDF articles. All bounding boxes are manually annotated using [CoCo Annotator](https://github.com/jsbroks/coco-annotator).
Specific Input-Output: The models take an image of a single paper page as input and return image files for all figures, schemes, and tables on that page.

📦 Installation

No specific installation steps are provided in the original README.

📚 Documentation

Model Summary

VisualHeist is an object detection model finetuned to extract tables and figures from PDFs. It has two versions:

visualheist-base [[HF]](https://huggingface.co/shixuanleong/visualheist-base) (0.23B)
visualheist-large [[HF]](https://huggingface.co/shixuanleong/visualheist-large) (0.77B)

The base model is recommended if you are running it on low-RAM systems.
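To make the low-RAM recommendation concrete, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter counts come from the list above; the bytes-per-parameter figures are the standard sizes for float32 and float16. Activations, the image processor, and framework overhead are not included, so real usage will be higher than these numbers.

```python
# Rough weight-memory estimate. Parameter counts are from the model card;
# bytes-per-parameter values are the standard dtype sizes (an assumption
# about how the weights are loaded, not something the card states).
PARAMS = {"visualheist-base": 0.23e9, "visualheist-large": 0.77e9}
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GB (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n_params in PARAMS.items():
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(n_params, dtype)
        print(f"{name:18s} {dtype:8s} ~{gb:.2f} GB")
```

Under these assumptions the large model needs a little over three times the weight memory of the base model at the same precision.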

The models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints and are inspired by and adapted from [yifeihu/TF-ID](https://huggingface.co/yifeihu/TF-ID-large).

The models were finetuned with 3435 figures and 1716 tables from 110 PDF articles across various publishers. All bounding boxes are manually annotated using [CoCo Annotator](https://github.com/jsbroks/coco-annotator).
The models take an image of a single paper page as the input and return image files for all figures, schemes, and tables on that page.
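The step from detected regions to per-figure image files can be sketched as follows. This is a minimal illustration, not the project's own code: it assumes detections arrive as a dictionary with `bboxes` (pixel coordinates `[x1, y1, x2, y2]`) and `labels`, which is the shape Florence-2 style object detection typically produces. The helper name `to_crop_boxes`, the `margin` parameter, and the label strings are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def to_crop_boxes(detections: Dict[str, list],
                  page_size: Tuple[int, int],
                  margin: int = 0) -> List[Tuple[str, Box]]:
    """Turn raw detections into integer crop boxes clamped to the page.

    ASSUMPTION: `detections` looks like
    {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}.
    Adjust the keys to whatever the actual pipeline returns.
    """
    width, height = page_size
    crops = []
    for (x1, y1, x2, y2), label in zip(detections["bboxes"], detections["labels"]):
        # Round outward so the crop never cuts into the detected region,
        # then pad by `margin` and clamp to the page bounds.
        left = max(0, math.floor(x1) - margin)
        top = max(0, math.floor(y1) - margin)
        right = min(width, math.ceil(x2) + margin)
        bottom = min(height, math.ceil(y2) + margin)
        if right > left and bottom > top:  # skip degenerate boxes
            crops.append((label, (left, top, right, bottom)))
    return crops


# Hypothetical detections on a 612 x 792 pixel page image.
example = {
    "bboxes": [[34.2, 50.9, 580.1, 410.5], [-5.0, 700.0, 620.0, 805.0]],
    "labels": ["figure", "table"],
}
for label, box in to_crop_boxes(example, page_size=(612, 792), margin=4):
    print(label, box)
```

Each resulting tuple follows the `(left, upper, right, lower)` convention, so it can be passed directly to Pillow's `Image.crop` to write one image file per detected region.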

Training Code and Dataset

Dataset: Zenodo repository
Code: [github.com/aspuru-guzik-group/MERMaid](https://github.com/aspuru-guzik-group/MERMaid)

Benchmarks

We manually curated a diverse evaluation dataset consisting of 121 literature articles covering a range of topics, including organic and inorganic chemistry, atmospheric science, batteries, materials science, metal-organic frameworks (MOFs), biology, and science education. These PDFs, published between 1949 and 2025, include both main articles and supplementary materials.

We additionally curated a collection of 98 literature articles (MERMaid-100) reporting novel reaction methodologies that span three distinct chemical domains: organic electrosynthesis, photocatalysis, and organic synthesis.

Additional performance discussion can be found in our preprint article.

The full DOI lists can be downloaded from our Zenodo repository.

The evaluation results for visualheist-large are:

| Evaluation set | Total images | F1 score |
| --- | --- | --- |
| All | 1935 | 93% |
| Main | 423 | 96% |
| Pre-2000 | 260 | 93% |
| Supplementary Materials | 1252 | 92% |
| MERMaid-100 | 100 | 99% |
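For readers unfamiliar with how an F1 score is obtained for a detector, the sketch below shows one common approach: match predicted boxes to ground-truth boxes by intersection over union (IoU), count true positives, false positives, and false negatives, and compute F1 = 2TP / (2TP + FP + FN). The matching rule and the 0.5 IoU threshold are assumptions made for illustration. The source does not state which criterion was used for the figures above.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_counts(preds: List[Box], truths: List[Box],
                 threshold: float = 0.5) -> Tuple[int, int, int]:
    """Greedily match each prediction to its best unused ground-truth box.

    Returns (true positives, false positives, false negatives).
    The 0.5 threshold is an ASSUMED, commonly used default.
    """
    unused = list(range(len(truths)))
    tp = 0
    for p in preds:
        best = max(unused, key=lambda i: iou(p, truths[i]), default=None)
        if best is not None and iou(p, truths[best]) >= threshold:
            unused.remove(best)
            tp += 1
    return tp, len(preds) - tp, len(unused)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy page: two annotated regions, one good prediction and one spurious one.
truths = [(0, 0, 100, 100), (200, 200, 300, 300)]
preds = [(0, 0, 100, 90), (400, 400, 450, 450)]
tp, fp, fn = match_counts(preds, truths)
print(tp, fp, fn, f1_score(tp, fp, fn))
```

In this toy case one prediction matches, one is spurious, and one annotated region is missed, so precision and recall are both 0.5 and so is F1.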

🔧 Technical Details

The models are finetuned from [microsoft/Florence-2](https://huggingface.co/microsoft/Florence-2-large-ft) checkpoints. The training data consists of 3435 figures and 1716 tables from 110 PDF articles across various publishers, with all bounding boxes manually annotated using [CoCo Annotator](https://github.com/jsbroks/coco-annotator).

📄 License

This project is licensed under the [MIT License](https://huggingface.co/microsoft/Florence-2-base-ft/resolve/main/LICENSE).

BibTeX and citation info

<To be updated with our archive citation>
