Florence-2-DocLayNet-Fixed

Developed by yifeihu

Florence-2 model fine-tuned on the DocLayNet dataset, specialized for document layout analysis tasks, with improved performance through simplified category names

Image-to-Text

Safetensors

Open Source License: Apache-2.0

Tags: #Document layout analysis #Scientific paper optimization #Single-token categories

Downloads 95

Release Time: 10/29/2024

Model Overview

This model is a fine-tuned version of Florence-2-large-ft, optimized for document layout analysis tasks, specifically addressing the classification and localization of visual elements in documents.

Model Features

Optimized category names

The original category names are remapped to single tokens, which improved mAP50-95 on the DocLayNet test set by 7% and sped up training and inference

Bounding box quality

Draws bounding boxes with clean edges, avoiding boxes that cut through text and duplicate boxes on the same object

Scientific paper optimization

Excellent performance on scientific paper subsets, achieving 87% mAP50-95

Model Capabilities

Document layout analysis

Visual element detection

Text region recognition

Table detection

Formula recognition

Use Cases

Academic document processing

Paper figure and table recognition

Automatically identifies figures, tables, formulas, and other elements in academic papers

Achieves 87% mAP50-95 on scientific paper subsets

Document digitization

Document structure parsing

Analyzes document layout structure to identify headers, footers, titles, and other elements

Reaches 70% mAP50-95 on the full DocLayNet test set

🚀 Florence-2-DocLayNet-Fixed

This project finetunes the Florence-2-large-ft model on the DocLayNet-v1.1 dataset, optimizing class names to enhance performance and usability.

🚀 Quick Start

Use the following code to start using the model. For non-CUDA environments, refer to this post for a simple patch: https://huggingface.co/microsoft/Florence-2-base/discussions/4

💻 Usage Examples

Basic Usage

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Florence-2 ships custom modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)

# "<OD>" is the Florence-2 object detection task prompt.
prompt = "<OD>"

# Download a sample page and make sure it is a 3-channel RGB image.
url = "https://huggingface.co/yifeihu/TF-ID-base/resolve/main/arxiv_2305_10853_5.png?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Beam search without sampling, so the output is deterministic.
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)

# Decode the tokens, then parse them into bounding boxes and labels in pixel coordinates.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))
print(parsed_answer)

To visualize the results, see this tutorial notebook for more details.
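
A minimal sketch of one way to draw the detections, assuming `parsed_answer` has the `{"<OD>": {"bboxes": [...], "labels": [...]}}` shape that `post_process_generation` returns for the `<OD>` task. The helper name `draw_layout` and the stand-in image and box are illustrative and are not taken from the tutorial notebook.

```python
from PIL import Image, ImageDraw


def draw_layout(image, parsed_answer, task="<OD>", color="red", width=3):
    """Return a copy of `image` with every detected box outlined and labeled."""
    result = parsed_answer[task]
    annotated = image.convert("RGB")  # convert() returns a new image
    draw = ImageDraw.Draw(annotated)
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        draw.rectangle([x1, y1, x2, y2], outline=color, width=width)
        # Put the class name just inside the top-left corner of the box.
        draw.text((x1 + width + 1, y1 + width + 1), label, fill=color)
    return annotated


# Stand-in values so the sketch runs without downloading the model.
demo_image = Image.new("RGB", (200, 100), "white")
demo_answer = {"<OD>": {"bboxes": [[10.0, 10.0, 120.0, 60.0]], "labels": ["Table"]}}
annotated = draw_layout(demo_image, demo_answer)
annotated.save("layout_preview.png")
```

With real model output, pass the `image` and `parsed_answer` variables from the basic usage example instead of the stand-in values.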

✨ Features

We finetuned the Florence-2-large-ft model on the DocLayNet-v1.1 dataset. To prevent the model from generating hallucinated class names, we re-mapped all class names to single tokens:

Original Class Names	New Class Names
Caption	Cap
Footnote	Footnote
Formula	Math
List-item	List
Page-footer	Bottom
Page-header	Header
Picture	Picture
Section-header	Section
Table	Table
Text	Text
Title	Title

By applying this simple change, we observed a 7% improvement in the mAP50-95 score on the DocLayNet test set. Training and inference were also faster thanks to the fewer tokens used by the class names.
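
If downstream tools expect the original DocLayNet names, the table above can be turned into a lookup. This is a small sketch: the dictionary contents come straight from the table, while the names `DOCLAYNET_TO_SHORT`, `SHORT_TO_DOCLAYNET`, and `restore_labels` are illustrative.

```python
# Original DocLayNet class name -> single-token class name used by this model.
DOCLAYNET_TO_SHORT = {
    "Caption": "Cap",
    "Footnote": "Footnote",
    "Formula": "Math",
    "List-item": "List",
    "Page-footer": "Bottom",
    "Page-header": "Header",
    "Picture": "Picture",
    "Section-header": "Section",
    "Table": "Table",
    "Text": "Text",
    "Title": "Title",
}

# Inverse mapping, used to translate model output back to DocLayNet names.
SHORT_TO_DOCLAYNET = {short: original for original, short in DOCLAYNET_TO_SHORT.items()}


def restore_labels(labels):
    """Map model labels back to DocLayNet names; unknown labels pass through unchanged."""
    return [SHORT_TO_DOCLAYNET.get(label, label) for label in labels]


print(restore_labels(["Cap", "Math", "Bottom", "Table"]))
# ['Caption', 'Formula', 'Page-footer', 'Table']
```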

Judging by the mAP50-95 score, this model is far from SOTA on the DocLayNet test set (70%). Much smaller YOLO models (https://github.com/ppaanngggg/yolo-doclaynet) have much better benchmark results (~79%). On the subset of scientific articles, this model performed on par with the best YOLO models (87% mAP50-95).
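
For context, mAP50-95 follows the COCO convention of averaging average precision over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below only illustrates the IoU matching step at the heart of that metric; it is not a full mAP implementation, and the example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds behind "50-95": 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

ground_truth = [0, 0, 100, 100]
prediction = [0, 0, 100, 82]  # made-up box that clips the bottom of the region

score = iou(prediction, ground_truth)
passed = [t for t in IOU_THRESHOLDS if score >= t]
print(f"IoU = {score:.2f}, counted as a match at {len(passed)} of {len(IOU_THRESHOLDS)} thresholds")
```

In this example a box that clips part of the region still counts as a match at most thresholds, so the metric only partly reflects how clean the box edges are.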

However, after some qualitative analysis (paper coming soon), we found that Florence-2 is much better at drawing bounding boxes with clean edges. YOLO models sometimes cut text in the middle or draw multiple bounding boxes on the same object. These behaviors are not heavily penalized by mAP50-95 but are painful to deal with in real-world use cases. Florence-2's detection output carries no confidence scores, so when calculating the mAP scores we had to manually set the confidence score to 1 for all Florence-2 outputs.
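
As a rough illustration of the fixed confidence score, the sketch below converts a parsed `<OD>` result into COCO-style detection records with `score` set to 1.0. It is not necessarily how the scores above were computed: the function name `to_coco_detections`, the `image_id`, and the category IDs are hypothetical and would need to match the annotation file used for evaluation.

```python
def to_coco_detections(parsed_answer, image_id, category_ids, task="<OD>"):
    """Convert a parsed <OD> result into COCO-style detection records."""
    result = parsed_answer[task]
    detections = []
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        if label not in category_ids:
            continue  # skip labels outside the known class list
        detections.append({
            "image_id": image_id,
            "category_id": category_ids[label],
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # COCO boxes are [x, y, width, height]
            "score": 1.0,  # the model output has no confidence, so use a constant
        })
    return detections


# Hypothetical category IDs and a stand-in model output.
category_ids = {"Table": 9, "Text": 10}
parsed = {"<OD>": {"bboxes": [[10.0, 20.0, 110.0, 70.0]], "labels": ["Table"]}}
records = to_coco_detections(parsed, image_id=1, category_ids=category_ids)
print(records)
```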

📚 Documentation

BibTeX and citation info

@misc{TF-ID,
  author = {Yifei Hu},
  title = {TF-ID: Table/Figure IDentifier for academic papers},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ai8hyf/TF-ID}},
}

@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
  doi = {10.1145/3534678.3539043},
  url = {https://arxiv.org/abs/2206.01062},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022}
}

📄 License

This project is licensed under the Apache-2.0 license.
