Florence-2-DocLayNet-Fixed
This project fine-tunes the Florence-2-large-ft model on the DocLayNet-v1.1 dataset, with the class names re-mapped to single tokens to improve accuracy and speed.
Quick Start
Use the following code to start using the model. For non-CUDA environments, refer to this post for a simple patch: https://huggingface.co/microsoft/Florence-2-base/discussions/4
Usage Examples
Basic Usage
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# trust_remote_code=True loads the custom Florence-2 modeling code from the Hub.
model = AutoModelForCausalLM.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("yifeihu/Florence-2-DocLayNet-Fixed", trust_remote_code=True)

# "<OD>" is the object-detection task prompt.
prompt = "<OD>"

# Download a sample page image.
url = "https://huggingface.co/yifeihu/TF-ID-base/resolve/main/arxiv_2305_10853_5.png?download=true"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Beam search without sampling, so the output is deterministic.
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    do_sample=False,
    num_beams=3
)

# Keep special tokens: the post-processor needs the location tokens to recover the boxes.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))

print(parsed_answer)
To visualize the results, see this tutorial notebook for more details.
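As a rough illustration of what such a visualization might look like, the sketch below draws each detected box and its label onto the image with Pillow. It assumes `parsed_answer` has the shape Florence-2 normally returns for `<OD>`, i.e. `{"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}`. The helper names `iter_detections` and `draw_detections` are hypothetical and not part of the model's API.

```python
def iter_detections(parsed_answer, task="<OD>"):
    """Yield (label, (x1, y1, x2, y2)) pairs from a parsed Florence-2 result.

    Assumes the usual <OD> layout: {"<OD>": {"bboxes": [...], "labels": [...]}}.
    """
    result = parsed_answer.get(task, {})
    for label, box in zip(result.get("labels", []), result.get("bboxes", [])):
        yield label, tuple(float(v) for v in box)


def draw_detections(image, parsed_answer, color="red"):
    """Return a copy of a PIL image with every detected box and label drawn on it."""
    from PIL import ImageDraw  # imported lazily so iter_detections works without Pillow

    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for label, (x1, y1, x2, y2) in iter_detections(parsed_answer):
        draw.rectangle([x1, y1, x2, y2], outline=color, width=2)
        draw.text((x1 + 4, y1 + 4), label, fill=color)
    return annotated
```

With the variables from the basic usage example, `draw_detections(image, parsed_answer).save("layout.png")` would write the annotated page to disk.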
Features
We fine-tuned the Florence-2-large-ft model (Hugging Face) on the DocLayNet-v1.1 dataset. To prevent the model from generating hallucinated class names, we re-mapped every class name to a single token:
| Original Class Names | New Class Names |
| --- | --- |
| Caption | Cap |
| Footnote | Footnote |
| Formula | Math |
| List-item | List |
| Page-footer | Bottom |
| Page-header | Header |
| Picture | Picture |
| Section-header | Section |
| Table | Table |
| Text | Text |
| Title | Title |
With this simple change, we observed a 7% improvement in mAP50-95 on the DocLayNet test set. Training and inference were also faster because the class names use fewer tokens.
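When downstream tooling expects the original DocLayNet class names, the re-mapping can be reversed with a small lookup. The sketch below mirrors the class-name table; the names `NEW_TO_ORIGINAL` and `to_original_label` are hypothetical helpers, not something shipped with the model.

```python
# New single-token labels (as emitted by this model) -> original DocLayNet class names.
NEW_TO_ORIGINAL = {
    "Cap": "Caption",
    "Footnote": "Footnote",
    "Math": "Formula",
    "List": "List-item",
    "Bottom": "Page-footer",
    "Header": "Page-header",
    "Picture": "Picture",
    "Section": "Section-header",
    "Table": "Table",
    "Text": "Text",
    "Title": "Title",
}


def to_original_label(label):
    """Map a model label back to its DocLayNet name; unknown labels pass through."""
    cleaned = label.strip()
    return NEW_TO_ORIGINAL.get(cleaned, cleaned)
```

Unknown labels are passed through rather than rejected, so an unexpected class name stays visible to the caller.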
By mAP50-95, this model (70%) is far from SOTA on the DocLayNet test set. Much smaller YOLO models (https://github.com/ppaanngggg/yolo-doclaynet) achieve much better benchmark results (~79%). On the subset of scientific articles, this model performed on par with the best YOLO models (87%) in terms of mAP50-95.
However, our qualitative analysis (paper coming soon) found that Florence-2 is much better at drawing bounding boxes with clean edges. YOLO models sometimes cut through the middle of a line of text or draw multiple bounding boxes on the same object. These behaviors are not seriously punished by mAP50-95 but are painful to deal with in real-world use cases. Because Florence-2's output carries no confidence scores, we had to manually set the confidence score to 1 for all Florence-2 outputs when calculating the mAP scores.
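The confidence workaround can be sketched as follows: the parsed `<OD>` output holds boxes and labels but no scores, so each detection is given a constant score of 1.0 before it is passed to an mAP evaluator. The COCO-style record layout (`[x, y, width, height]` boxes, `category_id`, `score`), the function name `to_coco_detections`, and the `category_ids` mapping are illustrative assumptions. The evaluation code actually used for the numbers above is not shown in this README.

```python
def to_coco_detections(parsed_answer, image_id, category_ids, task="<OD>"):
    """Convert a parsed Florence-2 <OD> result into COCO-style detection records.

    category_ids maps label strings to integer category ids (an assumed,
    caller-supplied mapping). Every record gets score 1.0.
    """
    result = parsed_answer.get(task, {})
    detections = []
    for label, (x1, y1, x2, y2) in zip(result.get("labels", []), result.get("bboxes", [])):
        if label not in category_ids:
            continue  # skip labels outside the known class list
        detections.append({
            "image_id": image_id,
            "category_id": category_ids[label],
            "bbox": [x1, y1, x2 - x1, y2 - y1],  # corners -> [x, y, width, height]
            "score": 1.0,  # constant stand-in, since the model reports no confidence
        })
    return detections
```

With every score fixed at 1.0, the detections cannot be ranked by confidence, which is one reason the mAP comparison with score-producing detectors should be read with care.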
Documentation
BibTeX and citation info
@misc{TF-ID,
  author = {Yifei Hu},
  title = {TF-ID: Table/Figure IDentifier for academic papers},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ai8hyf/TF-ID}},
}

@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
  doi = {10.1145/3534678.3539043},
  url = {https://arxiv.org/abs/2206.01062},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022}
}
License
This project is licensed under the Apache-2.0 license.