# DiT-base-layout-detection
This model, cmarkea/dit-base-layout-detection, extracts layout entities (Text, Picture, Caption, Footnote, etc.) from document images. It is a fine-tuned version of [dit-base](https://huggingface.co/microsoft/dit-base) on the DocLayNet dataset, making it well suited to preparing document corpora for ingestion into Open-Domain Question Answering (ODQA) systems.
## 🚀 Quick Start
The model cmarkea/dit-base-layout-detection extracts 11 entity classes from document images, namely: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
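To see exactly which classes the checkpoint predicts, you can read them off the model configuration. A minimal sketch (it only assumes `transformers` and `torch` are installed):

```python
from transformers import BeitForSemanticSegmentation

model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
# id2label maps each segmentation index to its class name;
# index 0 is presumably the background class (the advanced
# usage example below skips it)
for idx, label in sorted(model.config.id2label.items()):
    print(idx, label)
```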
## ✨ Features
- Entity Extraction: capable of extracting 11 different entity classes from document images.
- Fine-Tuned: based on the [dit-base](https://huggingface.co/microsoft/dit-base) model, fine-tuned on the DocLayNet dataset.
- Suitable for ODQA: ideal for processing document corpora to be ingested into an ODQA system.
## 📚 Documentation
### Performance
In this section, we evaluate the model from two angles: semantic segmentation and object detection. For semantic segmentation, we report the per-pixel F1-score; for object detection, we report the Generalized Intersection over Union (GIoU) and the classification accuracy of the predicted bounding boxes. The evaluation is conducted on 500 pages from the PDF evaluation set of DocLayNet.
| Class | F1-score (×100) | GIoU (×100) | Accuracy (×100) |
|:---|:---:|:---:|:---:|
| Background | 94.98 | NA | NA |
| Caption | 75.54 | 55.61 | 72.62 |
| Footnote | 72.29 | 50.08 | 70.97 |
| Formula | 82.29 | 49.91 | 94.48 |
| List-item | 67.56 | 35.19 | 69.00 |
| Page-footer | 83.93 | 57.99 | 94.06 |
| Page-header | 62.33 | 65.25 | 79.39 |
| Picture | 78.32 | 58.22 | 92.71 |
| Section-header | 69.55 | 56.64 | 78.29 |
| Table | 83.69 | 63.03 | 90.13 |
| Text | 90.94 | 51.89 | 88.09 |
| Title | 61.19 | 52.64 | 70.00 |
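For reference, here is a minimal sketch of the GIoU metric for two `[x1, y1, x2, y2]` boxes. It follows the standard definition of the metric and is not necessarily the exact evaluation code used above:

```python
def giou(box_a, box_b):
    """Generalized IoU for two [x1, y1, x2, y2] boxes."""
    # intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # union area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # smallest axis-aligned box enclosing both
    ex1, ey1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex2, ey2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    # GIoU = IoU - (enclosing area not covered by the union) / enclosing area
    return inter / union - (enclose - union) / enclose

print(giou([0, 0, 10, 10], [5, 5, 15, 15]))  # ≈ -0.079
```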
### Benchmark
Here is how this model compares with [cmarkea/detr-layout-detection](https://huggingface.co/cmarkea/detr-layout-detection):
| Model | F1-score (×100) | GIoU (×100) | Accuracy (×100) |
|:---|:---:|:---:|:---:|
| cmarkea/dit-base-layout-detection | 90.77 | 56.29 | 85.26 |
| [cmarkea/detr-layout-detection](https://huggingface.co/cmarkea/detr-layout-detection) | 91.27 | 80.66 | 90.46 |
## 💻 Usage Examples
### Basic Usage
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, BeitForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)

# load the document page to analyze (placeholder path)
img: Image.Image = Image.open("page.png").convert("RGB")

with torch.inference_mode():
    # preprocess the image into model inputs
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)
    # PIL's .size is (width, height); target_sizes expects (height, width)
    segmentation = img_proc.post_process_semantic_segmentation(
        output,
        target_sizes=[img.size[::-1]]
    )
```
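The post-processing step returns one segmentation map per input image, with one class index per pixel. As a quick illustration (variable names follow the snippet above), you can map the indices found in the prediction back to class names:

```python
# segmentation[0] is a (height, width) tensor of class indices
pred = segmentation[0]
present = pred.unique().tolist()
print([model.config.id2label[int(i)] for i in present])
```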
### Advanced Usage
Here is a simple method for deriving bounding boxes from the semantic segmentation. This is the method used to compute the object-detection results reported in the "Performance" section. It is provided as-is, without any additional post-processing.
```python
import cv2
import numpy as np

def detect_bboxes(masks: np.ndarray):
    r"""
    A simple bounding box detection function applied to a binary mask.
    """
    detected_blocks = []
    # find the external contours of the mask
    contours, _ = cv2.findContours(
        masks.astype(np.uint8),
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE
    )
    for contour in list(contours):
        # keep only contours with at least 4 points
        if len(list(contour)) >= 4:
            # smallest upright rectangle enclosing the contour
            x, y, width, height = cv2.boundingRect(contour)
            bounding_box = [x, y, x + width, y + height]
            detected_blocks.append(bounding_box)
    return detected_blocks

bbox_pred = []
for segment in segmentation:
    boxes, labels = [], []
    # skip index 0, the background class
    for ii in range(1, len(model.config.label2id)):
        mm = segment == ii
        if mm.sum() > 0:
            bbx = detect_bboxes(mm.numpy())
            boxes.extend(bbx)
            labels.extend([ii] * len(bbx))
    bbox_pred.append(dict(boxes=boxes, labels=labels))
```
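From there you could, for example, overlay the predicted boxes on the original page. A minimal sketch with Pillow, reusing `img`, `model`, and `bbox_pred` from the snippets above (the drawing logic is illustrative, not part of the model card's evaluation code):

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(img)
for box, label in zip(bbox_pred[0]["boxes"], bbox_pred[0]["labels"]):
    # draw each predicted box with its class name
    draw.rectangle(box, outline="red", width=2)
    draw.text((box[0], box[1]), model.config.id2label[label], fill="red")
img.save("layout_pred.png")
```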
### Example

## 📄 License
This project is licensed under the Apache-2.0 license.
## 📖 Citation
```bibtex
@online{DeDitLay,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/dit-base-layout-detection},
  YEAR = {2024},
  KEYWORDS = {Image Processing ; Transformers ; Layout},
}
```