# DiT-base-layout-detection
This model, cmarkea/dit-base-layout-detection, extracts layout entities (Text, Picture, Caption, Footnote, etc.) from document images. It is a fine-tuned version of [dit-base](https://huggingface.co/microsoft/dit-base) on the DocLayNet dataset, making it well suited to preparing document corpora for ingestion into Open-Domain Question Answering (ODQA) systems.
## 🚀 Quick Start
The model cmarkea/dit-base-layout-detection extracts 11 entity classes from document images, namely: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
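To see exactly which classes the checkpoint predicts, you can read them off the model configuration. A minimal sketch (it only assumes `transformers` and `torch` are installed):

```python
from transformers import BeitForSemanticSegmentation

model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
# id2label maps each segmentation index to its class name;
# index 0 is presumably the background class (the advanced
# usage example below skips it)
for idx, label in sorted(model.config.id2label.items()):
    print(idx, label)
```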
## ✨ Features
- Entity Extraction: capable of extracting 11 different entity classes from document images.
- Fine-Tuned: based on the [dit-base](https://huggingface.co/microsoft/dit-base) model, fine-tuned on the DocLayNet dataset.
- Suitable for ODQA: ideal for processing document corpora to be ingested into an ODQA system.
## 📚 Documentation
### Performance
In this section, we evaluate the model from two angles: semantic segmentation and object detection. For semantic segmentation, we report the per-pixel F1-score; for object detection, we report the Generalized Intersection over Union (GIoU) and the classification accuracy of the predicted bounding boxes. The evaluation is conducted on 500 pages from the PDF evaluation set of DocLayNet.
| Class | F1-score (×100) | GIoU (×100) | Accuracy (×100) |
|:---|:---:|:---:|:---:|
| Background | 94.98 | NA | NA |
| Caption | 75.54 | 55.61 | 72.62 |
| Footnote | 72.29 | 50.08 | 70.97 |
| Formula | 82.29 | 49.91 | 94.48 |
| List-item | 67.56 | 35.19 | 69.00 |
| Page-footer | 83.93 | 57.99 | 94.06 |
| Page-header | 62.33 | 65.25 | 79.39 |
| Picture | 78.32 | 58.22 | 92.71 |
| Section-header | 69.55 | 56.64 | 78.29 |
| Table | 83.69 | 63.03 | 90.13 |
| Text | 90.94 | 51.89 | 88.09 |
| Title | 61.19 | 52.64 | 70.00 |
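For reference, here is a minimal sketch of the GIoU metric for two `[x1, y1, x2, y2]` boxes. It follows the standard definition of the metric and is not necessarily the exact evaluation code used above:

```python
def giou(box_a, box_b):
    """Generalized IoU for two [x1, y1, x2, y2] boxes."""
    # intersection area
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # union area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # smallest axis-aligned box enclosing both
    ex1, ey1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex2, ey2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    # GIoU = IoU - (enclosing area not covered by the union) / enclosing area
    return inter / union - (enclose - union) / enclose

print(giou([0, 0, 10, 10], [5, 5, 15, 15]))  # ≈ -0.079
```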
### Benchmark
Here is how this model compares with [cmarkea/detr-layout-detection](https://huggingface.co/cmarkea/detr-layout-detection):
| Model | F1-score (×100) | GIoU (×100) | Accuracy (×100) |
|:---|:---:|:---:|:---:|
| cmarkea/dit-base-layout-detection | 90.77 | 56.29 | 85.26 |
| [cmarkea/detr-layout-detection](https://huggingface.co/cmarkea/detr-layout-detection) | 91.27 | 80.66 | 90.46 |
## 💻 Usage Examples
### Basic Usage
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, BeitForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)

# load the document page to analyze (placeholder path)
img: Image.Image = Image.open("page.png").convert("RGB")

with torch.inference_mode():
    # preprocess the image into model inputs
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)
    # PIL's .size is (width, height); target_sizes expects (height, width)
    segmentation = img_proc.post_process_semantic_segmentation(
        output,
        target_sizes=[img.size[::-1]]
    )
```
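The post-processing step returns one segmentation map per input image, with one class index per pixel. As a quick illustration (variable names follow the snippet above), you can map the indices found in the prediction back to class names:

```python
# segmentation[0] is a (height, width) tensor of class indices
pred = segmentation[0]
present = pred.unique().tolist()
print([model.config.id2label[int(i)] for i in present])
```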
### Advanced Usage
Here is a simple method for deriving bounding boxes from the semantic segmentation. This is the method used to compute the object-detection results reported in the "Performance" section. It is provided as-is, without any additional post-processing.
```python
import cv2
import numpy as np

def detect_bboxes(masks: np.ndarray):
    r"""
    A simple bounding box detection function applied to a binary mask.
    """
    detected_blocks = []
    # find the external contours of the mask
    contours, _ = cv2.findContours(
        masks.astype(np.uint8),
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE
    )
    for contour in list(contours):
        # keep only contours with at least 4 points
        if len(list(contour)) >= 4:
            # smallest upright rectangle enclosing the contour
            x, y, width, height = cv2.boundingRect(contour)
            bounding_box = [x, y, x + width, y + height]
            detected_blocks.append(bounding_box)
    return detected_blocks

bbox_pred = []
for segment in segmentation:
    boxes, labels = [], []
    # skip index 0, the background class
    for ii in range(1, len(model.config.label2id)):
        mm = segment == ii
        if mm.sum() > 0:
            bbx = detect_bboxes(mm.numpy())
            boxes.extend(bbx)
            labels.extend([ii] * len(bbx))
    bbox_pred.append(dict(boxes=boxes, labels=labels))
```
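From there you could, for example, overlay the predicted boxes on the original page. A minimal sketch with Pillow, reusing `img`, `model`, and `bbox_pred` from the snippets above (the drawing logic is illustrative, not part of the model card's evaluation code):

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(img)
for box, label in zip(bbox_pred[0]["boxes"], bbox_pred[0]["labels"]):
    # draw each predicted box with its class name
    draw.rectangle(box, outline="red", width=2)
    draw.text((box[0], box[1]), model.config.id2label[label], fill="red")
img.save("layout_pred.png")
```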
### Example

## 📄 License
This project is licensed under the Apache-2.0 license.
## 📖 Citation
```bibtex
@online{DeDitLay,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/dit-base-layout-detection},
  YEAR = {2024},
  KEYWORDS = {Image Processing ; Transformers ; Layout},
}
```