dit-base-layout-detection: an open-source document layout detection model

Dit Base Layout Detection

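Developed by cmarkea

A document image layout detection model fine-tuned from microsoft/dit-base; it recognizes 11 classes of document elements.

Image segmentation

Transformers

License: Apache-2.0 #document-layout-analysis #multi-element-segmentation #PDF-structure-recognition

Downloads: 704

Released: July 18, 2024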

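Model overview

The model extracts layout elements (text, pictures, captions, footnotes and so on) from document images. It is particularly suited to document collections that will be ingested into an open-domain question-answering (ODQA) system.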

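Model features

Multi-class document element recognition

Recognizes 11 classes of document elements, including captions, footnotes, formulas, list items, page headers and page footers.

Fine-tuned on DocLayNet

Fine-tuned on the DocLayNet dataset specifically for document layout analysis.

Two evaluation views

Evaluated both as a semantic segmentation model and as an object detection model.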

Model capabilities

Document image analysis

Layout element recognition

Semantic segmentation

Object detection

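Use cases

Document processing

Document preprocessing for open-domain question answering

Automatically identifies and classifies the elements of each document while preparing a corpus for an ODQA system.

Gives the documents more structure, which helps the question-answering system interpret them.

Document digitization

Automatically identifies the type of each region when converting scanned documents into a structured digital format.

Improves the efficiency and accuracy of digitization.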

🚀 DIT-base-layout-detection

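The DIT-base-layout-detection model (cmarkea/dit-base-layout-detection) extracts layout elements such as text, pictures, captions and footnotes from document images. It is dit-base fine-tuned on the DocLayNet dataset, and it is well suited to processing document corpora destined for an open-domain question-answering (ODQA) system. The model predicts 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text and Title.

✨ Key features

Extracts multiple kinds of layout elements from document images.
Fine-tuned on the DocLayNet dataset and suited to processing document corpora.
Predicts 11 distinct entity classes.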

💻 Usage examples

Basic usage

import torch
from PIL import Image
from transformers import AutoImageProcessor, BeitForSemanticSegmentation

img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)
model = BeitForSemanticSegmentation.from_pretrained(
    "cmarkea/dit-base-layout-detection"
)

# Placeholder path (an assumption): replace with your own document page image.
img = Image.open("page.png").convert("RGB")

with torch.inference_mode():
    inputs = img_proc(img, return_tensors="pt")
    output = model(**inputs)

# One (height, width) tensor of class ids per input image.
segmentation = img_proc.post_process_semantic_segmentation(
    output,
    target_sizes=[img.size[::-1]]
)
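As a quick sanity check on the segmentation output, it can help to see how much of the page each class covers. The sketch below is not part of the original model card: `class_pixel_share` is a hypothetical helper, and the toy map and id-to-name mapping are made up for illustration. It works on plain nested lists so it runs without the model.

```python
from collections import Counter


def class_pixel_share(label_map, id2label):
    """Return {class name: fraction of pixels} for one 2-D map of class ids.

    Hypothetical helper for illustration; `label_map` is a list of rows.
    """
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return {
        id2label.get(label, f"id_{label}"): count / total
        for label, count in sorted(counts.items())
    }


# Toy 2x4 label map and a made-up id-to-name mapping (assumptions, not the
# model's real configuration).
toy_map = [
    [0, 0, 10, 10],
    [0, 9, 10, 10],
]
toy_id2label = {0: "Background", 9: "Table", 10: "Text"}

shares = class_pixel_share(toy_map, toy_id2label)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With the real model, the same helper could be called as `class_pixel_share(segmentation[0].tolist(), model.config.id2label)`, assuming the configuration's `id2label` uses integer keys.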

Advanced usage

import cv2
import numpy as np

def detect_bboxes(masks: np.ndarray):
    r"""
    A simple bounding box detection function
    """
    detected_blocks = []
    contours, _ = cv2.findContours(
        masks.astype(np.uint8),
        cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE
    )
    for contour in contours:
        if len(contour) >= 4:
            # smallest rectangle containing all points
            x, y, width, height = cv2.boundingRect(contour)
            bounding_box = [x, y, x + width, y + height]
            detected_blocks.append(bounding_box)
    return detected_blocks

bbox_pred = []
for segment in segmentation:
    boxes, labels = [], []
    # Class id 0 is the background, so start at 1.
    for ii in range(1, len(model.config.label2id)):
        mm = segment == ii  # binary mask of the pixels predicted as class ii
        if mm.sum() > 0:
            bbx = detect_bboxes(mm.numpy())
            boxes.extend(bbx)
            labels.extend([ii]*len(bbx))
    bbox_pred.append(dict(boxes=boxes, labels=labels))
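Contour-based detection tends to produce small spurious boxes, and the label ids are not very readable on their own. The following sketch shows one optional way to attach class names and drop tiny boxes. It is an illustration only: `name_and_filter`, the `min_area` threshold of 100 pixels, and the toy id-to-name mapping are assumptions, not something the model card prescribes.

```python
def name_and_filter(pred, id2label, min_area=100):
    """Pair each box with its class name and drop boxes smaller than min_area.

    Hypothetical helper; `pred` has the same shape as one entry of bbox_pred:
    {"boxes": [[x0, y0, x1, y1], ...], "labels": [class_id, ...]}.
    """
    kept = []
    for (x0, y0, x1, y1), label in zip(pred["boxes"], pred["labels"]):
        area = (x1 - x0) * (y1 - y0)
        if area >= min_area:
            kept.append({
                "label": id2label.get(label, str(label)),
                "box": [x0, y0, x1, y1],
                "area": area,
            })
    return kept


# Toy prediction and a made-up id-to-name mapping (assumptions for the demo).
toy_pred = {"boxes": [[10, 10, 210, 60], [5, 5, 8, 8]], "labels": [10, 2]}
toy_id2label = {2: "Footnote", 10: "Text"}

blocks = name_and_filter(toy_pred, toy_id2label)
for block in blocks:
    print(block["label"], block["box"], block["area"])
```

With the real model, each entry of `bbox_pred` could be passed together with `model.config.id2label`. A filter like this changes the detections, so it should be left out when reproducing scores that were computed without post-processing.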

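📚 Documentation

Performance evaluation

In this section we evaluate the model separately as a semantic segmentation model and as an object detection model. For semantic segmentation, no post-processing is applied. For object detection, only OpenCV's findContours function is applied, with no further post-processing.

For semantic segmentation, we use the F1 score to evaluate the classification of each pixel. For object detection, we evaluate performance with the Generalized Intersection over Union (GIoU) and the accuracy of the predicted bounding box class. The evaluation is run on 500 pages from the PDF evaluation dataset of DocLayNet.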
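For readers unfamiliar with GIoU, the sketch below computes it for a single pair of boxes in the `[x0, y0, x1, y1]` format used by the detection code. It follows the standard definition (IoU minus the share of the smallest enclosing box not covered by the union). How the authors matched predicted boxes to ground-truth boxes is not described here, so this covers only the pairwise score; the function name `giou` is illustrative.

```python
def giou(box_a, box_b):
    """Generalized IoU of two [x0, y0, x1, y1] boxes with positive area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)

    # Intersection rectangle (zero width or height if the boxes do not overlap).
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))

    iou = inter / union
    return iou - (enclose - union) / enclose


same = giou([0, 0, 2, 2], [0, 0, 2, 2])         # identical boxes
overlap = giou([0, 0, 2, 2], [1, 1, 3, 3])      # partial overlap
disjoint = giou([0, 0, 1, 1], [2, 0, 3, 1])     # no overlap
print(same, round(overlap, 4), round(disjoint, 4))
```

GIoU ranges from -1 to 1, so unlike plain IoU it still distinguishes near misses from distant misses when the boxes do not overlap. The tables report it multiplied by 100.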

Class	F1 score (x100)	GIoU (x100)	Accuracy (x100)
Background	94.98	NA	NA
Caption	75.54	55.61	72.62
Footnote	72.29	50.08	70.97
Formula	82.29	49.91	94.48
List-item	67.56	35.19	69
Page-footer	83.93	57.99	94.06
Page-header	62.33	65.25	79.39
Picture	78.32	58.22	92.71
Section-header	69.55	56.64	78.29
Table	83.69	63.03	90.13
Text	90.94	51.89	88.09
Title	61.19	52.64	70
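A per-class pixel F1 like the one in the first column can be computed from a predicted label map and a ground-truth label map as sketched below. This is a minimal illustration on nested lists with made-up toy maps. Whether the authors pooled pixel counts over all pages or averaged per-page scores is not stated, so the helper `pixel_f1` scores a single pair of maps only.

```python
def pixel_f1(pred_map, true_map, class_id):
    """Pixel-level F1 for one class, given two 2-D maps of class ids."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_map, true_map):
        for pred, true in zip(pred_row, true_row):
            if pred == class_id and true == class_id:
                tp += 1
            elif pred == class_id:
                fp += 1
            elif true == class_id:
                fn += 1
    denom = 2 * tp + fp + fn
    # The class is absent from both maps: the score is undefined.
    return 2 * tp / denom if denom else float("nan")


# Toy 2x3 maps (made up for the demo).
pred_map = [[1, 1, 0],
            [0, 2, 2]]
true_map = [[1, 0, 0],
            [0, 2, 1]]

f1_by_class = {c: pixel_f1(pred_map, true_map, c) for c in (0, 1, 2)}
print(f1_by_class)
```

The expression `2 * tp / (2 * tp + fp + fn)` is the usual F1 (harmonic mean of precision and recall) written in terms of raw counts, which avoids dividing by zero when precision or recall alone is undefined.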

Benchmark

The table below compares this model's performance with that of another model.

Model	F1 score (x100)	GIoU (x100)	Accuracy (x100)
cmarkea/dit-base-layout-detection	90.77	56.29	85.26
cmarkea/detr-layout-detection	91.27	80.66	90.46

📄 License

This project is released under the Apache-2.0 license.

Citation

@online{DeDitLay,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/dit-base-layout-detection},
  YEAR = {2024},
  KEYWORDS = {Image Processing ; Transformers ; Layout},
}