tatr_tab_struct_v2: an open-source, freely available model for table structure recognition

Tatr Tab Struct V2

Developed by deepdoctection

A DETR-architecture model trained on the PubTables1M and FinTabNet datasets, dedicated to table structure recognition.

Text Recognition

Transformers

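#TableStructureRecognition #SpanningCellDetection #FinancialDocumentProcessing

Downloads: 99

Released: 9/4/2023

Model Overview

The model uses a Transformer architecture to recognize rows, columns, column headers, and spanning cells in tables, which makes it suitable for document digitization workflows.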

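Model Features

Spanning cell recognition

Recognizes merged cells and structures that span multiple rows or columns.

Multi-element detection

Detects several layout elements at once: tables, rows, columns, and column headers.

Optimized edge handling

For best results, pad the table crop by about 5 pixels on each side.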
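The padding recommendation can be illustrated with a small helper. This is a minimal sketch, not part of deepdoctection: the function name `padded_crop_box` and the `(left, top, right, bottom)` box convention are assumptions made for the example.

```python
def padded_crop_box(box, image_size, pad=5):
    """Expand a (left, top, right, bottom) box by `pad` pixels, clamped to the image.

    Hypothetical helper for illustration; deepdoctection applies its own
    padding through the PT.ITEM.PAD.* settings.
    """
    left, top, right, bottom = box
    width, height = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


# A table detected at (100, 200)-(400, 350) on an 800x600 page.
print(padded_crop_box((100, 200, 400, 350), (800, 600)))  # (95, 195, 405, 355)
```

Clamping to the image bounds keeps the crop valid when a table sits close to the page edge.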

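Model Capabilities

Table region detection

Row and column structure recognition

Column header classification

Merged (spanning) cell detection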
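Because the model predicts rows and columns rather than individual cells, a cell grid is usually derived afterwards by intersecting the two. The sketch below shows one common way to do this. It assumes boxes in `(x0, y0, x1, y1)` form; the function name `grid_cells` is hypothetical, and deepdoctection's own segmentation logic may differ.

```python
def grid_cells(rows, cols):
    """Intersect row and column boxes (x0, y0, x1, y1) into a row-major cell list.

    Illustrative only: each cell takes its horizontal extent from the column
    and its vertical extent from the row.
    """
    rows = sorted(rows, key=lambda b: b[1])  # top to bottom
    cols = sorted(cols, key=lambda b: b[0])  # left to right
    cells = []
    for r_idx, (_, ry0, _, ry1) in enumerate(rows, start=1):
        for c_idx, (cx0, _, cx1, _) in enumerate(cols, start=1):
            cells.append({"row": r_idx, "col": c_idx, "box": (cx0, ry0, cx1, ry1)})
    return cells


rows = [(0, 20, 100, 40), (0, 0, 100, 20)]  # deliberately out of order
cols = [(0, 0, 50, 40), (50, 0, 100, 40)]
for cell in grid_cells(rows, cols):
    print(cell)
```

Spanning cells would then be handled separately, for example by merging the grid cells that a detected spanning box covers.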

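Use Cases

Document digitization

Financial statement parsing

Automatically recognizes the structure of data in complex financial statements

Extracts row and column relationships and spanning-cell data

Scientific literature processing

Extracts data tables from academic papers

Preserves the hierarchy of the original table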

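🚀 Microsoft Table Transformer Model for Table Structure Recognition

This project is a Microsoft Table Transformer model for table structure recognition, trained on the PubTables1M and FinTabNet datasets. It identifies the structure of a table so that the table's data can be processed and analyzed downstream.

🚀 Quick Start

Model Registration

If you do not yet have a deepdoctection profile for this model, register it with the following code: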

import deepdoctection as dd

dd.ModelCatalog.register("deepdoctection/tatr_tab_struct_v2/pytorch_model.bin", dd.ModelProfile(
    name="deepdoctection/tatr_tab_struct_v2/pytorch_model.bin",
    description="Table Transformer (DETR) model trained on PubTables1M. It was introduced in the paper "
                "Aligning benchmark datasets for table structure recognition by Smock et "
                "al. This model is devoted to table structure recognition and assumes to receive a slightly cropped "
                "table as input. It will predict rows, column and spanning cells. Use a padding of around 5 pixels",
    size=[115511753],
    tp_model=False,
    config="deepdoctection/tatr_tab_struct_v2/config.json",
    preprocessor_config="deepdoctection/tatr_tab_struct_v2/preprocessor_config.json",
    hf_repo_id="deepdoctection/tatr_tab_struct_v2",
    hf_model_name="pytorch_model.bin",
    hf_config_file=["config.json", "preprocessor_config.json"],
    categories={
        "1": dd.LayoutType.table,
        "2": dd.LayoutType.column,
        "3": dd.LayoutType.row,
        "4": dd.CellType.column_header,
        "5": dd.CellType.projected_row_header,
        "6": dd.CellType.spanning,
    },
    dl_library="PT",
    model_wrapper="HFDetrDerivedDetector",
))

Running the Model

When running the model in the deepdoctection analyzer, you can adjust the segmentation parameters to improve predictions:

import deepdoctection as dd

analyzer = dd.get_dd_analyzer(
    reset_config_file=True,
    config_overwrite=[
        "PT.ITEM.WEIGHTS=deepdoctection/tatr_tab_struct_v2/pytorch_model.bin",
        "PT.ITEM.FILTER=['table']",
        "PT.ITEM.PAD.TOP=5",
        "PT.ITEM.PAD.RIGHT=5",
        "PT.ITEM.PAD.BOTTOM=5",
        "PT.ITEM.PAD.LEFT=5",
        "SEGMENTATION.THRESHOLD_ROWS=0.9",
        "SEGMENTATION.THRESHOLD_COLS=0.9",
        "SEGMENTATION.REMOVE_IOU_THRESHOLD_ROWS=0.3",
        "SEGMENTATION.REMOVE_IOU_THRESHOLD_COLS=0.3",
        "WORD_MATCHING.MAX_PARENT_ONLY=True",
    ],
)
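
To give an intuition for the `REMOVE_IOU_THRESHOLD_ROWS=0.3` and `REMOVE_IOU_THRESHOLD_COLS=0.3` settings, the sketch below shows how an IoU (intersection over union) threshold can be used to drop overlapping row or column boxes. It is an illustration of the general idea, not deepdoctection's implementation. The names `iou` and `remove_overlapping` are hypothetical, and the example assumes the boxes arrive ordered by preference (for instance by detection score).

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def remove_overlapping(boxes, threshold=0.3):
    """Greedily keep a box only if its IoU with every kept box is <= threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept


rows = [(0, 0, 100, 20), (0, 5, 100, 25), (0, 20, 100, 40)]
print(remove_overlapping(rows))  # [(0, 0, 100, 20), (0, 20, 100, 40)]
```

In this example the second row overlaps the first with an IoU of 0.6, so it is dropped. A lower threshold removes more near-duplicate detections, and a higher one tolerates more overlap.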