Florence-2-large-TableDetection Open-source Table Detection Model - Accurately Locate Table Regions in Images

Florence 2 Large TableDetection

Developed by ucsahin

A multimodal table detection model fine-tuned from Florence-2 to locate table regions in images.

Image-to-Text

Transformers

Open Source License: MIT #Table Detection #Multimodal Model #Document Processing

Downloads 1,993

Release Time: 6/24/2024

Model Overview

This is a multimodal language model fine-tuned for the task of detecting tables in images given text prompts. The model uses a combination of image and text inputs to predict the bounding boxes around tables in the provided images.

Model Features

Multimodal Input

Processes both image and text inputs simultaneously to achieve more precise table detection.

High-precision Detection

Specifically fine-tuned to accurately identify table areas in images.

End-to-end Solution

A complete solution from input images to output bounding boxes.

Model Capabilities

Table Detection in Images

Bounding Box Prediction

Multimodal Processing

Use Cases

Document Processing

PDF Table Extraction

Automatically detect and extract tables from scanned PDF documents.

Accurately identify table positions for subsequent data extraction.

Data Extraction

Table Data Digitization

Convert tables in paper documents to digital formats.

Improve data entry efficiency and reduce manual operations.

🚀 Florence-2-large-TableDetection

This model is a multimodal language model fine-tuned for table detection in images with textual prompts, aiming to automate table detection in various applications.

🚀 Quick Start

This model is a fine-tuned version of microsoft/Florence-2-large-ft on the ucsahin/pubtables-detection-1500-samples dataset. It achieves a loss of 0.7601 on the evaluation set.

The base model, microsoft/Florence-2-large-ft, can detect various objects in a zero-shot setting with the task prompt "<OD>". See the Florence-2-large sample inference for usage. However, that base model cannot detect tables in a given image.

The following Colab notebook shows how to finetune the model with custom data for object detection:

Florence2-Object Detection-Finetuning-HF-Trainer.ipynb

✨ Features

This model is a multimodal language model fine-tuned for detecting tables in images given textual prompts. It uses a combination of image and text inputs to predict bounding boxes around tables in the provided images.
Its main purpose is to automate the table detection process in images. It can be used in various applications like document processing, data extraction, and image analysis, where identifying tables in images is crucial.

📦 Installation

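The original model card does not provide specific installation steps. The usage example in this card imports transformers, matplotlib, and datasets, so those packages (along with PyTorch) need to be available.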
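
As a minimal sketch, the following check reports which packages are missing from the current environment before the model is loaded. The helper name `missing_packages` is hypothetical, and the package list is an assumption inferred from the imports and framework versions listed in this card, not an official requirements file.

```python
from importlib.util import find_spec

# Packages inferred from the imports and framework versions in this card
# (an assumption, not an official requirements list).
REQUIRED = ["torch", "transformers", "datasets", "matplotlib"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any package reported as missing can then be installed with the package manager of your choice.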

💻 Usage Examples

Basic Usage

In Transformers, you can load the model and run inference as follows. Note that trust_remote_code=True is required to run the model; it only downloads the external custom code from the original HuggingFaceM4/Florence-2-DocVQA.

from transformers import AutoProcessor, AutoModelForCausalLM
import matplotlib.pyplot as plt
import matplotlib.patches as patches

model_id = "ucsahin/Florence-2-large-TableDetection"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="cuda") # load the model on GPU
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_example(task_prompt, image, max_new_tokens=128):
    prompt = task_prompt
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
      input_ids=inputs["input_ids"].cuda(),
      pixel_values=inputs["pixel_values"].cuda(),
      max_new_tokens=max_new_tokens,
      early_stopping=False,
      do_sample=False,
      num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer

def plot_bbox(image, data):
    # Create a figure and axes
    fig, ax = plt.subplots()
    # Display the image
    ax.imshow(image)
    # Plot each bounding box
    for bbox, label in zip(data['bboxes'], data['labels']):
        # Unpack the bounding box coordinates
        x1, y1, x2, y2 = bbox
        # Create a Rectangle patch
        rect = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=1, edgecolor='r', facecolor='none')
        # Add the rectangle to the Axes
        ax.add_patch(rect)
        # Annotate the label on the same Axes
        ax.text(x1, y1, label, color='white', fontsize=8, bbox=dict(facecolor='red', alpha=0.5))
    # Remove the axis ticks and labels
    ax.axis('off')
    # Show the plot
    plt.show()

########### Inference
from datasets import load_dataset

dataset = load_dataset("ucsahin/pubtables-detection-1500-samples")

example_id = 5
image = dataset["train"][example_id]["image"]

parsed_answer = run_example("<OD>", image=image)
plot_bbox(image, parsed_answer["<OD>"])
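
For downstream data extraction, the detected boxes usually need to be turned into integer crop regions that stay inside the image. The following is a minimal sketch under the assumption that the parsed result has the same `bboxes` and `labels` keys that `plot_bbox` reads above; the helper name `to_crop_boxes` and its `padding` parameter are hypothetical and not part of the model or the processor.

```python
import math


def to_crop_boxes(od_result, image_size, padding=0):
    """Convert float [x1, y1, x2, y2] boxes to integer (left, upper, right, lower)
    tuples, optionally padded and clamped to the image bounds."""
    width, height = image_size
    boxes = []
    for x1, y1, x2, y2 in od_result["bboxes"]:
        # Round outward so the crop never cuts into the detected region.
        left = max(0, math.floor(x1) - padding)
        upper = max(0, math.floor(y1) - padding)
        right = min(width, math.ceil(x2) + padding)
        lower = min(height, math.ceil(y2) + padding)
        # Skip degenerate boxes that collapse after clamping.
        if right > left and lower > upper:
            boxes.append((left, upper, right, lower))
    return boxes


# Hypothetical detection result on a 100x300 image, for illustration only.
example = {"bboxes": [[10.4, 20.6, 110.2, 220.9]], "labels": ["table"]}
print(to_crop_boxes(example, image_size=(100, 300), padding=5))
# → [(5, 15, 100, 226)]
```

If `image` is a PIL image, as the use of `image.width` and `image.height` in `run_example` suggests, each tuple can be passed to `image.crop(box)` to obtain one cropped image per detected table.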

📚 Documentation

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-06
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 10
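
To make the linear schedule concrete, here is a minimal sketch of how the learning rate would evolve over training. It assumes no warmup steps (none are listed in this card) and a total of 1690 optimizer steps, which is the final step count in the training results in this card. The function `linear_lr` is a hypothetical illustration of linear decay to zero, not code taken from the training run.

```python
def linear_lr(step, base_lr=1e-6, total_steps=1690, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the end of each epoch (169 steps per epoch in the results table).
for epoch in range(1, 11):
    print(epoch, f"{linear_lr(epoch * 169):.2e}")
```

Under these assumptions the rate starts at 1e-06, is halved by the end of epoch 5, and reaches zero at step 1690.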

Training results

Training Loss	Epoch	Step	Validation Loss
1.3199	1.0	169	1.0372
0.7922	2.0	338	0.9169
0.6824	3.0	507	0.8411
0.6109	4.0	676	0.8168
0.5752	5.0	845	0.7915
0.5605	6.0	1014	0.7862
0.5291	7.0	1183	0.7740
0.517	8.0	1352	0.7683
0.5139	9.0	1521	0.7642
0.5005	10.0	1690	0.7601
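
The table can be summarized with a few lines of Python. This sketch copies the values from the table above and reports the best validation epoch, the per-epoch drop in validation loss, and the final gap between validation and training loss; the function name `summarize` is hypothetical.

```python
# (epoch, step, training_loss, validation_loss), copied from the table above.
RESULTS = [
    (1, 169, 1.3199, 1.0372),
    (2, 338, 0.7922, 0.9169),
    (3, 507, 0.6824, 0.8411),
    (4, 676, 0.6109, 0.8168),
    (5, 845, 0.5752, 0.7915),
    (6, 1014, 0.5605, 0.7862),
    (7, 1183, 0.5291, 0.7740),
    (8, 1352, 0.5170, 0.7683),
    (9, 1521, 0.5139, 0.7642),
    (10, 1690, 0.5005, 0.7601),
]


def summarize(results):
    """Return the best validation epoch, per-epoch validation drops, and final loss gap."""
    best = min(results, key=lambda row: row[3])
    drops = [round(prev[3] - cur[3], 4) for prev, cur in zip(results, results[1:])]
    final_gap = round(results[-1][3] - results[-1][2], 4)
    return {
        "best_epoch": best[0],
        "best_val_loss": best[3],
        "val_drops": drops,
        "final_gap": final_gap,
    }


print(summarize(RESULTS))
```

The validation loss falls at every epoch, but the drops shrink from about 0.12 after the first epoch to about 0.004 in the last two, while the final validation loss stays roughly 0.26 above the training loss.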

Framework versions

Transformers 4.42.0.dev0
Pytorch 2.3.0+cu121
Datasets 2.20.0
Tokenizers 0.19.1

📄 License

This model is released under the MIT license.

📦 Information Table

Property	Details
Model Type	Fine-tuned multimodal language model for table detection
Training Data	ucsahin/pubtables-detection-1500-samples
