dit-large-finetuned-rvlcdip Open-source Document Image Classification Model

Developed by microsoft

A Transformer-based document image classification model, pre-trained on IIT-CDIP and fine-tuned on RVL-CDIP.

Image Classification

Transformers

#Document Image Classification #Self-supervised Pretraining #RVL-CDIP Fine-tuning

Release Time : 3/7/2022

Model Overview

This model is a Transformer encoder pre-trained in a self-supervised manner on a large-scale document image collection and then fine-tuned for document image classification.

Model Features

Large-scale Pretraining

Pretrained on 42 million document images from IIT-CDIP dataset

Domain-specific Fine-tuning

Fine-tuned on RVL-CDIP document image dataset containing 16 categories

Transformer Architecture

Uses the same Transformer encoder architecture as BEiT

Self-supervised Learning

Pretrained using masked image patch prediction task

Model Capabilities

Document image classification

Document feature extraction

Image patch encoding

Use Cases

Document Processing

Document Classification

Classify document images into 16 predefined categories

Performs well on RVL-CDIP dataset

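Table Detection

Identify table regions in documents (requires separately fine-tuning the pre-trained DiT encoder for detection; this checkpoint is fine-tuned for classification)

Document Layout Analysis

Analyze document layout structure (likewise requires separately fine-tuning the pre-trained DiT encoder)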

🚀 Document Image Transformer (large-sized model)

The Document Image Transformer (DiT) is pre-trained on a large-scale document image dataset and fine-tuned for various document-related tasks, offering powerful feature extraction capabilities for document images.

🚀 Quick Start

The Document Image Transformer (DiT) model is pre-trained on IIT-CDIP (Lewis et al., 2006), a dataset with 42 million document images. It is then fine-tuned on [RVL-CDIP](https://www.cs.cmu.edu/~aharley/rvl-cdip/), which consists of 400,000 grayscale images divided into 16 classes, with 25,000 images per class. This model was introduced in the paper DiT: Self-supervised Pre-training for Document Image Transformer by Li et al. and first released in this repository. Note that DiT has the same architecture as BEiT.

Disclaimer: The team releasing DiT didn't write a model card for this model, so this model card is written by the Hugging Face team.

✨ Features

Self-supervised Pre-training: The DiT model is pre-trained in a self-supervised manner on a large collection of images, enabling it to learn an inner representation of images.
Versatile Downstream Tasks: It can be fine-tuned for various document-related tasks such as document image classification, table detection, and document layout analysis.

📚 Documentation

Model description

The Document Image Transformer (DiT) is a transformer encoder model (similar to BERT) pre-trained on a large collection of images in a self-supervised way. The pre-training goal is to predict visual tokens from the encoder of a discrete VAE (dVAE) based on masked patches.
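To make the pre-training goal more concrete, here is a minimal sketch of how masked positions and their prediction targets could be paired up. It is a simplification under stated assumptions, not the authors' training code: positions are sampled uniformly, the 40% mask ratio, the 196-patch sequence, and the 8192-token vocabulary are illustrative values not given in this card, and the function names are hypothetical.

```python
import random


def sample_masked_positions(num_patches, mask_ratio=0.4, seed=None):
    """Pick patch positions to hide from the encoder.

    Uniform sampling is a simplification chosen for this sketch.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def prediction_targets(visual_tokens, masked_positions):
    """Map each masked position to the dVAE token id to be predicted there."""
    return {pos: visual_tokens[pos] for pos in masked_positions}


# Toy data: one hypothetical visual token id per patch (assumed vocabulary of 8192).
token_rng = random.Random(0)
visual_tokens = [token_rng.randrange(8192) for _ in range(196)]

masked = sample_masked_positions(len(visual_tokens), mask_ratio=0.4, seed=0)
targets = prediction_targets(visual_tokens, masked)
print(len(targets))  # 78 positions whose tokens must be predicted
```

The encoder only sees the unmasked patches in full; the training signal comes from how well it recovers the token ids stored in `targets`.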

Images are fed into the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Absolute position embeddings are added before the sequence is fed into the Transformer encoder layers.
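As a rough illustration of the patch sequence described above, the sketch below computes how many 16x16 patches an image yields and how long each flattened patch is before the linear embedding. The 224x224 RGB input size is an assumption for the example, not a figure from this card; the checkpoint's image processor configuration is the authoritative source. The helper names are hypothetical.

```python
from typing import Tuple


def patch_grid(height: int, width: int, patch_size: int = 16) -> Tuple[int, int]:
    """Return (rows, cols) of non-overlapping patches covering the image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch_size, width // patch_size


def sequence_length(height: int, width: int, patch_size: int = 16) -> int:
    """Number of patch tokens fed to the encoder (special tokens excluded)."""
    rows, cols = patch_grid(height, width, patch_size)
    return rows * cols


def patch_vector_size(patch_size: int = 16, channels: int = 3) -> int:
    """Length of one flattened patch before the linear embedding."""
    return patch_size * patch_size * channels


# Assumed 224x224 RGB input (not stated in this card).
print(patch_grid(224, 224))       # (14, 14)
print(sequence_length(224, 224))  # 196
print(patch_vector_size())        # 768
```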

Through pre-training, the model learns an internal representation of images, which can be used to extract features useful for downstream tasks. For example, if you have a labeled document image dataset, you can train a standard classifier by adding a linear layer on top of the pre-trained encoder.
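The linear layer mentioned above amounts to one dot product per class plus a bias. The following dependency-free sketch shows that computation on toy numbers; in practice the feature vector would come from the encoder and the weights would be learned with a framework such as PyTorch. The function name and all values are hypothetical.

```python
def linear_head(features, weights, bias):
    """Compute class logits: one dot product per class plus a bias."""
    if len(weights) != len(bias):
        raise ValueError("need one weight row and one bias per class")
    logits = []
    for row, b in zip(weights, bias):
        if len(row) != len(features):
            raise ValueError("weight row length must match feature length")
        logits.append(sum(w * x for w, x in zip(row, features)) + b)
    return logits


# Toy example: a 4-dimensional feature vector and 3 classes (hypothetical values).
features = [1.0, 0.0, 2.0, -1.0]
weights = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
bias = [0.0, 0.5, 0.0]

print(linear_head(features, weights, bias))  # [0.5, 2.5, -1.0]
```

For RVL-CDIP the head would have 16 rows, one per class, and the predicted class is the index of the largest logit.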

Intended uses & limitations

You can use the raw model to encode document images into a vector space, but it is mainly designed to be fine-tuned for tasks like document image classification, table detection, or document layout analysis. Check the model hub to find fine-tuned versions for tasks that interest you.
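One common way to work with images encoded into a vector space is to compare their embeddings by cosine similarity. The sketch below assumes the embeddings have already been extracted and are available as plain lists of floats; the toy vectors and helper names are hypothetical, and this is only one possible way to use such embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Index of the candidate embedding closest to the query."""
    return max(
        range(len(candidates)),
        key=lambda i: cosine_similarity(query, candidates[i]),
    )


# Toy 2-dimensional embeddings (real encoder outputs are much larger).
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]]
print(most_similar(query, candidates))  # 1
```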

💻 Usage Examples

Basic Usage

Here is how to use this model in PyTorch:

from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from PIL import Image

# Load the document image and convert it to 3-channel RGB
image = Image.open('path_to_your_document_image').convert('RGB')

processor = AutoImageProcessor.from_pretrained("microsoft/dit-large-finetuned-rvlcdip")
model = AutoModelForImageClassification.from_pretrained("microsoft/dit-large-finetuned-rvlcdip")

# Preprocess the image and return PyTorch tensors
inputs = processor(images=image, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# model predicts one of the 16 RVL-CDIP classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
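If probabilities or a ranked shortlist are more useful than a single label, the logits can be passed through a softmax. The sketch below is dependency-free and works on a plain list of floats, such as one obtained with `logits[0].tolist()`. The three-entry label mapping and the logit values are hypothetical placeholders; the real mapping is `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and label mapping, for illustration only.
example_logits = [2.0, 0.5, -1.0]
example_id2label = {0: "letter", 1: "form", 2: "email"}

for label, prob in top_k(example_logits, example_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

A low top probability can serve as a cue to route a document to manual review, though any threshold would need to be chosen on held-out data.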

BibTeX entry and citation info

@article{Lewis2006BuildingAT,
  title={Building a test collection for complex document information processing},
  author={David D. Lewis and Gady Agam and Shlomo Engelson Argamon and Ophir Frieder and David A. Grossman and Jefferson Heard},
  journal={Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval},
  year={2006}
}
