dit-base-finetuned-rvlcdip: Open-Source Document Image Classification Model

Dit Base Finetuned Rvlcdip

Developed by Microsoft

DiT is a Transformer-based document image classification model, pretrained on the IIT-CDIP dataset and fine-tuned on the RVL-CDIP dataset

Image Classification

Transformers

#Document Image Classification #Self-supervised Pretraining #Multi-category Recognition

Release Time: 3/7/2022

Model Overview

This model is pretrained on a large collection of document images through self-supervised learning and is intended primarily for document image classification. It can also encode document images into vector representations.

Model Features

Self-supervised Pretraining

Pretrained on a large-scale collection of document images by predicting visual tokens for masked image patches

Document Image Classification

Classification capability specifically optimized for document images, supporting 16 document categories

Transformer Architecture

Adopts the same Transformer architecture as BEiT, suitable for processing image data

Model Capabilities

Document Image Classification

Document Feature Extraction

Image Encoding

Use Cases

Document Management

Automatic Document Classification

Automatically classifies scanned documents into one of 16 categories, such as advertisements and scientific publications.

Performs well on the RVL-CDIP dataset

Information Extraction

Document Layout Analysis

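With further fine-tuning, the DiT encoder can identify different regions and structures within documents; this checkpoint itself is fine-tuned for classification only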

🚀 Document Image Transformer (base-sized model)

The Document Image Transformer (DiT) is pre-trained on a large document image dataset and fine-tuned on RVL-CDIP. It can be used for various document-related downstream tasks.

🚀 Quick Start

The Document Image Transformer (DiT) model is pre-trained on IIT-CDIP (Lewis et al., 2006), a dataset with 42 million document images. It is then fine-tuned on RVL-CDIP, a dataset of 400,000 grayscale images divided into 16 classes (25,000 images per class). It was introduced in the paper DiT: Self-supervised Pre-training for Document Image Transformer by Li et al. and first released in this repository. Note that DiT has the same architecture as BEiT.

Disclaimer: The team releasing DiT did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

Tags: dit, vision, image-classification
Datasets: rvl_cdip
Widget examples:
- Advertisement
- Scientific publication

📚 Documentation

Model description

The Document Image Transformer (DiT) is a transformer encoder model (BERT-like) pre-trained on a large collection of images in a self-supervised fashion. The pre-training objective for the model is to predict visual tokens from the encoder of a discrete VAE (dVAE), based on masked patches.
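The objective described above can be illustrated with a small, dependency-free sketch: some patch positions are masked, and the loss is the cross-entropy between the model's predictions and the dVAE visual tokens at those positions only. Everything concrete here is an assumption made for illustration: the sequence length of 196, the toy vocabulary of 32 tokens, the 40% mask ratio, uniform random masking, and the random stand-ins for the dVAE tokenizer and the encoder output. It is not the authors' training code.

```python
import math
import random


def masked_token_loss(predicted_logits, visual_tokens, masked_positions):
    """Mean cross-entropy against the visual tokens, over masked positions only."""
    total = 0.0
    for pos in masked_positions:
        logits = predicted_logits[pos]
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Negative log-probability of the target visual token.
        total += log_norm - logits[visual_tokens[pos]]
    return total / len(masked_positions)


rng = random.Random(0)
NUM_PATCHES = 196  # assumed sequence length (hypothetical)
VOCAB_SIZE = 32    # toy visual-token vocabulary (hypothetical)
MASK_RATIO = 0.4   # assumed fraction of masked patches (hypothetical)

# Stand-in for the dVAE tokenizer: one discrete token id per patch.
visual_tokens = [rng.randrange(VOCAB_SIZE) for _ in range(NUM_PATCHES)]

# Choose which patch positions are hidden from the encoder.
masked_positions = rng.sample(range(NUM_PATCHES), int(MASK_RATIO * NUM_PATCHES))

# Stand-in for an untrained encoder: identical logits for every token.
uniform_logits = [[0.0] * VOCAB_SIZE for _ in range(NUM_PATCHES)]

loss = masked_token_loss(uniform_logits, visual_tokens, masked_positions)
print(round(loss, 4))  # equals log(VOCAB_SIZE) for uniform predictions
```

A model that guesses uniformly scores log(vocabulary size); pre-training drives this value down by making the correct visual token more likely at each masked position.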

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder.
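To make the patch arithmetic concrete, the following sketch splits an image into non-overlapping 16x16 patches in row-major order. The 224x224 input size is an assumption (this card does not state the input resolution), and plain nested lists stand in for real image tensors.

```python
def patchify(image, patch_size=16):
    """Split a 2-D image (a list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order: left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Assumed 224x224 single-channel input (hypothetical; not stated in this card).
side = 224
image = [[(row * side + col) % 256 for col in range(side)] for row in range(side)]
patches = patchify(image)

num_patches = len(patches)          # (224 / 16) ** 2
values_per_patch = len(patches[0])  # 16 * 16 values for one channel
print(num_patches, values_per_patch)  # 196 256
```

Under that assumption the encoder sees a sequence of 196 patches; with three color channels each patch would carry 768 raw values before the linear embedding.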

Through pre-training, the model learns an inner representation of images that can then be used to extract features useful for downstream tasks. For instance, if you have a dataset of labeled document images, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder.
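As a rough illustration of what "a linear layer on top of the encoder" computes, the sketch below pools per-patch embeddings into one feature vector and applies a linear map to obtain class logits. Mean pooling is one common choice and is an assumption here, as are the toy sizes and values; a real head would be trained with a deep learning framework on the encoder's actual outputs.

```python
def mean_pool(patch_embeddings):
    """Average the per-patch embeddings into a single feature vector."""
    count = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    return [sum(vec[i] for vec in patch_embeddings) / count for i in range(dim)]


def linear_head(features, weights, bias):
    """Compute logits[c] = dot(weights[c], features) + bias[c] for each class c."""
    return [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, bias)
    ]


# Toy stand-in for encoder output: 2 patches, embedding size 3 (hypothetical).
patch_embeddings = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]

# Toy head for 2 classes (hypothetical values; normally learned from labels).
weights = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.5]

features = mean_pool(patch_embeddings)
logits = linear_head(features, weights, bias)
predicted_class = max(range(len(logits)), key=logits.__getitem__)
print(features, logits, predicted_class)
```

Only the head's weights and bias need to be learned if the encoder is kept frozen; fine-tuning the encoder as well is the alternative.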

Intended uses & limitations

You can use the raw model for encoding document images into a vector space, but it's mostly meant to be fine-tuned on tasks like document image classification, table detection or document layout analysis. See the model hub to look for fine-tuned versions on a task that interests you.

💻 Usage Examples

Basic Usage

from transformers import AutoImageProcessor, AutoModelForImageClassification
import torch
from PIL import Image

image = Image.open('path_to_your_document_image').convert('RGB')

processor = AutoImageProcessor.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")
model = AutoModelForImageClassification.from_pretrained("microsoft/dit-base-finetuned-rvlcdip")

inputs = processor(images=image, return_tensors="pt")

# inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# model predicts one of the 16 RVL-CDIP classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
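When a single label is not enough, the logits can be turned into probabilities and ranked. The sketch below is a dependency-free version of that step; the function name `top_k_predictions`, the example logits, and the four placeholder labels are hypothetical and stand in for the 16-entry `logits` and `model.config.id2label` from the example above.

```python
import math


def top_k_predictions(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs from raw logits."""
    # Softmax with the maximum subtracted for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank class indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels, for illustration only.
example_logits = [0.1, 2.0, -1.0, 0.5]
example_id2label = {0: "class_a", 1: "class_b", 2: "class_c", 3: "class_d"}

top2 = top_k_predictions(example_logits, example_id2label, k=2)
for label, prob in top2:
    print(f"{label}: {prob:.3f}")
```

With the model loaded as above, the same function could be called as `top_k_predictions(logits[0].tolist(), model.config.id2label)`.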

BibTeX entry and citation info

@article{Lewis2006BuildingAT,
  title={Building a test collection for complex document information processing},
  author={David D. Lewis and Gady Agam and Shlomo Engelson Argamon and Ophir Frieder and David A. Grossman and Jefferson Heard},
  journal={Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval},
  year={2006}
}
