# Donut KYC Document Reading Model
Donut is an end-to-end visual document understanding (VDU) model for general document image understanding; this checkpoint is trained specifically for Indian KYC documents.
## Quick Start
To get started, load the checkpoint as shown in the snippet below; the full inference example is given under Usage Examples.
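A minimal first step (assuming the `transformers` library is installed) is to load the processor and model from the Hub:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load the KYC checkpoint from the Hugging Face Hub
processor = DonutProcessor.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")
model = VisionEncoderDecoderModel.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")
```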
## Features
- End-to-end design: Donut is a self-contained VDU model that does not rely on OCR-related modules.
- Transformer-based architecture: composed of a visual encoder and a textual decoder, both based on the Transformer architecture, enabling easy end-to-end training.
- Multi-function for KYC: can classify and read the contents of Aadhar, PAN, and Voter documents, detect orientation, and distinguish between coloured and black-and-white documents.
## Documentation

### Model description
Donut is an end-to-end (i.e., self-contained) VDU model for the general understanding of document images. Its architecture is quite simple, consisting of a Transformer-based visual encoder and a Transformer-based textual decoder.
Donut does not rely on any OCR-related modules; instead, the visual encoder extracts features directly from the given document image.
The textual decoder then maps the derived features into a sequence of subword tokens to construct the desired structured output (e.g., JSON). Because each component is Transformer-based, the model is easily trained in an end-to-end manner.
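As a small illustration of the decoder-to-JSON step, `DonutProcessor.token2json` converts the flat token sequence produced by the decoder into nested JSON. The field tags below are hypothetical and only meant to show the mechanism, not the model's actual output schema:

```python
from transformers import DonutProcessor

processor = DonutProcessor.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")

# Hypothetical decoder output using Donut-style field tags
token_sequence = "<s_doc_type>PAN</s_doc_type><s_name>JOHN DOE</s_name>"
print(processor.token2json(token_sequence))
# Roughly: {'doc_type': 'PAN', 'name': 'JOHN DOE'}
```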

### Intended uses and limitations
This model is trained for reading the contents of Indian KYC documents. It can classify and read the contents of Aadhar, PAN, and Voter documents, detect the document's orientation, and determine whether it is coloured or black-and-white. The input document may be oriented in any direction.
The model should be provided with a fair-quality image so that the contents are readable; a simple pre-check is sketched below.
It has been trained on limited data, so performance may be modest. Future versions are planned to use more training images and to cover additional types of KYC documents.
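One possible pre-check (not part of the model) is the variance-of-Laplacian blur measure from OpenCV; the threshold below is an arbitrary assumption and should be tuned for your data:

```python
import cv2

def is_readable(image_path: str, blur_threshold: float = 100.0) -> bool:
    """Rough readability check: reject images that fail to load or look blurry."""
    img = cv2.imread(image_path)
    if img is None:
        return False
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # A low variance of the Laplacian indicates a blurry image
    return cv2.Laplacian(gray, cv2.CV_64F).var() >= blur_threshold
```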
### Training data
For v1, a custom dataset of around 283 images was used: 199 for training, 42 for validation, and 42 for testing.
The 199 training images comprised 57 Aadhar samples, 57 PAN samples, and 85 Voter samples.
### Performance
The current performance is as follows:
- Overall accuracy = 74%
- Aadhar = 49% (the cause of the lower accuracy is still being investigated)
- PAN = 94%
- Voter = 76%
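A minimal sketch of how such per-document accuracy could be computed, assuming the `JSONParseEvaluator` from the original Donut package (it scores a predicted JSON tree against the ground truth via tree edit distance); the prediction and label below are hypothetical:

```python
from donut import JSONParseEvaluator

evaluator = JSONParseEvaluator()
prediction = {"doc_type": "PAN", "name": "JOHN DOE"}      # hypothetical model output
ground_truth = {"doc_type": "PAN", "name": "JOHN M DOE"}  # hypothetical label

score = evaluator.cal_acc(prediction, ground_truth)  # value in [0, 1]
print(f"accuracy: {score:.2f}")
```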
## Usage Examples

### Basic Usage
```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
import re
import os
import glob
import cv2
import json
import torch
from tqdm.auto import tqdm
import numpy as np
from donut import JSONParseEvaluator  # optional, useful for evaluating predictions

# Load the processor and model from the Hub
processor = DonutProcessor.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")
model = VisionEncoderDecoderModel.from_pretrained("sourinkarmakar/kyc_v1-donut-demo")

# Run on GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Collect the images to process; set basepath to your data directory
basepath = "."
dataset = glob.glob(os.path.join(basepath, "unseen_samples/*"))

output_list = []
for idx, sample in tqdm(enumerate(dataset), total=len(dataset)):
    # Read the image and convert from BGR (OpenCV default) to RGB
    img = cv2.imread(sample)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    pixel_values = processor(img, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)

    # Start the decoder with the task prompt used during training
    task_prompt = "<s_cord-v2>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    decoder_input_ids = decoder_input_ids.to(device)

    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )

    # Strip special tokens and the task prompt, then convert to JSON
    seq = processor.batch_decode(outputs.sequences)[0]
    seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
    seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove the first task start token
    seq = processor.token2json(seq)
    output_list.append(seq)

print(output_list)
```
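Note: the example above assumes the `transformers`, `torch`, `opencv-python`, `tqdm`, and `numpy` packages are installed; `JSONParseEvaluator` comes from the original Donut package (published on PyPI as `donut-python`).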
## License

No license information has been provided for this model.