🚀 TrOCR (small-sized model, fine-tuned on IAM)
TrOCR is a small-sized model fine-tuned on the IAM handwriting database. It was introduced in the paper TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models by Li et al. and first released in this repository. It is designed for optical character recognition (OCR) on single text-line images.
✨ Features
- Encoder-Decoder Architecture: TrOCR is an encoder-decoder model that uses an image Transformer as the encoder and a text Transformer as the decoder. The image encoder is initialized from the weights of DeiT, and the text decoder is initialized from the weights of UniLM.
- Patch-based Image Input: Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Absolute position embeddings are added before feeding the sequence to the Transformer encoder layers. The Transformer text decoder then autoregressively generates tokens.
🚀 Quick Start
You can use the raw model for optical character recognition (OCR) on single text-line images. Check out the model hub to find fine-tuned versions for tasks that interest you.
💻 Usage Examples
Basic Usage
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

# load an example handwritten text-line image from the IAM database
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-small-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-small-handwritten')

# preprocess the image and autoregressively generate the transcription
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
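To make the pipeline above concrete without downloading any weights, the sketch below illustrates roughly what the processor does to an input image before it reaches the encoder: resize to a fixed square resolution, rescale pixel values to [0, 1], and normalize. The 384x384 size and the 0.5 mean/std are assumptions based on common ViT-style defaults, not values read from this model's config; check the processor's configuration for the exact settings.

```python
import numpy as np
from PIL import Image

def preprocess(image: Image.Image, size: int = 384) -> np.ndarray:
    """Rough stand-in for TrOCRProcessor's image preprocessing.

    Resizes to size x size, rescales to [0, 1], normalizes with an
    assumed mean/std of 0.5, and returns a CHW float array.
    """
    resized = image.convert("RGB").resize((size, size))
    arr = np.asarray(resized, dtype=np.float32) / 255.0  # rescale to [0, 1]
    arr = (arr - 0.5) / 0.5                              # normalize to [-1, 1]
    return arr.transpose(2, 0, 1)                        # HWC -> CHW

# a dummy white image stands in for a downloaded text-line scan
dummy = Image.new("RGB", (640, 48), color="white")
pixels = preprocess(dummy)
print(pixels.shape)  # (3, 384, 384)
```

The `(channels, height, width)` layout matches the `pixel_values` tensor the real processor returns (with an extra leading batch dimension).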
📚 Documentation
Model description
The TrOCR model is an encoder-decoder system: the encoder is an image Transformer and the decoder is a text Transformer. The image encoder is initialized from the weights of DeiT, and the text decoder from the weights of UniLM.
Images are first divided into a sequence of fixed-size patches (16x16 resolution). These patches are linearly embedded, and absolute position embeddings are added before the sequence is passed through the Transformer encoder layers. The Transformer text decoder then generates tokens autoregressively.
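The patch arithmetic above is easy to work out by hand. Assuming a 384x384 input resolution (a common setting for this family of models; the exact value lives in the model config) and the 16x16 patch size stated here, the encoder sees:

```python
# Patch-count arithmetic for the image encoder.
# The 384x384 input resolution is an assumption; the 16x16 patch size
# comes from the model description above.
image_size = 384
patch_size = 16

patches_per_side = image_size // patch_size  # 384 / 16 = 24
num_patches = patches_per_side ** 2          # 24 * 24 = 576

print(num_patches)  # 576 patch tokens fed to the Transformer encoder
```

Each of those 576 patches is flattened and linearly projected to the encoder's hidden size before the absolute position embeddings are added.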
Intended uses & limitations
This model is suitable for optical character recognition on single text-line images. You can search the model hub for fine-tuned versions tailored to specific tasks.
BibTeX entry and citation info
@misc{li2021trocr,
  title={TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models},
  author={Minghao Li and Tengchao Lv and Lei Cui and Yijuan Lu and Dinei Florencio and Cha Zhang and Zhoujun Li and Furu Wei},
  year={2021},
  eprint={2109.10282},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}