Arabic Large Nougat
End-to-End Structured OCR For Arabic books.
This model is an end-to-end structured Optical Character Recognition (OCR) system tailored for the Arabic language. It converts images of Arabic book pages into structured text, primarily in Markdown format.
Quick Start
Demo
You can try the model through the online demo: https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat
Local Usage
First, update the transformers library:
pip install -U transformers
Here is the code example to use the model locally:
from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

processor = NougatProcessor.from_pretrained("MohamedRashad/arabic-large-nougat")
model = VisionEncoderDecoderModel.from_pretrained(
    "MohamedRashad/arabic-large-nougat",
    torch_dtype=torch.bfloat16,
    attn_implementation={"decoder": "flash_attention_2", "encoder": "eager"},
)

context_length = model.decoder.config.max_position_embeddings
torch_dtype = model.dtype
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


def predict(img_path):
    image = Image.open(img_path)
    pixel_values = (
        processor(image, return_tensors="pt").pixel_values.to(torch_dtype).to(device)
    )
    outputs = model.generate(
        pixel_values.to(device),
        repetition_penalty=1.5,
        min_length=1,
        max_new_tokens=context_length,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )
    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return page_sequence


print(predict("path/to/page_image.jpg"))
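To digitize a whole book rather than a single page, the same call can be wrapped in a small loop. The sketch below is only an illustration: `convert_directory`, `IMAGE_SUFFIXES`, and the output layout are hypothetical names and choices, not part of the model or of transformers. It assumes the page images sit in one folder and that sorting their file names gives the reading order (for example zero-padded names such as `page_001.jpg`).

```python
from pathlib import Path
from typing import Callable, Iterable

# Assumption: the scans use one of these extensions; adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def convert_directory(
    image_dir: str,
    output_path: str,
    predict_fn: Callable[[str], str],
    suffixes: Iterable[str] = IMAGE_SUFFIXES,
) -> int:
    """Run predict_fn on every image in image_dir, in file-name order,
    and write the pages to a single Markdown file.

    Returns the number of pages processed.
    """
    allowed = {s.lower() for s in suffixes}
    pages = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    )
    # One model call per page; pages are separated by a blank line.
    texts = [predict_fn(str(p)) for p in pages]
    Path(output_path).write_text("\n\n".join(texts), encoding="utf-8")
    return len(pages)
```

With the `predict` function defined above, a call might look like `convert_directory("path/to/pages", "book.md", predict)`. Passing the prediction function as an argument keeps the loop independent of the model, so it can be tried with a stub first.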
Features
- End-to-End OCR: Specifically designed for Arabic books, it can directly convert images of book pages into structured text.
- Based on New Tokenizer: Trained from scratch using the riotu-lab/Aranizer-PBE-86k tokenizer with the base nougat architecture.
- Trained on Custom Dataset: Utilizes the MohamedRashad/arabic-img2md dataset for training.
Documentation
Bias, Risks, and Limitations
- Text Hallucination: Due to the complexity of OCR tasks, the model may occasionally generate repeated or incorrect text.
- Erroneous Image Paths: Sometimes, the model may output image paths that are not relevant to the input, indicating occasional confusion.
- Context Length Constraint: The model has a maximum context length of 2048 tokens, which may lead to incomplete transcriptions for longer book pages.
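Some of these issues can be softened with light post-processing of the generated text. The following is a minimal sketch under stated assumptions, not a method from the model's author: it assumes spurious image paths appear in standard Markdown image syntax (`![alt](path)`) and that hallucinated repetition shows up as identical consecutive lines. The function names and the `max_repeats` threshold are hypothetical.

```python
import re

# Assumption: unwanted image references use Markdown image syntax ![alt](path).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_image_refs(markdown: str) -> str:
    """Remove Markdown image references from the text."""
    return IMAGE_REF.sub("", markdown)


def collapse_repeated_lines(markdown: str, max_repeats: int = 2) -> str:
    """Keep at most max_repeats consecutive copies of an identical
    non-blank line; blank lines are always kept."""
    kept = []
    previous, run = None, 0
    for line in markdown.splitlines():
        if line.strip() and line == previous:
            run += 1
        else:
            previous, run = line, 1
        if run <= max_repeats:
            kept.append(line)
    return "\n".join(kept)
```

Both helpers are heuristics: a page that legitimately repeats a line (a refrain in a poem, for instance) would be shortened too, so the cleaned text still needs review against the original page.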
Intended Use
The arabic-large-nougat OCR is designed for tasks involving the conversion of Arabic book page images into structured text, particularly when Markdown format is preferred. It is suitable for digitizing Arabic literature and facilitating text extraction from printed materials.
Ethical Considerations
Be aware of the model's limitations, particularly where accurate OCR results are crucial. Verify and review the output whenever precision matters.
Technical Details
- Developed by: Mohamed Rashad
- Model type: VisionEncoderDecoderModel
- Language(s) (NLP): Arabic & English
- License: GPL 3.0
License
The model is licensed under GPL 3.0. If you use or build upon the arabic-large-nougat OCR, please acknowledge the model developer and the open-source community for their contributions. Additionally, be sure to include a copy of the GPL 3.0 license with any redistributed or modified versions of the model.
Citation
If you find this model useful, please cite the corresponding research paper:
@misc{rashad2024arabicnougatfinetuningvisiontransformers,
  title={Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction},
  author={Mohamed Rashad},
  year={2024},
  eprint={2411.17835},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.17835},
}
Disclaimer
The arabic-large-nougat OCR is a tool provided "as is," and the developers make no guarantees regarding its suitability for specific tasks. Users are encouraged to thoroughly evaluate the model's output for their particular use cases and requirements.
[**Github**](https://github.com/MohamedAliRashad/arabic-nougat) | [**Hugging Face**](https://huggingface.co/collections/MohamedRashad/arabic-nougat-673a3f540bd92904c9b92a8e) | [**Paper**](https://arxiv.org/abs/2411.17835) | [**Data**](https://huggingface.co/datasets/MohamedRashad/arabic-img2md) | [**Demo**](https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat)