Arabic Small Nougat
End-to-end structured OCR for Arabic books.
arabic-small-nougat is an end-to-end structured Optical Character Recognition (OCR) model tailored to Arabic. It converts images of Arabic book pages into structured text, primarily Markdown, which makes it useful for digitizing Arabic literature and extracting text from printed materials.
Quick Start
Demo
You can try the model through the online demo: https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat
Local Usage
Use the following code to start using the model locally:
from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

# Load the processor and the fine-tuned model from the Hugging Face Hub
processor = NougatProcessor.from_pretrained("MohamedRashad/arabic-small-nougat")
model = VisionEncoderDecoderModel.from_pretrained("MohamedRashad/arabic-small-nougat")

# Run on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Maximum number of tokens to generate per page
context_length = 2048


def predict(img_path):
    # Load the page image and convert it to model inputs
    image = Image.open(img_path)
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Generate the token sequence, never emitting the unknown token
    outputs = model.generate(
        pixel_values.to(device),
        min_length=1,
        max_new_tokens=context_length,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )

    # Decode the tokens to text and apply Nougat's post-processing
    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    page_sequence = processor.post_process_generation(page_sequence, fix_markdown=False)
    return page_sequence


print(predict("path/to/page_image.jpg"))
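To transcribe a whole book rather than a single page, the `predict` function above can be applied to every image in a folder. The following is a minimal sketch of one way to do that, not part of the model's API: `list_page_images`, `transcribe_folder`, and `IMAGE_SUFFIXES` are hypothetical names, and the code assumes that file names sort into page order (for example, zero-padded names such as `page_001.jpg`).

```python
from pathlib import Path
from typing import Callable, List

# Hypothetical helper names; the set of suffixes is an assumption.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_page_images(folder: str) -> List[Path]:
    """Return the image files in `folder`, sorted by file name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def transcribe_folder(folder: str, predict_fn: Callable[[str], str]) -> str:
    """Run `predict_fn` on every page image and join the results.

    `predict_fn` is any callable that maps an image path to Markdown text,
    such as the `predict` function from the quick-start code.
    """
    pages = [predict_fn(str(path)) for path in list_page_images(folder)]
    # A blank line between pages keeps adjacent Markdown blocks from merging.
    return "\n\n".join(pages)
```

With the quick-start code loaded, a call such as `transcribe_folder("path/to/pages", predict)` would return the whole book as one Markdown string. Passing the prediction function as an argument keeps the helper independent of the model, so it can be tried with a stub before downloading any weights.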
Features
- End-to-End OCR: Directly converts Arabic book page images into structured text.
- Multi-language Support: Supports both Arabic and English.
- Markdown Output: Ideal for generating structured Markdown text.
Documentation
Description
[**GitHub**](https://github.com/MohamedAliRashad/arabic-nougat) | [**Hugging Face**](https://huggingface.co/collections/MohamedRashad/arabic-nougat-673a3f540bd92904c9b92a8e) | [**Paper**](https://arxiv.org/abs/2411.17835) | [**Data**](https://huggingface.co/datasets/MohamedRashad/arabic-img2md) | [**Demo**](https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat)
arabic-small-nougat is based on the facebook/nougat-small architecture and has been fine-tuned on the Khatt dataset along with a custom dataset.
Bias, Risks, and Limitations
- Text Hallucination: The model may occasionally generate repeated or incorrect text.
- Erroneous Image Paths: It may output irrelevant image paths.
- Context Length Constraint: With a maximum context length of 2048 tokens, longer book pages may result in incomplete transcriptions.
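Some of these limitations can be partly mitigated after generation. The sketch below shows one possible post-processing pass: it removes Markdown image references, collapses runs of identical consecutive lines, and flags pages that may have been cut off. All function names are hypothetical, the heuristics are assumptions rather than anything shipped with the model, and `num_generated_tokens` is a count the caller has to supply.

```python
import re

# Matches Markdown image syntax such as ![alt](path/to/file.png).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_image_refs(markdown: str) -> str:
    """Remove Markdown image references from the text."""
    return IMAGE_REF.sub("", markdown)


def collapse_repeated_lines(markdown: str, max_repeats: int = 1) -> str:
    """Keep at most `max_repeats` consecutive copies of an identical non-blank line."""
    kept = []
    previous = None
    run = 0
    for line in markdown.splitlines():
        key = line.strip()
        if key and key == previous:
            run += 1
        else:
            run = 1
        previous = key
        # Blank lines are always kept; repeated text is capped.
        if not key or run <= max_repeats:
            kept.append(line)
    return "\n".join(kept)


def hit_token_limit(num_generated_tokens: int, max_new_tokens: int = 2048) -> bool:
    """Return True when generation used its whole budget, so the page may be incomplete."""
    return num_generated_tokens >= max_new_tokens
```

These checks catch only the simplest cases. A genuinely repeated line, such as a refrain in a poem, would also be collapsed, so the cleaned text is best compared against the page image before it is trusted.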
Intended Use
Designed for converting images of Arabic book pages into structured text, especially in Markdown format. It's suitable for digitizing Arabic literature and text extraction from printed materials.
Ethical Considerations
Be aware of the model's limitations, especially when accurate OCR results are crucial. Users should review and verify the output, particularly in high-precision scenarios.
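One common way to verify OCR output is to transcribe a few pages by hand and measure the character error rate (CER) of the model's text against them. The following is a minimal sketch of that metric using a plain Levenshtein distance. The function names are hypothetical, and the comparison is made on Unicode code points, so Arabic diacritics count as separate characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("كتاب", "كتب")` gives `0.25`, because one of the four reference characters is missing. What counts as an acceptable rate depends on the use case and is not specified by the model card.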
Model Details
| Property | Details |
| --- | --- |
| Developed by | Mohamed Rashad |
| Model Type | VisionEncoderDecoderModel |
| Language(s) (NLP) | Arabic & English |
| License | GPL 3.0 |
| Finetuned from model | nougat-small |
Acknowledgment
If you use or build upon the Arabic Small Nougat OCR, please acknowledge the model developer and the open-source community. Also, include a copy of the GPL 3.0 license with any redistributed or modified versions of the model.
Citation
If you find this model useful, please consider citing the original facebook/nougat-small model and the datasets used for fine-tuning, including the Khatt dataset and any details regarding the custom dataset.
@misc{rashad2024arabicnougatfinetuningvisiontransformers,
  title={Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction},
  author={Mohamed Rashad},
  year={2024},
  eprint={2411.17835},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.17835},
}

@misc{mohamed_rashad_2024,
  author = { {Mohamed Rashad} },
  title = { arabic-small-nougat (Revision 48741d4) },
  year = 2024,
  url = { https://huggingface.co/MohamedRashad/arabic-small-nougat },
  doi = { 10.57967/hf/3534 },
  publisher = { Hugging Face }
}
Disclaimer
The arabic-small-nougat OCR is provided "as is," and the developers make no guarantees regarding its suitability for specific tasks. Users are encouraged to thoroughly evaluate the model's output for their particular use cases and requirements.
License
This model is licensed under GPL 3.0.