Arabic Base Nougat Free OCR System - End-to-End Document Recognition Designed Specifically for Arabic

Arabic Base Nougat

Developed by MohamedRashad

An end-to-end structured optical character recognition (OCR) system designed specifically for Arabic, fine-tuned from facebook/nougat-base.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: GPL-3.0

#Arabic OCR #Book Digitization #End-to-End Structuring

Downloads 130

Release Time: 10/13/2024

Model Overview

This model is an end-to-end structured OCR system for Arabic books, capable of converting Arabic book page images into structured text, particularly suitable for scenarios requiring Markdown format.

Model Features

Arabic OCR Optimization

Specially optimized for Arabic text, capable of accurately recognizing complex layouts and characters in Arabic book pages

Structured Output

Supports generating structured text output in Markdown format, preserving the original document's formatting information

End-to-End Processing

Converts a page image directly into text, with no intermediate steps

Model Capabilities

Arabic Text Recognition

English Text Recognition

Book Page Image Processing

Markdown Format Generation

Use Cases

Literature Digitization

Digitization of Ancient Arabic Texts

Converting printed ancient Arabic texts into editable digital text

Structured text preserving original layout and formatting

Education

Textbook Content Extraction

Extracting teaching content from scanned Arabic textbooks

Editable textbook text, facilitating the creation of e-textbooks

🚀 Arabic Base Nougat

An end-to-end structured Optical Character Recognition (OCR) system tailored for Arabic books.

🚀 Quick Start

You can try the model through the demo.

Or, use the code below to get started with the model locally.

⚠️ Important Note

Don't forget to update transformers using the command pip install -U transformers.

from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

# Load the model and processor
processor = NougatProcessor.from_pretrained("MohamedRashad/arabic-base-nougat")
model = VisionEncoderDecoderModel.from_pretrained("MohamedRashad/arabic-base-nougat", torch_dtype=torch.bfloat16, attn_implementation={"decoder": "flash_attention_2", "encoder": "eager"})

# Get the max context length of the model & dtype of the weights
context_length = model.decoder.config.max_position_embeddings
torch_dtype = model.dtype

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def predict(img_path):
    # load the page image and force three channels (RGB)
    image = Image.open(img_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(torch_dtype).to(device)

    # generate transcription
    outputs = model.generate(
        pixel_values,
        repetition_penalty=1.5,
        min_length=1,
        max_new_tokens=context_length,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )

    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    page_sequence = processor.post_process_generation(page_sequence, fix_markdown=False)
    return page_sequence

print(predict("path/to/page_image.jpg"))
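
The loading call above requests FlashAttention 2 for the decoder. As far as we know, that setting needs a CUDA GPU and the separately installed flash-attn package, so the call can fail on a CPU-only machine. The sketch below picks the attention setting at run time. The helper name pick_attn_implementation is hypothetical (it is not part of transformers or of this model card), and falling back to "eager" is an assumption about what works on your setup.

```python
import importlib.util


def pick_attn_implementation(cuda_available, flash_attn_installed=None):
    """Return a value for the `attn_implementation` argument of `from_pretrained`.

    Hypothetical helper: it keeps the FlashAttention 2 decoder setting from the
    quick-start code only when a CUDA device and the `flash_attn` package are
    both present, and otherwise falls back to plain "eager" attention.
    """
    if flash_attn_installed is None:
        # Probe for the package without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and flash_attn_installed:
        return {"decoder": "flash_attention_2", "encoder": "eager"}
    return "eager"


# Example: a CPU-only machine gets the eager fallback.
print(pick_attn_implementation(cuda_available=False))  # prints: eager
```

To use it, pass attn_implementation=pick_attn_implementation(torch.cuda.is_available()) to VisionEncoderDecoderModel.from_pretrained in place of the hard-coded dictionary.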

✨ Features

The arabic-base-nougat OCR is an end-to-end structured Optical Character Recognition (OCR) system designed specifically for the Arabic language. The model is based on the facebook/nougat-base architecture and has been fine-tuned on the MohamedRashad/arabic-img2md dataset.

📚 Documentation

Bias, Risks, and Limitations

Text Hallucination: The model may occasionally generate repeated or incorrect text due to the inherent complexities of OCR tasks.
Erroneous Image Paths: There are instances where the model outputs image paths that are not relevant to the input, indicating occasional confusion.
Context Length Constraint: The model has a maximum context length of 2048 tokens, which may result in incomplete transcriptions for longer book pages.
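
The three limitations above can be partly mitigated after generation. The following is a minimal sketch, not part of the model or its processor: all three helper names are hypothetical, the regular expression assumes standard Markdown image syntax, and the default limit of 2048 is taken from the context length stated above. Stripping every image reference is a blunt choice; it assumes that no path in the output can be resolved anyway, since the input is a single page image.

```python
import re

# Standard Markdown image syntax: ![alt text](path)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_image_refs(markdown):
    """Remove Markdown image references, which the model may emit spuriously."""
    return IMAGE_REF.sub("", markdown)


def collapse_repeated_lines(markdown, max_repeats=2):
    """Keep at most `max_repeats` consecutive copies of an identical non-blank line."""
    kept = []
    run = 0
    previous = None
    for line in markdown.splitlines():
        if line.strip() and line == previous:
            run += 1
        else:
            run = 1
        previous = line
        if run <= max_repeats:
            kept.append(line)
    return "\n".join(kept)


def hit_token_limit(token_ids, eos_token_id, limit=2048):
    """Heuristic: the output reached the limit without an end-of-sequence token,
    so the transcription of the page is probably incomplete."""
    return len(token_ids) >= limit and eos_token_id not in token_ids
```

With the quick-start code, token_ids would be outputs[0].tolist() and eos_token_id would be processor.tokenizer.eos_token_id. Pages flagged by hit_token_limit are candidates for manual review or for splitting the scan into smaller regions.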

Intended Use

The arabic-base-nougat OCR is designed for tasks that involve converting images of Arabic book pages into structured text, especially when Markdown format is desired. It is suitable for applications in the field of digitizing Arabic literature and facilitating text extraction from printed materials.
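
For book digitization, the per-page function from the quick start can be applied to a folder of scans and the results joined into one Markdown document. This is only a sketch: transcribe_book is a hypothetical helper, the page separator is an arbitrary choice, and it assumes zero-padded file names (page_001.jpg, page_002.jpg, ...) so that sorted order matches page order.

```python
from pathlib import Path


def transcribe_book(image_dir, predict_fn, pattern="*.jpg", separator="\n\n---\n\n"):
    """Run `predict_fn` on every page image in `image_dir` and join the results.

    `predict_fn` takes an image path (str) and returns the page text, like the
    `predict` function in the quick-start code. Pages are processed in sorted
    file-name order.
    """
    pages = sorted(Path(image_dir).glob(pattern))
    texts = [predict_fn(str(page)) for page in pages]
    return separator.join(texts)
```

A call such as Path("book.md").write_text(transcribe_book("scans", predict), encoding="utf-8") would then write the whole book to a single file, where "scans" is a placeholder directory name.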

Ethical Considerations

It is crucial to be aware of the model's limitations, particularly in instances where accurate OCR results are critical. Users are advised to verify and review the output, especially in scenarios where precision is paramount.
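
One common way to review OCR output is to transcribe a few pages by hand and measure the character error rate (CER) of the model against them. The sketch below is a generic Levenshtein-based CER, not a metric or threshold taken from this model card. It assumes both strings have been normalized the same way beforehand (for example, diacritics and Unicode forms), which matters for Arabic text.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the hand-checked reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One missing letter out of four reference characters.
print(character_error_rate("كتاب", "كتب"))  # prints: 0.25
```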

Model Details

Property	Details
Developed by	Mohamed Rashad
Model Type	VisionEncoderDecoderModel
Language(s) (NLP)	Arabic & English
License	GPL 3.0
Finetuned from model	nougat-base

Acknowledgment

If you use or build upon the arabic-base-nougat OCR, please acknowledge the model developer and the open-source community for their contributions. Additionally, be sure to include a copy of the GPL 3.0 license with any redistributed or modified versions of the model. By selecting the GPL 3.0 license, you promote the principles of open source and ensure that the benefits of the model are shared with the broader community.

Citation

If you find this model useful, please cite the corresponding research paper:

@misc{rashad2024arabicnougatfinetuningvisiontransformers,
      title={Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction}, 
      author={Mohamed Rashad},
      year={2024},
      eprint={2411.17835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.17835}, 
}

Disclaimer

The arabic-base-nougat OCR is a tool provided "as is," and the developers make no guarantees regarding its suitability for specific tasks. Users are encouraged to thoroughly evaluate the model's output for their particular use cases and requirements.

[**GitHub**](https://github.com/MohamedAliRashad/arabic-nougat) 🤗 [**Hugging Face**](https://huggingface.co/collections/MohamedRashad/arabic-nougat-673a3f540bd92904c9b92a8e) 📝 [**Paper**](https://arxiv.org/abs/2411.17835) 🗂️ [**Data**](https://huggingface.co/datasets/MohamedRashad/arabic-img2md) 📽️ [**Demo**](https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat)
