Free and open source! Arabic-large-nougat converts Arabic book images into structured text

Arabic Large Nougat

Developed by MohamedRashad

An end-to-end structured optical character recognition system specifically designed for Arabic, converting book page images into structured text (Markdown format)

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: GPL-3.0

#Arabic OCR #Book Digitization #End-to-End Recognition

Downloads 537

Release Time: 10/18/2024

Model Overview

This model is built on the base Nougat architecture and trained from scratch with a new tokenizer. It is suited to tasks such as digitizing Arabic literature and extracting text from printed material.

Model Features

Arabic-Specific OCR

Optical character recognition system optimized specifically for Arabic text

Structured Output

Capable of generating structured text output in Markdown format

End-to-End Solution

Complete processing pipeline directly from image to text, with no intermediate steps required

Book Processing Optimization

Particularly suitable for processing Arabic book pages

Model Capabilities

Arabic Text Recognition

English Text Recognition

Book Page Processing

Markdown Format Generation

Use Cases

Literature Digitization

Digitization of Ancient Arabic Texts

Converting printed ancient Arabic texts into searchable digital text

Preserves the original text structure and formatting

Education

Textbook Content Extraction

Extracting text content from Arabic textbooks for e-learning purposes

Structured output facilitates further processing
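As a rough illustration of what "further processing" could look like, the sketch below splits Markdown text into sections by heading. It assumes the output uses `#`-style headings, which the source does not spell out, and `split_markdown_sections` is a hypothetical helper name, not part of the model or of transformers.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "## Section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_markdown_sections(markdown_text):
    """Split Markdown text into (level, title, body) tuples, one per heading.

    Text before the first heading is returned with level 0 and an empty title.
    """
    sections = []
    level, title, body = 0, "", []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title or any(part.strip() for part in body):
                sections.append((level, title, "\n".join(body).strip()))
            level, title, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    # Flush the final section.
    if title or any(part.strip() for part in body):
        sections.append((level, title, "\n".join(body).strip()))
    return sections
```

Each tuple can then be indexed, searched, or exported separately, for example one record per textbook chapter.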

🚀 Arabic Large Nougat

End-to-End Structured OCR For Arabic books.

This model is an end-to-end structured Optical Character Recognition (OCR) system tailored to the Arabic language. It converts images of Arabic book pages into structured text, primarily in Markdown format.

🚀 Quick Start

Demo

You can try the model through the online demo: https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat

Local Usage

First, make sure to update the transformers library:

pip install -U transformers

The following example loads the model locally and transcribes a single page image:

from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

# Load the model and processor
processor = NougatProcessor.from_pretrained("MohamedRashad/arabic-large-nougat")
model = VisionEncoderDecoderModel.from_pretrained(
    "MohamedRashad/arabic-large-nougat",
    torch_dtype=torch.bfloat16,
    attn_implementation={"decoder": "flash_attention_2", "encoder": "eager"},
)

# Get the max context length of the model & dtype of the weights
context_length = model.decoder.config.max_position_embeddings
torch_dtype = model.dtype

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


def predict(img_path):
    # Load the page image; convert to RGB so grayscale or RGBA scans also work
    image = Image.open(img_path).convert("RGB")
    pixel_values = (
        processor(image, return_tensors="pt").pixel_values.to(torch_dtype).to(device)
    )

    # Generate the transcription (pixel_values is already on the target device)
    outputs = model.generate(
        pixel_values,
        repetition_penalty=1.5,
        min_length=1,
        max_new_tokens=context_length,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )

    # Decode the token IDs to text, dropping special tokens
    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return page_sequence


print(predict("path/to/page_image.jpg"))
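The example above requests FlashAttention 2 for the decoder, which needs the separate flash-attn package and a CUDA GPU, so loading is likely to fail on a CPU-only machine. A minimal sketch of a fallback follows; `choose_attn_implementation` is a hypothetical helper name, and falling back to "eager" for both parts is an assumption made here for safety rather than a recommendation from the model author.

```python
import importlib.util


def choose_attn_implementation(cuda_available, flash_attn_installed=None):
    """Return an attn_implementation mapping for the encoder and decoder.

    FlashAttention 2 is only requested when a CUDA device is present and the
    flash_attn package can be found; otherwise both parts use "eager".
    """
    if flash_attn_installed is None:
        # find_spec returns None when the package is not installed.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    use_flash = cuda_available and flash_attn_installed
    decoder = "flash_attention_2" if use_flash else "eager"
    return {"decoder": decoder, "encoder": "eager"}
```

The result can be passed as `attn_implementation=choose_attn_implementation(torch.cuda.is_available())` in the `from_pretrained` call.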

✨ Features

End-to-End OCR: Designed specifically for Arabic books, it converts images of book pages directly into structured text.
New Tokenizer: Trained from scratch with the riotu-lab/Aranizer-PBE-86k tokenizer on the base Nougat architecture.
Custom Dataset: Trained on the MohamedRashad/arabic-img2md dataset.

📚 Documentation

Bias, Risks, and Limitations

Text Hallucination: Due to the complexity of OCR tasks, the model may occasionally generate repeated or incorrect text.
Erroneous Image Paths: Sometimes, the model may output image paths that are not relevant to the input, indicating occasional confusion.
Context Length Constraint: The model has a maximum context length of 2048 tokens, which may lead to incomplete transcriptions for longer book pages.
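The three limitations above can be partly mitigated after generation. The sketch below is one possible post-processing pass, not something the model card prescribes: it removes Markdown image references, collapses consecutive duplicate lines, and flags a page whose token sequence never reached an end-of-sequence token (a sign that the 2048-token limit may have cut it off). All function names are hypothetical, and the `![alt](path)` pattern is an assumption about how image paths appear in the output.

```python
import re

# Markdown image reference of the form ![alt](path).
IMAGE_REF_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_image_refs(text):
    """Remove Markdown image references from the transcription."""
    return IMAGE_REF_RE.sub("", text)


def collapse_repeated_lines(text, max_repeats=1):
    """Keep at most `max_repeats` consecutive copies of any non-empty line."""
    kept, previous, run = [], None, 0
    for line in text.splitlines():
        key = line.strip()
        if key and key == previous:
            run += 1
        else:
            previous, run = key, 1
        # Blank lines are always kept; repeated lines only up to the limit.
        if not key or run <= max_repeats:
            kept.append(line)
    return "\n".join(kept)


def ended_without_eos(token_ids, eos_token_id):
    """Return True when the generated IDs contain no end-of-sequence token."""
    return eos_token_id not in token_ids
```

With the `predict` example, the last check would be applied to the raw IDs, for instance `ended_without_eos(outputs[0].tolist(), processor.tokenizer.eos_token_id)`, because decoding with `skip_special_tokens=True` hides the end-of-sequence token.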

Intended Use

The arabic-large-nougat OCR is designed for tasks involving the conversion of Arabic book page images into structured text, particularly when Markdown format is preferred. It is suitable for digitizing Arabic literature and facilitating text extraction from printed materials.
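Digitizing a book means running the model over many page images in order. A minimal sketch of that loop is shown below, assuming one image file per page with a page number in the file name. `transcribe_pages` and `natural_key` are hypothetical helpers, `predict_fn` stands for any function that maps an image path to text (such as `predict` from the Quick Start), and the `---` page separator is an arbitrary choice.

```python
from pathlib import Path
import re


def natural_key(path):
    """Sort key that orders page_2 before page_10 by comparing digit runs as ints."""
    parts = re.split(r"(\d+)", Path(path).name)
    return [int(part) if part.isdecimal() else part for part in parts]


def transcribe_pages(image_paths, predict_fn, separator="\n\n---\n\n"):
    """Run predict_fn on each page image in natural order and join the results."""
    ordered = sorted(image_paths, key=natural_key)
    return separator.join(predict_fn(str(path)) for path in ordered)
```

For example, `transcribe_pages(Path("scans").glob("*.png"), predict)` would return a single Markdown string for the whole book, where `scans` is a placeholder directory name.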

Ethical Considerations

Be aware of the model's limitations, particularly where accurate OCR results are crucial. Users are advised to review and verify the output wherever precision matters.
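One common way to make such verification measurable is to transcribe a few pages by hand and compute the character error rate (CER) of the model output against them. The sketch below is a plain Levenshtein-based CER and is offered as an illustration only; the source does not prescribe a metric, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length (0.0 means identical)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Note that the comparison works on Unicode code points, so Arabic diacritics count as separate characters; normalizing both strings the same way before comparing keeps the score meaningful.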

🔧 Technical Details

Developed by: Mohamed Rashad
Model type: VisionEncoderDecoderModel
Language(s) (NLP): Arabic & English
License: GPL 3.0

📄 License

The model is licensed under GPL 3.0. If you use or build upon the arabic-large-nougat OCR, please acknowledge the model developer and the open-source community for their contributions. Additionally, be sure to include a copy of the GPL 3.0 license with any redistributed or modified versions of the model.

Citation

If you find this model useful, please cite the corresponding research paper:

@misc{rashad2024arabicnougatfinetuningvisiontransformers,
      title={Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction}, 
      author={Mohamed Rashad},
      year={2024},
      eprint={2411.17835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.17835}, 
}

Disclaimer

The arabic-large-nougat OCR is a tool provided "as is," and the developers make no guarantees regarding its suitability for specific tasks. Users are encouraged to thoroughly evaluate the model's output for their particular use cases and requirements.

[**Github**](https://github.com/MohamedAliRashad/arabic-nougat) 🤗 [**Hugging Face**](https://huggingface.co/collections/MohamedRashad/arabic-nougat-673a3f540bd92904c9b92a8e) 📝 [**Paper**](https://arxiv.org/abs/2411.17835) 🗂️ [**Data**](https://huggingface.co/datasets/MohamedRashad/arabic-img2md) 📽️ [**Demo**](https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat)
