🚀 QARI-OCR v0.3: Structural Arabic Document Understanding
A specialized vision-language model for Arabic Optical Character Recognition (OCR) with a focus on structural document understanding.
🚀 Quick Start
You can try Qari on Google Colab: Notebook
Load this model using the transformers and qwen_vl_utils libraries:
```bash
!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes
```
✨ Features
- 📐 Layout-Aware Recognition: Preserves document structure with HTML/Markdown tags (see the sketch after this list)
- 🔤 Full Diacritics Support: Accurate recognition of tashkeel (Arabic diacritical marks)
- 📝 Multi-Font Handling: Trained on 12 diverse Arabic fonts (14–100 px)
- 🎯 Structure-First Design: Optimized for documents with headers, body text, and complex layouts
- ⚡ Efficient Training: Only 11 hours on a single GPU with 10k samples
- 🖼️ Robust Performance: Handles low-resolution and degraded images
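As a rough illustration of what layout-aware output looks like, the snippet below shows the kind of tagged, diacritized text the model is trained to emit. This is a fabricated example for illustration only, not a real model output:

```python
# Hypothetical illustration only -- not an actual Qari output.
# A page with a heading and a fully diacritized paragraph might come back as:
sample_output = (
    "<h1>العنوان الرئيسي</h1>\n"
    "<p>هَذَا نَصٌّ عَرَبِيٌّ مُشَكَّلٌ بِالكَامِلِ.</p>"
)
```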
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct"

# Load the model and processor; device_map="auto" places weights on the GPU if available.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

max_tokens = 2000
prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

src = "image.png"  # path to the document image you want to transcribe

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Build the chat prompt and extract the image inputs referenced in the messages.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
```
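4-bit Loading (Optional)
The install step above pulls in bitsandbytes, so the model can also be loaded with 4-bit quantization to reduce memory use on smaller GPUs. This is a minimal sketch; the quantization settings below are common defaults, not values published for this model:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# Assumed quantization settings -- reasonable defaults, not from the model card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```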
📚 Documentation
Model Performance
| Property | Details |
|----------|---------|
| Character Error Rate (CER) | 0.300 |
| Word Error Rate (WER) | 0.485 |
| BLEU Score | 0.545 |
| Training Time | 11 hours |
| CO₂ Emissions | 1.88 kg eq. |
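For reference, CER is the character-level edit (Levenshtein) distance between hypothesis and reference, divided by the reference length; WER is the same computation at word level. A minimal sketch of both metrics, assuming no text normalization (in practice a library such as jiwer is often used instead):

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if characters match)
            )
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())
```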
Comparative Strengths
While QARI v0.2 achieves better raw text accuracy (CER: 0.061), QARI v0.3 excels in:
- ✅ HTML/Markdown structure preservation
- ✅ Document layout understanding
- ✅ Handwritten text recognition (initial capabilities)
- ✅ 5x faster training than v0.2
Training Details
- Base Model: Qwen2-VL-2B-Instruct
- Training Data: 10,000 synthetic Arabic documents with HTML markup
- Optimization: 4-bit LoRA adapters (rank = 16)
- Hardware: Single NVIDIA A6000 GPU (48GB)
- Framework: Unsloth + Hugging Face TRL
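For orientation, a comparable 4-bit LoRA setup could be declared with peft roughly as follows. The rank of 16 comes from the card above; the alpha, dropout, and target modules are assumptions for illustration, not the published training configuration:

```python
from peft import LoraConfig

# r=16 matches the card; remaining hyperparameters are assumed defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```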
BibTeX:
```bibtex
@article{wasfy2025qari,
  title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
  author={Wasfy, Ahmed and Nacar, Omer and Elkhateb, Abdelakreem and Reda, Mahmoud and Elshehy, Omar and Ammar, Adel and Boulila, Wadii},
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
}
```
📄 License
This model is licensed under the Apache 2.0 license.