🚀 QARI-OCR v0.3: Structural Arabic Document Understanding
A specialized vision-language model for Arabic Optical Character Recognition (OCR) with a focus on structural document understanding.
🚀 Quick Start
You can try Qari on Google Colab: Notebook
Load this model using the transformers and qwen_vl_utils libraries:
```bash
!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes
```
✨ Features
- 📐 Layout-Aware Recognition: Preserves document structure with HTML/Markdown tags (see the sketch after this list)
- 🔤 Full Diacritics Support: Accurate recognition of tashkeel (Arabic diacritical marks)
- 📝 Multi-Font Handling: Trained on 12 diverse Arabic fonts (14–100 px)
- 🎯 Structure-First Design: Optimized for documents with headers, body text, and complex layouts
- ⚡ Efficient Training: Only 11 hours on a single GPU with 10k samples
- 🖼️ Robust Performance: Handles low-resolution and degraded images
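As a rough illustration of what layout-aware output looks like, the snippet below shows the kind of tagged, diacritized text the model is trained to emit. This is a fabricated example for illustration only, not a real model output:

```python
# Hypothetical illustration only -- not an actual Qari output.
# A page with a heading and a fully diacritized paragraph might come back as:
sample_output = (
    "<h1>العنوان الرئيسي</h1>\n"
    "<p>هَذَا نَصٌّ عَرَبِيٌّ مُشَكَّلٌ بِالكَامِلِ.</p>"
)
```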
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct"

# Load the model and processor; device_map="auto" places weights on the GPU if available.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

max_tokens = 2000
prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

src = "image.png"  # path to the document image you want to transcribe

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Build the chat prompt and extract the image inputs referenced in the messages.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
```
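4-bit Loading (Optional)
The install step above pulls in bitsandbytes, so the model can also be loaded with 4-bit quantization to reduce memory use on smaller GPUs. This is a minimal sketch; the quantization settings below are common defaults, not values published for this model:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2VLForConditionalGeneration

# Assumed quantization settings -- reasonable defaults, not from the model card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "NAMAA-Space/Qari-OCR-v0.3-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```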
📚 Documentation
Model Performance
| Property | Details |
|----------|---------|
| Character Error Rate (CER) | 0.300 |
| Word Error Rate (WER) | 0.485 |
| BLEU Score | 0.545 |
| Training Time | 11 hours |
| CO₂ Emissions | 1.88 kg eq. |
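For reference, CER is the character-level edit (Levenshtein) distance between hypothesis and reference, divided by the reference length; WER is the same computation at word level. A minimal sketch of both metrics, assuming no text normalization (in practice a library such as jiwer is often used instead):

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if characters match)
            )
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())
```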
Comparative Strengths
While QARI v0.2 achieves better raw text accuracy (CER: 0.061), QARI v0.3 excels in:
- ✅ HTML/Markdown structure preservation
- ✅ Document layout understanding
- ✅ Handwritten text recognition (initial capabilities)
- ✅ 5x faster training than v0.2
Training Details
- Base Model: Qwen2-VL-2B-Instruct
- Training Data: 10,000 synthetic Arabic documents with HTML markup
- Optimization: 4-bit LoRA adapters (rank = 16)
- Hardware: Single NVIDIA A6000 GPU (48GB)
- Framework: Unsloth + Hugging Face TRL
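For orientation, a comparable 4-bit LoRA setup could be declared with peft roughly as follows. The rank of 16 comes from the card above; the alpha, dropout, and target modules are assumptions for illustration, not the published training configuration:

```python
from peft import LoraConfig

# r=16 matches the card; remaining hyperparameters are assumed defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```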
BibTeX:
```bibtex
@article{wasfy2025qari,
  title={QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation},
  author={Wasfy, Ahmed and Nacar, Omer and Elkhateb, Abdelakreem and Reda, Mahmoud and Elshehy, Omar and Ammar, Adel and Boulila, Wadii},
  journal={arXiv preprint arXiv:2506.02295},
  year={2025}
}
```
📄 License
This model is licensed under the Apache 2.0 license.