Qari-OCR-0.1-VL-2B-Instruct Open-Source Model: Free to Deploy, Accurate Full-Page Arabic Text Recognition


Qari OCR 0.1 VL 2B Instruct

Developed by NAMAA-Space

An Arabic OCR model fine-tuned from Qwen2 VL and optimized for full-page Arabic text recognition.

Text recognition

Transformers

Arabic

License: Apache-2.0 #Arabic OCR #Full-page text recognition #High-accuracy character extraction

Downloads: 2,965

Released: 2/28/2025

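Model Overview

A vision-language model optimized for full-page Arabic optical character recognition (OCR). Fine-tuning on an Arabic OCR dataset substantially improved recognition accuracy (WER 0.068, versus 1.344 for the base Qwen2 VL 2B model).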

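Model Features

High-accuracy Arabic OCR

Recognition tuned for full-page Arabic text, with a WER of 0.068 and a CER of 0.019.

Full-page text processing

Trained specifically on full-page Arabic text, so it handles complete pages of content.

Quantization

Uses 4-bit quantization to reduce resource usage while preserving performance.

Font-specific optimization

Tuned for common Arabic fonts such as Almarai, Amiri, and Cairo.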

Model Capabilities

Printed Arabic text recognition

Full-page text extraction

High-accuracy character recognition

Multi-font support

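Use Cases

Document digitization

Digitizing classical Arabic texts

Converts printed classical Arabic books into editable text. Handwritten manuscripts fall outside what the model was trained on.

Character accuracy of 98.1% (CER 0.019)

Business document processing

Processes Arabic contracts, invoices, and other business documents.

Word error rate 84% lower than Tesseract OCR

Educational applications

Textbook digitization

Converts Arabic textbooks and academic papers into digital text.

BLEU score of 0.860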

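🚀 Qari-OCR-0.1-VL-2B-Instruct Model

This model is a fine-tuned version of unsloth/Qwen2-VL-2B-Instruct, trained on an Arabic OCR dataset. It is optimized for high-accuracy Arabic optical character recognition (OCR) on full-page text.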


🚀 Quick Start

You can load the model with the transformers and qwen_vl_utils libraries:

!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes

import os

from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor from the Hugging Face Hub.
model_name = "NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

# "page.png" is a placeholder path: point it at the page you want to transcribe.
image = Image.open("page.png")
# Save a temporary copy for the file:// reference in the message below.
src = "image.png"
image.save(src)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]
# Render the chat template and collect the image inputs it refers to.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to whichever device the model was placed on.
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
# Drop the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
os.remove(src)  # delete the temporary copy
print(output_text)
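
The install step pulls in bitsandbytes and the feature summary mentions 4-bit quantization, yet the loading call above does not request it explicitly. The following is a minimal sketch of one way to load the weights in 4-bit, assuming the standard transformers `BitsAndBytesConfig` path and a CUDA GPU. The helper name `load_qari_4bit` and the NF4/float16 settings are illustrative assumptions, not settings confirmed by the model authors.

```python
def load_qari_4bit(model_name="NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct"):
    """Hypothetical helper: load the model with 4-bit weights (needs a CUDA GPU)."""
    # Heavy imports are deferred so that defining the helper stays cheap.
    import torch
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        Qwen2VLForConditionalGeneration,
    )

    # NF4 with float16 compute is a common default; assumed here, not from the source.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_name)
    return model, processor
```

Calling `model, processor = load_qari_4bit()` would stand in for the two `from_pretrained` calls in the quick start. Whether accuracy matches the reported metrics under this setting is not stated in the source.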

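✨ Key Features

Fine-tuned from Qwen2 VL and trained on an Arabic OCR dataset.
Extracts full-page Arabic text with high accuracy.
Evaluated on standard OCR metrics, with strong WER, CER, and BLEU results.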


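📚 Documentation

Model Details

Property	Details
Base model	Qwen2 VL (unsloth/Qwen2-VL-2B-Instruct)
Fine-tuning dataset	Arabic OCR dataset
Objective	High-accuracy extraction of full-page Arabic text
Supported language	Arabic
Task	Optical character recognition (OCR)
Dataset size	5,000 records
Training epochs	1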

Performance Evaluation

The model was evaluated on standard OCR metrics: word error rate (WER), character error rate (CER), and BLEU score.

Metrics

Model	WER ↓	CER ↓	BLEU ↑
Qari v0.1	0.068	0.019	0.860
Qwen2 VL 2B	1.344	1.191	0.201
EasyOCR	0.908	0.617	0.152
Tesseract OCR	0.428	0.226	0.410

Key Results

Word error rate (WER): 0.068 (word accuracy 93.2%)
Character error rate (CER): 0.019 (character accuracy 98.1%)
BLEU score: 0.860
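
For reference, WER and CER are both edit-distance ratios: the number of insertions, deletions, and substitutions needed to turn the model output into the reference, divided by the reference length in words or characters. Because insertions count, the ratio can exceed 1, as it does for the base model in the table. The sketch below implements that textbook definition in plain Python. The function names are illustrative, and the authors' actual evaluation script, tokenization, and text normalization are not given in the source.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("the quick brown fox", "the quick brwn fox"))  # one wrong word out of four
print(cer("the quick brown fox", "the quick brwn fox"))  # one missing character out of 19
```

Under this definition, the "accuracy" figures quoted above are simply 1 minus the error rate, which is only meaningful when the error rate is below 1.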

Performance Comparison

WER is 95% lower than the base model's.
CER is 98% lower than the base model's.
BLEU is 328% higher than the base model's.
WER is 84% lower than Tesseract OCR's.
WER is 92% lower than EasyOCR's.
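
The percentages above follow from the metrics table as relative changes. A short sketch that recomputes them from the tabulated values is shown here; the dictionary layout and helper name are illustrative. From these rounded figures the EasyOCR reduction comes out near 92.5%, consistent with the 92% quoted.

```python
# (WER, CER, BLEU) values copied from the metrics table.
METRICS = {
    "Qari v0.1": (0.068, 0.019, 0.860),
    "Qwen2 VL 2B": (1.344, 1.191, 0.201),
    "EasyOCR": (0.908, 0.617, 0.152),
    "Tesseract OCR": (0.428, 0.226, 0.410),
}


def relative_change(new, old):
    """Signed change of `new` relative to `old`, in percent."""
    return (new - old) / old * 100.0


qari_wer, qari_cer, qari_bleu = METRICS["Qari v0.1"]
for name, (wer_value, cer_value, bleu_value) in METRICS.items():
    if name == "Qari v0.1":
        continue
    print(
        f"vs {name}: "
        f"WER {relative_change(qari_wer, wer_value):+.1f}%, "
        f"CER {relative_change(qari_cer, cer_value):+.1f}%, "
        f"BLEU {relative_change(qari_bleu, bleu_value):+.1f}%"
    )
```

Negative values mean the Qari figure is lower than the baseline, which is the desired direction for WER and CER and the opposite for BLEU.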

Performance Comparison Charts

[Figure: word error rate (WER) and character error rate (CER) comparison]

[Figure: BLEU score comparison]

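Limitations

Although this Arabic OCR model performs well under specific conditions, it has several limitations:

Font dependence: The model was trained on a limited set of fonts (Almarai-Regular, Amiri-Regular, Cairo-Regular, Tajawal-Regular, and NotoNaskhArabic-Regular). Accuracy may drop on text in other fonts, especially decorative or stylized ones.
Font size: Training used a fixed font size of 16. Text that is much smaller or much larger may be recognized less accurately.
No diacritics: The model does not support Arabic diacritics (Tashkeel). Text that relies on diacritics for disambiguation may not be recognized correctly.
No handwriting: The model was not trained on handwritten text and is suitable for printed documents only.
Full-page processing: The model was trained on full-page text, which may hurt its performance on segmented text, cropped regions, or complex layouts such as tables and multi-column formats.

Account for these limitations when deploying the model in real applications.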
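
One practical consequence of the missing diacritics support is that vocalized ground-truth text will not match the model's output character for character. The sketch below strips Tashkeel before text is compared or scored, assuming the marks of interest are the Unicode range U+064B to U+0652 plus the superscript alef U+0670. The function name is illustrative, and this preprocessing step is not described in the source.

```python
import re

# Fathatan through sukun (U+064B to U+0652) plus superscript alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text):
    """Remove Arabic short-vowel and related diacritic marks from text."""
    return TASHKEEL.sub("", text)


# The word "marhaban" written with its diacritics.
vocalized = "\u0645\u064E\u0631\u0652\u062D\u064E\u0628\u064B\u0627"
print(strip_tashkeel(vocalized))
```

Applying the same function to both the reference and the model output keeps the comparison symmetric, at the cost of ignoring any distinctions that only the diacritics carry.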

📄 License

The model follows the license terms of the original Qwen2 VL model. Review those terms carefully before commercial use.

Citation

If you use this model in your research, please cite:

@misc{QariOCR2025,
  title={Qari-OCR: A High-Accuracy Model for Arabic Optical Character Recognition},
  author={NAMAA},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct}},
  note={Accessed: 2025-03-03}
}