Qaari 0.1 Urdu open-source OCR model: free, high-accuracy recognition of Urdu text

Qaari 0.1 Urdu OCR VL 2B Instruct

Developed by oddadmix

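Qaari 0.1 Urdu is a model optimized specifically for optical character recognition (OCR) of Urdu text. It is fine-tuned from Qwen/Qwen2-VL-2B and markedly improves on the base model's Urdu OCR capability.

Text recognition #Urdu OCR #High-accuracy text recognition #Nastaleeq script optimization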

Downloads 257

Release Time : 3/10/2025

Model Overview

This model specializes in optical character recognition of Urdu text. It is highly accurate and substantially outperforms both the base model and traditional OCR solutions.

Model Features

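Designed for Urdu OCR

Optimized for recognizing Urdu script with high accuracy.

Strong performance

Word error rate (WER) is 97.35% lower than that of the base model.

High accuracy

WER of 0.048, character error rate (CER) of 0.029, and BLEU score of 0.916.

Balanced output length

Length ratio of 0.978, close to the ideal value of 1.0.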

Model Capabilities

Urdu text recognition

High-accuracy OCR

Text extraction from images

Use Cases

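Document processing

Digitizing Urdu documents

Converts printed Urdu documents into editable electronic text.

Conversion is highly accurate, with a very low error rate.

Multi-font OCR

Multi-font text recognition

Supports recognition of several Urdu fonts and font sizes.

Maintains high accuracy across those fonts and sizes.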

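🚀 Qaari 0.1 Urdu: Urdu OCR Model

Qaari 0.1 Urdu is a model optimized specifically for optical character recognition (OCR) of Urdu text. It is fine-tuned from Qwen/Qwen2-VL-2B, which greatly improves its Urdu OCR capability, and it substantially outperforms both the base model and traditional OCR solutions such as Tesseract.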

🚀 Quick Start

You can load this model with the transformers and qwen_vl_utils libraries.

!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes

import os

import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model and its processor from the Hugging Face Hub.
model_name = "oddadmix/Qaari-0.1-Urdu-OCR-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

# "page.png" is a placeholder path; point it at the page you want to transcribe.
image = Image.open("page.png")
# Save a temporary copy that the vision utilities can read by file path.
src = os.path.abspath("image.png")
image.save(src)

# Single-turn chat message: the page image followed by the instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to whichever device the model was placed on.
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# Remove the temporary copy of the image.
os.remove(src)
print(output_text)
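
When several pages need to be transcribed, the message-building step above can be factored into a small helper. The following is only a sketch: `build_messages`, the `scans/` paths, and the placeholder prompt string are hypothetical names introduced for illustration and are not part of the model or of any library. It builds the chat payload in the same shape as the quick-start code and does not load the model.

```python
from pathlib import Path


def build_messages(image_path, prompt):
    """Return a single-turn chat payload for one page image (hypothetical helper)."""
    # Same "file://<path>" form as the quick-start code, made absolute so it
    # does not depend on the current working directory.
    src = Path(image_path).absolute()
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"file://{src}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Hypothetical page files and a stand-in for the quick-start prompt.
pages = ["scans/page_001.png", "scans/page_002.png"]
prompt = "<quick-start prompt goes here>"
batch = [build_messages(page, prompt) for page in pages]
print(len(batch), batch[0][0]["content"][0]["image"])
```

Each element of `batch` can then go through `processor.apply_chat_template` and `process_vision_info` in the same way as the single `messages` list in the quick start.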

📚 Documentation

Performance Metrics

Model	Word error rate (WER) ↓	Character error rate (CER) ↓	BLEU score ↑	Length ratio
Qaari 0.1 Urdu	0.048	0.029	0.916	0.978
Tesseract	0.352	0.227	0.518	0.770
Qwen Base	1.823	1.739	0.009	1.288
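
For readers who want to compute comparable scores on their own data, the sketch below implements WER and CER with a plain Levenshtein edit distance, plus a length ratio. The function names are illustrative, whitespace tokenization for WER is an assumption, and the length ratio is assumed here to be output characters divided by reference characters; the source does not state how its figures were computed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def length_ratio(reference, hypothesis):
    """Output length over reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return len(hypothesis) / len(reference)


print(wer("a b c d", "a x c d"))  # 0.25
print(wer("a b", "a b c d e"))    # 1.5
```

The second call shows why an error rate above 1, as in the Qwen Base row, is possible: inserted words count as errors, so an output much longer than the reference can accumulate more edits than the reference has words.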

Relative Improvement

Comparison	WER improvement	CER improvement	BLEU improvement
vs. Qwen Base	97.35%	98.32%	91.55%
vs. Tesseract	86.25%	87.11%	82.60%
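
The WER and CER percentages can be read as relative reductions, (baseline - model) / baseline. The sketch below applies that formula to the rounded WER and CER values from the performance metrics table. The formula is an assumption about how the published figures were derived, and the function name is illustrative. The recomputed values land close to, but not exactly on, the published ones, which may reflect rounding of the table entries.

```python
def relative_reduction(baseline, model):
    """Percentage by which `model` lowers an error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - model) / baseline * 100


# Rounded values copied from the performance metrics table.
scores = {
    "Qaari 0.1 Urdu": {"wer": 0.048, "cer": 0.029},
    "Tesseract": {"wer": 0.352, "cer": 0.227},
    "Qwen Base": {"wer": 1.823, "cer": 1.739},
}

ours = scores["Qaari 0.1 Urdu"]
for name in ("Qwen Base", "Tesseract"):
    for metric in ("wer", "cer"):
        gain = relative_reduction(scores[name][metric], ours[metric])
        print(f"{metric.upper()} vs {name}: {gain:.2f}%")
```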

Supported Fonts

AlQalam Taj Nastaleeq Regular
Alvi Nastaleeq Regular
Gandhara Suls Regular
Jameel Noori Nastaleeq Regular
NotoNastaliqUrdu-Regular

Supported Font Sizes

14pt
16pt
18pt
20pt
24pt
32pt
40pt

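Limitations

Performance may degrade on fonts that were not included in the fine-tuning dataset.
Font sizes outside the supported range may yield poorer results.
The model may not handle complex ligatures in non-Nastaleeq fonts well.
Performance on purely digital displays has not yet been fully optimized.
Quality may drop in low-resolution print environments.
Custom font modifications and non-standard Nastaleeq variants may not be handled as expected.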

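Training Details

Training Dataset

Dataset type: Urdu text images paired with transcriptions
Size: 10,000
Source: synthetic dataset

Training Configuration

Base model: Qwen/Qwen2-VL-2B
Hardware: A6000 GPU
Training time: 24 hours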

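🔧 Technical Details

The model is fine-tuned from Qwen2-VL-2B on a dataset of Urdu text images paired with transcriptions. Training focused on accurate recognition of Urdu characters and on optimizing natural-language understanding.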

📄 License

This model is subject to the license terms of its base model, Qwen2-VL-2B.

Citation

If you use this model in your research, please cite:

@misc{qaari-0.1-urdu,
  author = {Ahmed Wasfy},
  title = {Qaari 0.1 Urdu: OCR Model for Urdu Language},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/oddadmix/Qaari-0.1-Urdu-OCR-VL-2B-Instruct}}
}