Qaari 0.1 Urdu: Open-Source OCR Model for Accurate Urdu Text Recognition

Qaari 0.1 Urdu OCR VL 2B Instruct

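Developed by oddadmix

Qaari 0.1 Urdu is a model optimized for optical character recognition (OCR) of Urdu text. It is fine-tuned from Qwen/Qwen2-VL-2B and markedly improves on the base model's Urdu OCR capability.

Text recognition #Urdu OCR #High-accuracy text recognition #Nastaliq font optimization

Downloads: 257

Release date: 3/10/2025

Model overview

The model focuses on optical character recognition of Urdu text and, on the reported benchmarks, substantially outperforms both its base model and traditional OCR solutions.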

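Model features

Built for Urdu OCR

Optimized for recognizing Urdu script with high accuracy.

Strong performance

Word error rate (WER) is 97.35% lower than the base model's.

High accuracy

WER of 0.048, character error rate (CER) of 0.029, and BLEU score of 0.916.

Balanced output length

Length ratio of 0.978, close to the ideal value of 1.0.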

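Model capabilities

Urdu text recognition

High-accuracy OCR

Text extraction from images

Use cases

Document processing

Digitizing Urdu documents

Converts printed Urdu documents into editable electronic text.

High-accuracy conversion with a very low error rate.

Multi-font OCR

Text recognition across fonts

Recognizes several Urdu fonts and font sizes.

Maintains high accuracy across the supported fonts and sizes.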

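🚀 Qaari 0.1 Urdu: Urdu OCR Model

Qaari 0.1 Urdu is a model optimized for optical character recognition (OCR) of Urdu text. It is fine-tuned from Qwen/Qwen2-VL-2B, markedly improves Urdu OCR capability, and substantially outperforms both the base model and traditional OCR solutions such as Tesseract.

🚀 Quick start

You can load the model with the transformers and qwen_vl_utils libraries: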

!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes

import os

from PIL import Image
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the fine-tuned model and its processor from the Hugging Face Hub.
model_name = "oddadmix/Qaari-0.1-Urdu-OCR-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

# "page.jpg" is a placeholder path; point it at your own page image.
image = Image.open("page.jpg")
# Save a temporary PNG copy that the chat message can reference by path.
src = "image.png"
image.save(src)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]
# Render the chat template to text, then collect the image inputs it refers to.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to whichever device the model was placed on.
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# Remove the temporary PNG copy.
os.remove(src)
print(output_text)
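
When transcribing several pages, it can help to factor the message construction out of the quick-start code. The following is a minimal sketch rather than part of the model card: `build_ocr_messages` and `DEFAULT_PROMPT` are hypothetical names, and the `file://` reference mirrors the form used in the quick-start code.

```python
from pathlib import Path

# Prompt text copied from the quick-start code.
DEFAULT_PROMPT = (
    "Below is the image of one page of a document, as well as some raw textual "
    "content that was previously extracted for it. Just return the plain text "
    "representation of this document as if you were reading it naturally. "
    "Do not hallucinate."
)


def build_ocr_messages(image_path, prompt=DEFAULT_PROMPT):
    """Return a single-turn chat message list for one page image (hypothetical helper)."""
    # Resolve to an absolute path so the reference does not depend on the working directory.
    absolute = Path(image_path).resolve()
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"file://{absolute}"},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as `messages` in the quick-start code.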

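✨ Key features

Built for Urdu OCR: optimized for recognizing Urdu script with high accuracy.
Strong performance: word error rate (WER) is 97.35% lower than the base model's.
High accuracy: WER of 0.048, character error rate (CER) of 0.029, and BLEU score of 0.916.
Balanced output length: length ratio of 0.978, close to the ideal value of 1.0.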

📚 Detailed documentation

Performance metrics

Model	Word error rate (WER) ↓	Character error rate (CER) ↓	BLEU score ↑	Length ratio
Qaari 0.1 Urdu	0.048	0.029	0.916	0.978
Tesseract	0.352	0.227	0.518	0.770
Qwen Base	1.823	1.739	0.009	1.288
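
For readers who want to score their own transcriptions, WER and CER are conventionally defined as the edit distance between the model output and the reference, divided by the reference length, counted in words and in characters respectively. Because insertions count as errors, the value can exceed 1, as it does for Qwen Base. The sketch below follows that standard definition; the source does not say which tool or text normalization produced the reported numbers, so results may not be directly comparable. The function names are illustrative, and `length_ratio` assumes the ratio is output length over reference length in words.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a reference item
                    curr[j - 1] + 1,     # insert a hypothesis item
                    prev[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def length_ratio(reference, hypothesis):
    """Assumed definition: output word count divided by reference word count."""
    return len(hypothesis.split()) / len(reference.split())


# One of the four reference words is missing from the hypothesis.
reference = "یہ ایک مثال ہے"
hypothesis = "یہ ایک مثال"
print(wer(reference, hypothesis), length_ratio(reference, hypothesis))
```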

Improvement percentages

Compared with	WER improvement	CER improvement	BLEU improvement
Qwen Base	97.35%	98.32%	91.55%
Tesseract	86.25%	87.11%	82.60%
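
The WER and CER columns are consistent with a relative error reduction, (baseline - model) / baseline. As a rough check, the sketch below recomputes that quantity from the rounded scores in the performance table. It lands within a fraction of a percentage point of the published figures; the small gap is presumably because the published percentages were derived from unrounded scores. The source does not state the formula behind the BLEU column, so it is left out here. `relative_reduction` is an illustrative name.

```python
def relative_reduction(baseline, model):
    """Percentage by which `model` lowers an error rate relative to `baseline`."""
    return (baseline - model) / baseline * 100


# Rounded scores from the performance table, as (WER, CER).
scores = {
    "Qaari 0.1 Urdu": (0.048, 0.029),
    "Tesseract": (0.352, 0.227),
    "Qwen Base": (1.823, 1.739),
}

qaari_wer, qaari_cer = scores["Qaari 0.1 Urdu"]
for name in ("Qwen Base", "Tesseract"):
    base_wer, base_cer = scores[name]
    print(
        f"vs {name}: "
        f"WER reduction {relative_reduction(base_wer, qaari_wer):.2f}%, "
        f"CER reduction {relative_reduction(base_cer, qaari_cer):.2f}%"
    )
```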

Supported fonts

AlQalam Taj Nastaleeq Regular
Alvi Nastaleeq Regular
Gandhara Suls Regular
Jameel Noori Nastaleeq Regular
NotoNastaliqUrdu-Regular

Supported font sizes

14pt
16pt
18pt
20pt
24pt
32pt
40pt

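Limitations

Performance may degrade on fonts that were not included in the fine-tuning dataset.
Font sizes outside the supported range may lead to poor rendering.
The model may not handle complex ligatures in non-Nastaliq fonts effectively.
Performance on purely digital display devices has not been fully optimized.
Quality may degrade in low-resolution print settings.
Custom font modifications or non-standard Nastaliq variants may not render as expected.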
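
Because accuracy is reported only for the fonts and sizes listed above, a pipeline might flag inputs that fall outside that set before trusting the output. The helper below is a hypothetical sketch: it assumes the caller already knows the font name and point size (for example, from document metadata), which is often not the case for scanned pages.

```python
# Fonts and sizes copied from the supported lists above.
SUPPORTED_FONTS = {
    "AlQalam Taj Nastaleeq Regular",
    "Alvi Nastaleeq Regular",
    "Gandhara Suls Regular",
    "Jameel Noori Nastaleeq Regular",
    "NotoNastaliqUrdu-Regular",
}
SUPPORTED_SIZES_PT = {14, 16, 18, 20, 24, 32, 40}


def coverage_warnings(font_name, size_pt):
    """Return human-readable warnings for inputs outside the listed fonts and size range."""
    warnings = []
    if font_name not in SUPPORTED_FONTS:
        warnings.append(f"font {font_name!r} was not in the fine-tuning set")
    smallest, largest = min(SUPPORTED_SIZES_PT), max(SUPPORTED_SIZES_PT)
    if not smallest <= size_pt <= largest:
        warnings.append(f"size {size_pt}pt is outside the {smallest}-{largest}pt range")
    return warnings


print(coverage_warnings("Jameel Noori Nastaleeq Regular", 12))
```

An empty list means the input matches a listed font and falls within the listed size range; it says nothing about scan quality or resolution.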

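Training details

Training dataset

Dataset type: Urdu text images with paired transcriptions
Size: 10,000
Source: synthetic dataset

Training configuration

Base model: Qwen/Qwen2-VL-2B
Hardware: A6000 GPU
Training time: 24 hours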

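🔧 Technical details

The model is fine-tuned from Qwen2-VL-2B on a dataset of Urdu text images with paired transcriptions. Training focused on accurate recognition of Urdu characters and on natural language understanding.

📄 License

This model is subject to the license terms of its base model, Qwen2-VL-2B.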

Citation

If you use this model in your research, please cite:

@misc{qaari-0.1-urdu,
  author = {Ahmed Wasfy},
  title = {Qaari 0.1 Urdu: OCR Model for Urdu Language},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/oddadmix/Qaari-0.1-Urdu-OCR-VL-2B-Instruct}}
}
