Model Card for qwen-for-jawi-v1
This model card provides detailed information about the qwen-for-jawi-v1
model, which is specialized for Optical Character Recognition (OCR) of historical Malay texts written in Jawi script.
Features
- Specialized for OCR of historical Malay manuscripts in Jawi script.
- Based on Qwen2-VL-2B-Instruct, a vision-language model.
- Enables digital preservation of Malay cultural heritage and computational analysis of historical texts.
Installation
The README doesn't provide specific installation steps. The usage example below assumes that transformers, torch, qwen_vl_utils (distributed on PyPI as qwen-vl-utils), and Pillow are installed.
Usage Examples
Basic Usage
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
import torch
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load the fine-tuned model and a matching Qwen2-VL processor.
model_name = 'mevsg/qwen-for-jawi-v1'
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map='auto'
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Load the manuscript page image to transcribe.
image_path = 'path/to/image'
image = Image.open(image_path).convert('RGB')

# Build a single-turn chat message that pairs the image with the OCR instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": image,
            },
            {"type": "text", "text": "Convert this image to text"},
        ],
    }
]

# Apply the chat template and prepare the multimodal inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate the transcription and strip the prompt tokens from the output.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
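The same pipeline can be wrapped in a small helper to transcribe several pages in one run. The sketch below reuses the model, processor, and imports loaded above; the ocr_page helper and the pages/ directory are illustrative assumptions, not part of the original card.

import glob

def ocr_page(image_path, prompt="Convert this image to text"):
    # Run the same single-image chat pipeline as in the basic example above.
    image = Image.open(image_path).convert('RGB')
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

# Transcribe every JPEG page in a hypothetical folder of scans.
for path in sorted(glob.glob('pages/*.jpg')):
    print(path, ocr_page(path))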
Documentation
Model Description
This model is a fine-tuned version of Qwen/Qwen2-VL-7B-Instruct specialized for Optical Character Recognition (OCR) of historical Malay texts written in Jawi script (Arabic script adapted for the Malay language).
| Property | Details |
|----------|---------|
| Model Type | Vision-Language Model |
| Base Model | Qwen2-VL-2B-Instruct |
| Parameters | 2 billion |
| Language(s) | Malay (Jawi script) |
Intended Use
Primary Intended Uses
- OCR for historical Malay manuscripts written in Jawi script.
- Digital preservation of Malay cultural heritage.
- Enabling computational analysis of historical Malay texts.
Out-of-Scope Uses
- General Arabic text recognition.
- Modern Malay text processing.
- Real-time OCR applications.
Training Data
Dataset Description
This model was trained and evaluated using
Training Procedure
- Hardware used: 1 x H100
- Training time: 6 hours
Performance and Limitations
Performance Metrics
- Character Error Rate (CER): 8.66%
- Word Error Rate (WER): 25.50%
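To evaluate transcriptions in the same terms, CER and WER can be computed with the jiwer library. The snippet below is a minimal sketch with illustrative strings; jiwer is an assumption of this example, not a tool named by the original card.

import jiwer

# Illustrative ground-truth transcription and model output for one line of text.
reference = "ini adalah contoh transkripsi"
hypothesis = "ini adaleh contoh transkripsi"

cer = jiwer.cer(reference, hypothesis)  # fraction of character-level edits needed
wer = jiwer.wer(reference, hypothesis)  # fraction of word-level edits needed
print(f"CER: {cer:.2%}  WER: {wer:.2%}")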
Comparison with Other Models
We compared this model with Surya (https://github.com/VikParuchuri/surya), which reports high accuracy for Arabic but performs poorly on our Jawi data:
- Character Error Rate (CER): 70.89%
- Word Error Rate (WER): 91.73%
Technical Details
The README doesn't provide in-depth technical details, so this section is skipped.
License
The README doesn't provide license information, so this section is skipped.
Citation
@misc{qwen-for-jawi-v1,
  title     = {Qwen for Jawi v1: a model for Jawi OCR},
  author    = {Miguel Escobar Varela},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mevsg/qwen-for-Jawi-v1},
  note      = {Model created at National University of Singapore}
}
Acknowledgements
Special thanks to William Mattingly, whose finetuning script served as the base for our finetuning approach: https://github.com/wjbmattingly/qwen2-vl-finetune-huggingface