Open-source Handwritten Text Recognition Model for Court Records - Freely Recognize 19th-Century Finnish and Swedish Court Records

Court Records HTR

Developed by Kansallisarkisto

A handwriting recognition model fine-tuned from Microsoft's TrOCR, specialized for 19th-century Finnish and Swedish court record documents

Text Recognition

PyTorch

Open Source License: MIT
Tags: #Historical handwriting recognition #Finnish-Swedish OCR #Court archives digitization

Downloads 24

Release Time: 9/12/2024

Model Overview

This model is designed to recognize handwritten text from line images, specifically optimized for digitized 19th-century Finnish and Swedish court record documents.

Model Features

Specialized for historical documents

Fine-tuned on 19th-century court record handwriting, so it is tuned to the writing styles found in that material

Multilingual support

Supports handwriting recognition in both Finnish and Swedish

High-accuracy recognition

Achieves a 2.4% character error rate (CER) and an 11.3% word error rate (WER) on the validation set

Model Capabilities

Handwriting recognition

Historical document processing

Multilingual text extraction

Use Cases

Historical archives digitization

Court records transcription

Converting 19th-century handwritten court records into searchable digital text

Automatic transcription reaches a 2.4% character error rate on the validation set

Genealogical research

Historical population records processing

Automatically recognizing handwritten information in historical population registers; because the model was trained on court records, it may generalize poorly to this kind of material

🚀 Handwritten text recognition for Finnish 19th-century court records

This model performs handwritten text recognition from text line images. It was fine-tuned from Microsoft's TrOCR model using digitized 19th-century court record documents in Finnish and Swedish.

🚀 Quick Start

The model predicts the text content of text line images. A GPU is recommended for inference if one is available.

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model location in the Hugging Face Hub
model_checkpoint = "Kansallisarkisto/court-records-htr"
# Path to textline image
line_image_path = "/path/to/textline_image.jpg"

# Initialize processor and model
processor = TrOCRProcessor.from_pretrained(model_checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(model_checkpoint).to(device)

# Open image file and extract pixel values
image = Image.open(line_image_path).convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Use the model to generate predictions 
generated_ids = model.generate(pixel_values.to(device))
# Use the processor to decode ids to text
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

By default, the model downloaded from the Hugging Face Hub is cached locally in ~/.cache/huggingface/hub/.
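To check where the downloaded files end up, the cache location can be worked out by hand. The sketch below is a simplified approximation of the Hub's default lookup, not the library's own code: the helper names `default_hub_cache` and `model_cache_dir` are hypothetical, and it only considers the `HF_HUB_CACHE` and `HF_HOME` environment variables.

```python
import os
from pathlib import Path


def default_hub_cache() -> Path:
    """Approximate the default Hub cache directory (simplified; hypothetical helper)."""
    # HF_HUB_CACHE, when set, points directly at the hub cache directory.
    explicit = os.environ.get("HF_HUB_CACHE")
    if explicit:
        return Path(explicit).expanduser()
    # Otherwise the cache is the "hub" folder under HF_HOME,
    # which is assumed here to default to ~/.cache/huggingface.
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"


def model_cache_dir(repo_id: str) -> Path:
    """Return the folder a cached model repo is expected to occupy."""
    # Cached model repos are stored as "models--<owner>--<name>".
    return default_hub_cache() / ("models--" + repo_id.replace("/", "--"))


print(model_cache_dir("Kansallisarkisto/court-records-htr"))
```

Setting `HF_HOME` or `HF_HUB_CACHE` before the first download is the usual way to move the cache to a disk with more space.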

✨ Features

Performs handwritten text recognition from text line images.
Fine-tuned with 19th-century court record documents in Finnish and Swedish.

📦 Installation

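The original README gives no installation steps. The Quick Start code imports transformers, torch, and PIL (Pillow), so these packages need to be installed before running it.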

📚 Documentation

Intended uses & limitations

The model has been trained to recognize handwritten text from a specific type of 19th-century data, and may generalize poorly to other datasets. The model takes text line images as input, and the use of other types of input is not recommended.

Training data

The model was trained using 314,228 text line images from 19th-century court records, while the validation dataset contained 39,042 text line images.

Training procedure

This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:

train batch size: 24
epochs: 13
optimizer: AdamW
maximum length of text sequence: 64

For other parameters, the default values were used (find more information here). The training code is available in the train_trocr.py file.
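The dataset size, batch size, and epoch count above give a rough idea of the length of the training run. The short calculation below is a back-of-the-envelope sketch, assuming one optimizer step per batch (no gradient accumulation) and that the final partial batch of each epoch is kept. The model card states neither, so the actual step count in train_trocr.py may differ.

```python
import math

TRAIN_LINES = 314_228  # training text line images (from the model card)
BATCH_SIZE = 24        # train batch size (from the model card)
EPOCHS = 13            # number of epochs (from the model card)

# Assumption: one optimizer step per batch, final partial batch kept.
steps_per_epoch = math.ceil(TRAIN_LINES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch)  # 13093
print(total_steps)      # 170209
```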

Evaluation results

Evaluation results using the validation dataset are listed below:

Validation loss	Validation CER	Validation WER
0.248	0.024	0.113

The metrics were calculated using the Evaluate library. More information on the CER metric can be found here. More information on the WER metric can be found here.
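To make the two metrics concrete, here is a plain-Python sketch of their standard definitions: the edit distance between prediction and reference, divided by the reference length in characters (CER) or words (WER). It is an illustration only and not the Evaluate library's implementation. The function names and the example line are made up for this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, prediction) / len(reference)


def wer(reference: str, prediction: str) -> float:
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


reference = "Oikeus kokoontui Turussa"   # made-up reference line
prediction = "Oikeus kokontui Turussa"   # one character missing

print(cer(reference, prediction))  # 1 edit over 24 reference characters
print(wer(reference, prediction))  # 1 wrong word out of 3
```

A single wrong character makes the whole word count as an error, which is why the reported WER (0.113) is several times higher than the CER (0.024).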

🔧 Technical Details

The model was obtained by fine-tuning Microsoft's TrOCR model (microsoft/trocr-base-handwritten) on an NVIDIA RTX A6000 GPU, using the hyperparameters listed under Training procedure.

📄 License

The model is licensed under the MIT license.

Property	Details
Base Model	microsoft/trocr-base-handwritten
Pipeline Tag	image-to-text
Metrics	cer, wer
License	mit
