Multicentury HTR Model: Open-Source Handwritten Text Recognition for Digitizing Swedish and Finnish Historical Documents

Developed by Kansallisarkisto

A Transformer-based handwritten text recognition model, specifically designed for Swedish and Finnish, suitable for historical document digitization.

Text Recognition

PyTorch

License: Apache-2.0. Tags: Handwritten Text Recognition, Multi-century Handwriting, Nordic Language OCR

Release Time : 10/7/2024

Model Overview

This model is a fine-tuned version of microsoft/trocr-large-handwritten, focusing on recognizing handwritten texts from the 17th to 20th centuries, supporting document digitization and handwritten note transcription.

Model Features

Multi-century Handwriting Support

Training data includes handwriting samples from the 17th to 20th centuries, adapting to diverse writing styles.

Nordic Language Optimization

Specially optimized for special characters in Finnish and Swedish (e.g., å, ä, ö).

High Accuracy Recognition

Achieves a character error rate (CER) of 3.2 on a test set of about 94,900 text lines.

Model Capabilities

Handwritten Text Recognition

Historical Document Transcription

Table Data Extraction

Use Cases

Archive Digitization

Historical Manuscript Transcription

Convert historical handwritten documents in archives into searchable digital text.

CER 3.2 (test set of 94,900 lines of text)

Personal Applications

Handwritten Note Transcription

Convert personal handwritten notes into electronic text format.

🚀 Multicentury Handwritten Text Recognition Model

A fine-tuned Transformer-based OCR model specialized for recognizing handwritten text in Swedish and Finnish.

🚀 Quick Start

You can use the model directly with Hugging Face’s pipeline function or by manually loading the processor and model.

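from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the model and processor.
# A Hub repo id has the form "namespace/name", so the processor folder is passed
# through `subfolder`. This assumes the processor files sit in a "processor"
# folder of the repository, as the path in the original snippet suggests.
processor = TrOCRProcessor.from_pretrained("Kansallisarkisto/multicentury-htr-model", subfolder="processor")
model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/multicentury-htr-model")

# Open an image of handwritten text; TrOCR expects 3-channel RGB input
image = Image.open("path_to_image.png").convert("RGB")

# Preprocess the image into a tensor and generate token ids
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)

# Decode the token ids back into a string
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(generated_text)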
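
The snippet above transcribes one image. A digitization job usually involves many line images, so the sketch below wraps the same three calls (preprocess, generate, decode) in a batching loop. The names `batched` and `transcribe_lines` and the default batch size of 8 are illustrative assumptions, not part of the model card. The function takes an already loaded processor and model as arguments.

```python
from typing import Any, Iterator, List, Sequence


def batched(items: Sequence[Any], batch_size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def transcribe_lines(images: Sequence[Any], processor: Any, model: Any,
                     batch_size: int = 8) -> List[str]:
    """Transcribe line images in batches (hypothetical helper).

    `processor` and `model` are expected to behave like the TrOCRProcessor and
    VisionEncoderDecoderModel objects loaded in the Quick Start snippet.
    """
    texts: List[str] = []
    for batch in batched(images, batch_size):
        # Same three steps as the single-image example, applied to a list of images
        pixel_values = processor(images=list(batch), return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

With the objects from the Quick Start loaded, `transcribe_lines(list_of_rgb_images, processor, model)` would return one string per image, in input order. A smaller batch size reduces memory use at the cost of more generate calls.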

✨ Features

Specialized for handwritten text recognition in Swedish and Finnish.
Based on a Transformer architecture (TrOCR) with an encoder-decoder setup.
Trained on various datasets from the 17th to 20th centuries.

📦 Installation

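The original model card lists no installation steps. The Quick Start code imports the transformers and Pillow (PIL) packages and requests PyTorch tensors, so those three libraries need to be installed.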

📚 Documentation

Model Description

Model Name: multicentury-htr-model
Model Type: Transformer-based OCR (TrOCR)
Base Model: microsoft/trocr-large-handwritten
Purpose: Handwritten text recognition
Languages: Swedish, Finnish
License: Apache 2.0

This model is a fine-tuned version of the microsoft/trocr-large-handwritten model, specialized for recognizing handwritten text. It has been trained on various datasets from the 17th to 20th centuries and can be used for applications such as document digitization, form recognition, or any task involving handwritten text extraction.

Model Architecture

The model is based on a Transformer architecture (TrOCR) with an encoder-decoder setup:

The encoder processes images of handwritten text.
The decoder generates corresponding text output.
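
To make the division of labor concrete, here is a toy sketch of an encoder-decoder generation loop. It is not the TrOCR implementation: `toy_encode` and `toy_decode_step` are hypothetical stand-ins, and the loop uses plain greedy decoding, whereas the decoding strategy the real model uses depends on its generation configuration, which the model card does not state. The point it illustrates is that the encoder runs once per image, while the decoder is called repeatedly, each time seeing the encoder output plus the tokens produced so far.

```python
from typing import Callable, List, Sequence


def greedy_decode(encode: Callable[[Sequence[int]], List[int]],
                  decode_step: Callable[[List[int], List[int]], int],
                  image: Sequence[int],
                  bos_id: int, eos_id: int, max_len: int = 32) -> List[int]:
    """Generate token ids one at a time until EOS or max_len is reached."""
    memory = encode(image)      # encoder: runs once per image
    tokens = [bos_id]           # decoder input starts with a begin-of-sequence id
    for _ in range(max_len):
        next_id = decode_step(memory, tokens)  # decoder: one new token per call
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]           # drop the begin-of-sequence id


# Toy stand-ins: the "image" is already a list of token ids, and the "decoder"
# copies the next item from the encoder output, then emits EOS.
def toy_encode(image: Sequence[int]) -> List[int]:
    return list(image)


def toy_decode_step(memory: List[int], tokens: List[int]) -> int:
    position = len(tokens) - 1
    return memory[position] if position < len(memory) else 1  # 1 is the EOS id here


print(greedy_decode(toy_encode, toy_decode_step, [5, 6, 7], bos_id=0, eos_id=1))
```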

Intended Use

This model is designed for handwritten text recognition and is intended for use in:

Document digitization (e.g., archival work, historical manuscripts)
Handwritten notes transcription

Training Data

The training dataset includes more than 760,000 samples of handwritten text rows, covering a wide variety of handwriting styles and text samples.

Evaluation

The model was evaluated on a test dataset. Below are key metrics:

Metric	Value
Character Error Rate (CER)	3.2
Test dataset size	approximately 94,900 text rows
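
For readers who want to measure CER on their own transcriptions, the following is a minimal sketch of the usual definition: character-level edit distance divided by the number of reference characters, pooled over all lines. The model card does not describe the authors' evaluation script, so the pooling choice and the function names are assumptions. The result is expressed as a percentage on the further assumption that the reported 3.2 is a percentage.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """CER in percent, pooled over all lines (total edits / total reference chars)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return 100.0 * total_edits / total_chars
```

Pooling over the corpus weights long lines more heavily than averaging per-line CER values would, so the two conventions can give different numbers on the same data.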

Limitations and Biases

The model was trained primarily on handwritten text that uses basic Latin characters (A-Z, a-z) and includes Nordic special characters (å, ä, ö). It has not been trained on non-Latin alphabets, such as Chinese characters, Cyrillic script, or other writing systems like Arabic or Hebrew. The model may not generalize well to languages other than Finnish, Swedish, or English.
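
One practical consequence is that reference transcriptions containing other letters fall outside what the model was trained to output. The sketch below flags such letters before a dataset is used for evaluation or fine-tuning. The helper name and the decision to check only alphabetic characters (digits and punctuation are ignored, since the model card does not list them) are assumptions made for illustration.

```python
import string
import unicodedata
from typing import Set

# Letters named in the model card: basic Latin plus the Nordic å, ä, ö
SUPPORTED_LETTERS = frozenset(string.ascii_letters + "åäöÅÄÖ")


def unsupported_letters(text: str) -> Set[str]:
    """Return the alphabetic characters in `text` outside the supported set."""
    # NFC normalization merges "a" + combining ring into the single character "å"
    normalized = unicodedata.normalize("NFC", text)
    return {ch for ch in normalized
            if ch.isalpha() and ch not in SUPPORTED_LETTERS}
```

An empty result means every letter in the line is within the documented character set; a non-empty result lists the letters that need review.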

Future Work

Potential improvements for this model include:

Expanding training data: Incorporating more diverse handwriting styles and languages.
Optimizing for specific domains: Fine-tuning the model on domain-specific handwriting.

📄 License

This model is licensed under the Apache 2.0 license.

📚 Citation

If you use this model in your work, please cite it as:

@misc{multicentury_htr_model_2024,
  author = {Kansallisarkisto},
  title = {Multicentury HTR Model: Handwritten Text Recognition},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Kansallisarkisto/multicentury-htr-model/}},
}

📇 Model Card Authors

Author: Kansallisarkisto
Contact Information: riikka.marttila@kansallisarkisto.fi, ilkka.jokipii@kansallisarkisto.fi
