🚀 Model Card for DA-Bert_Old_News_V1
DA-Bert_Old_News_V1 is the first version of a transformer model trained on Danish historical texts from the period of Danish Absolutism (1660–1849) by researchers at Aalborg University. The model aims to provide a domain-specific solution for extracting meaning from texts that differ significantly from contemporary Danish.
✨ Features
- Domain-specific masked token prediction.
- Embedding extraction for semantic search.
- Suitable for further fine-tuning.
📦 Installation
No installation details are provided; the usage examples below assume the Hugging Face `transformers` library is installed.
💻 Usage Examples
Basic Usage
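A minimal sketch of masked token prediction with the Hugging Face `transformers` pipeline. The model id below is a placeholder (substitute the actual Hub id or a local checkpoint path), the standard `[MASK]` token is assumed, and the Danish example sentence is purely illustrative:

```python
from transformers import pipeline

# Placeholder model id -- replace with the actual Hub id or a local checkpoint path
fill_mask = pipeline("fill-mask", model="CALDISS-AAU/DA-Bert_Old_News_V1")

# Illustrative historical-style Danish sentence:
# "The king of [MASK] has issued a new decree."
for prediction in fill_mask("Kongen af [MASK] har udstedt en ny forordning."):
    print(prediction["token_str"], prediction["score"])
```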
📚 Documentation
Model Details
This is a BERT model pretrained on the masked language modeling (MLM) task.
- Training data: ENO (Enevældens Nyheder Online), a corpus of news articles, announcements, and advertisements from Danish and Norwegian newspapers from 1762 to 1848. The model was trained on a subset of about 260 million words. The data was created using a tailored Transkribus PyLaia model with an error rate of around 5% at the word level.
Model Description
- Architecture: BERT
- Pretraining Objective: Masked Language Modeling (MLM)
- Sequence Length: 512 tokens
- Tokenizer: Custom WordPiece tokenizer
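These properties can be inspected after loading the model, assuming the same placeholder Hub id used in the usage example above:

```python
from transformers import AutoConfig, AutoTokenizer

MODEL_ID = "CALDISS-AAU/DA-Bert_Old_News_V1"  # placeholder id, see Usage Examples

config = AutoConfig.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

print(config.model_type)                # "bert"
print(config.max_position_embeddings)   # 512
print(len(tokenizer))                   # vocabulary size of the custom WordPiece tokenizer
```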
| Property | Details |
|---|---|
| Developed by | CALDISS |
| Shared by | JohanHeinsen |
| Model Type | BERT |
| Language(s) (NLP) | Danish |
| License | MIT |
Model Sources
- Repository: https://github.com/CALDISS-AAU/OldNewsBERT
- Paper: In progress
Uses
Direct Use
- This model can be used out-of-the-box for domain-specific masked token prediction.
- It can also be used for basic mean-pooled embeddings on similar data, though results may vary since it was only trained on the MLM task using the Transformers Trainer framework (see the sketch below).
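A minimal sketch of mean-pooled embedding extraction, again using the placeholder model id; this illustrates one common way to load the model, not an official API of the project:

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "CALDISS-AAU/DA-Bert_Old_News_V1"  # placeholder id; replace as needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def mean_pooled_embedding(text: str) -> torch.Tensor:
    """Return a mean-pooled sentence embedding, ignoring padding tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state             # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Illustrative sentence: "The ship arrived in Copenhagen with a cargo of grain."
emb = mean_pooled_embedding("Skibet ankom til København med en ladning korn.")
print(emb.shape)  # torch.Size([1, hidden_size])
```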
Out-of-Scope Use
Because the model is trained on the historical ENO dataset, it is not suitable for modern Danish text.
Bias, Risks, and Limitations
The model has several limitations:
- It is limited to the historical period of its training data. Performance will vary when it is used for masked token prediction on modern Danish or other Scandinavian languages.
- There is a bias towards newspaper-style writing. Performance may vary on materials with more figurative language.
- There are small biases and risks stemming from the approximately 5% word-level error rate introduced during corpus creation, which persists in the pretrained model.
Recommendations
The model is based on historical texts reflecting antiquated worldviews, including racist, anti-democratic, and patriarchal sentiments. It is therefore unfit for many use cases, but it can be used to examine such biases in Danish history.
Training Details
Training Data
[More Information Needed]
Training Procedure
Preprocessing
- Texts shorter than 35 characters were removed.
- Texts containing a predetermined amount of German, Latin, or rare words were removed.
- Extra whitespace was removed.
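A minimal sketch of this kind of filtering. The actual German/Latin/rare-word criteria and thresholds are not documented in the card; the word list and `max_flagged` value below are placeholders:

```python
import re

# Hypothetical word list -- the actual German/Latin/rare-word criteria are not documented
FLAGGED_WORDS = {"und", "der", "das", "quod", "ergo"}  # illustrative placeholder only

def clean_texts(texts, min_length=35, max_flagged=5):
    """Illustrative preprocessing: collapse extra whitespace, drop very short texts,
    and drop texts containing too many flagged (German/Latin/rare) words.
    The max_flagged threshold is a placeholder, not the value used by the authors."""
    cleaned = []
    for text in texts:
        text = re.sub(r"\s+", " ", text).strip()  # remove extra whitespace
        if len(text) < min_length:                # texts shorter than 35 characters are removed
            continue
        flagged = sum(tok.lower() in FLAGGED_WORDS for tok in text.split())
        if flagged > max_flagged:
            continue
        cleaned.append(text)
    return cleaned
```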
Training Hyperparameters
- Training regime: [More Information Needed]
- The model was trained for roughly 45 hours on the provided HPC system.
- The MLM masking probability was set to 0.15.
Training arguments:

```python
import torch
from transformers import TrainingArguments

# local_rank is supplied by the distributed launcher (e.g. torchrun);
# output_dir and other unlisted fields are omitted here
training_args = TrainingArguments(
    eval_strategy="steps",
    overwrite_output_dir=True,
    num_train_epochs=15,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=64,
    logging_steps=500,
    learning_rate=5e-5,
    save_steps=1000,
    save_total_limit=5,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=torch.cuda.is_available(),
    warmup_steps=2000,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    save_on_each_node=False,
    ddp_find_unused_parameters=False,
    optim="adamw_torch",
    local_rank=local_rank,
)
```
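A minimal sketch of how these arguments and the 0.15 masking probability could be wired into the Transformers Trainer. The tokenizer path, model configuration, and datasets below are assumptions for illustration, not taken from the card:

```python
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
)

# Hypothetical path -- the custom WordPiece tokenizer has no published id in this card
tokenizer = AutoTokenizer.from_pretrained("path/to/custom-wordpiece-tokenizer")

# Freshly initialised BERT for pretraining; the exact configuration is an assumption
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# 15% of tokens are masked during training, as stated above
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,            # the TrainingArguments shown above
    data_collator=data_collator,
    train_dataset=train_dataset,   # tokenised ENO subset (not shown here)
    eval_dataset=eval_dataset,
)
trainer.train()
```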
Speeds, Sizes, Times
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
- Cross-entropy loss (standard for BERT with MLM training).
- Avg. loss on the test set.
- Perplexity (calculated from the loss value).
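For reference, perplexity here is the exponential of the average cross-entropy loss; a minimal illustration:

```python
import math

def perplexity(avg_cross_entropy_loss: float) -> float:
    # Perplexity is the exponential of the average cross-entropy loss
    return math.exp(avg_cross_entropy_loss)
```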
Results
- Loss: 2.08
- Avg. loss on test set: 2.07
- Perplexity: 7.65
Model Examination
[More Information Needed]
Technical Specifications
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
- Hardware Type: 64 CPU cores (Intel Xeon Gold 6326), 256 GB memory, 4 NVIDIA A10 GPUs
- Hours used: 44 hours 34 minutes
- Cloud Provider: UCloud (SDU)
- Compute Region: Cloud services hosted at the University of Southern Denmark, Aarhus University, and Aalborg University
- Software: Python 3.12.8
Citation
BibTeX
[More Information Needed]
APA
[More Information Needed]
Model Card Authors
- Matias Appel (mkap@adm.aau.dk)
- Johan Heinsen (heinsen@dps.aau.dk)
Model Card Contact
CALDISS, AAU: www.caldiss.aau.dk