🚀 Model Card for DA-Bert_Old_News_V1
DA-Bert_Old_News_V1 is the first version of a transformer model trained on Danish historical texts from the period of Danish Absolutism (1660–1849) by researchers at Aalborg University. The model aims to provide a domain-specific solution for extracting meaning from texts that differ significantly from contemporary Danish.
✨ Features
- Domain-specific masked token prediction.
- Embedding extraction for semantic search.
- Suitable for further fine-tuning.
📦 Installation
No installation details are provided; the usage examples below assume the Hugging Face `transformers` library is installed.
💻 Usage Examples
Basic Usage
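A minimal sketch of masked token prediction with the Hugging Face `transformers` pipeline. The model id below is a placeholder (substitute the actual Hub id or a local checkpoint path), the standard `[MASK]` token is assumed, and the Danish example sentence is purely illustrative:

```python
from transformers import pipeline

# Placeholder model id -- replace with the actual Hub id or a local checkpoint path
fill_mask = pipeline("fill-mask", model="CALDISS-AAU/DA-Bert_Old_News_V1")

# Illustrative historical-style Danish sentence:
# "The king of [MASK] has issued a new decree."
for prediction in fill_mask("Kongen af [MASK] har udstedt en ny forordning."):
    print(prediction["token_str"], prediction["score"])
```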
📚 Documentation
Model Details
This is a BERT model pretrained on the masked language modeling (MLM) task.
- Training data: ENO (Enevældens Nyheder Online), a corpus of news articles, announcements, and advertisements from Danish and Norwegian newspapers from 1762 to 1848. The model was trained on a subset of about 260 million words. The data was created using a tailored Transkribus PyLaia model with an error rate of around 5% at the word level.
Model Description
- Architecture: BERT
- Pretraining Objective: Masked Language Modeling (MLM)
- Sequence Length: 512 tokens
- Tokenizer: Custom WordPiece tokenizer
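These properties can be inspected after loading the model, assuming the same placeholder Hub id used in the usage example above:

```python
from transformers import AutoConfig, AutoTokenizer

MODEL_ID = "CALDISS-AAU/DA-Bert_Old_News_V1"  # placeholder id, see Usage Examples

config = AutoConfig.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

print(config.model_type)                # "bert"
print(config.max_position_embeddings)   # 512
print(len(tokenizer))                   # vocabulary size of the custom WordPiece tokenizer
```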
| Property | Details |
|---|---|
| Developed by | CALDISS |
| Shared by | JohanHeinsen |
| Model Type | BERT |
| Language(s) (NLP) | Danish |
| License | MIT |
Model Sources
- Repository: https://github.com/CALDISS-AAU/OldNewsBERT
- Paper: In progress
Uses
Direct Use
- This model can be used out-of-the-box for domain-specific masked token prediction.
- It can also be used for basic mean-pooled embeddings on similar data, though results may vary since it was only trained on the MLM task using the Transformers Trainer framework (see the sketch below).
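A minimal sketch of mean-pooled embedding extraction, again using the placeholder model id; this illustrates one common way to load the model, not an official API of the project:

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "CALDISS-AAU/DA-Bert_Old_News_V1"  # placeholder id; replace as needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def mean_pooled_embedding(text: str) -> torch.Tensor:
    """Return a mean-pooled sentence embedding, ignoring padding tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state             # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Illustrative sentence: "The ship arrived in Copenhagen with a cargo of grain."
emb = mean_pooled_embedding("Skibet ankom til København med en ladning korn.")
print(emb.shape)  # torch.Size([1, hidden_size])
```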
Out-of-Scope Use
Because the model is trained on the historical ENO dataset, it is not suitable for modern Danish text.
Bias, Risks, and Limitations
The model has several limitations:
- It is limited to the historical period of its training data. Performance will vary when it is used for masked token prediction on modern Danish or other Scandinavian languages.
- There is a bias towards newspaper-style writing. Performance may vary on materials with more figurative language.
- There are small biases and risks stemming from the approximately 5% word-level error rate introduced during corpus creation, which persists in the pretrained model.
Recommendations
The model is based on historical texts reflecting antiquated worldviews, including racist, anti-democratic, and patriarchal sentiments. It is therefore unfit for many use cases, but it can be used to examine such biases in Danish history.
Training Details
Training Data
[More Information Needed]
Training Procedure
Preprocessing
- Texts shorter than 35 characters were removed.
- Texts containing a predetermined amount of German, Latin, or rare words were removed.
- Extra whitespace was removed.
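A minimal sketch of this kind of filtering. The actual German/Latin/rare-word criteria and thresholds are not documented in the card; the word list and `max_flagged` value below are placeholders:

```python
import re

# Hypothetical word list -- the actual German/Latin/rare-word criteria are not documented
FLAGGED_WORDS = {"und", "der", "das", "quod", "ergo"}  # illustrative placeholder only

def clean_texts(texts, min_length=35, max_flagged=5):
    """Illustrative preprocessing: collapse extra whitespace, drop very short texts,
    and drop texts containing too many flagged (German/Latin/rare) words.
    The max_flagged threshold is a placeholder, not the value used by the authors."""
    cleaned = []
    for text in texts:
        text = re.sub(r"\s+", " ", text).strip()  # remove extra whitespace
        if len(text) < min_length:                # texts shorter than 35 characters are removed
            continue
        flagged = sum(tok.lower() in FLAGGED_WORDS for tok in text.split())
        if flagged > max_flagged:
            continue
        cleaned.append(text)
    return cleaned
```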
Training Hyperparameters
- Training regime: [More Information Needed]
- The model was trained for roughly 45 hours on the provided HPC system.
- The MLM masking probability was set to 0.15.
Training arguments:

```python
import torch
from transformers import TrainingArguments

# local_rank is supplied by the distributed launcher (e.g. torchrun);
# output_dir and other unlisted fields are omitted here
training_args = TrainingArguments(
    eval_strategy="steps",
    overwrite_output_dir=True,
    num_train_epochs=15,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=64,
    logging_steps=500,
    learning_rate=5e-5,
    save_steps=1000,
    save_total_limit=5,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=torch.cuda.is_available(),
    warmup_steps=2000,
    warmup_ratio=0.03,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    dataloader_num_workers=4,
    dataloader_pin_memory=True,
    save_on_each_node=False,
    ddp_find_unused_parameters=False,
    optim="adamw_torch",
    local_rank=local_rank,
)
```
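A minimal sketch of how these arguments and the 0.15 masking probability could be wired into the Transformers Trainer. The tokenizer path, model configuration, and datasets below are assumptions for illustration, not taken from the card:

```python
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
)

# Hypothetical path -- the custom WordPiece tokenizer has no published id in this card
tokenizer = AutoTokenizer.from_pretrained("path/to/custom-wordpiece-tokenizer")

# Freshly initialised BERT for pretraining; the exact configuration is an assumption
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# 15% of tokens are masked during training, as stated above
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=training_args,            # the TrainingArguments shown above
    data_collator=data_collator,
    train_dataset=train_dataset,   # tokenised ENO subset (not shown here)
    eval_dataset=eval_dataset,
)
trainer.train()
```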
Speeds, Sizes, Times
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
- Cross-entropy loss (standard for BERT with MLM training).
- Avg. loss on the test set.
- Perplexity (calculated from the loss value).
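For reference, perplexity here is the exponential of the average cross-entropy loss; a minimal illustration:

```python
import math

def perplexity(avg_cross_entropy_loss: float) -> float:
    # Perplexity is the exponential of the average cross-entropy loss
    return math.exp(avg_cross_entropy_loss)
```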
Results
- Loss: 2.08
- Avg. loss on test set: 2.07
- Perplexity: 7.65
Model Examination
[More Information Needed]
Technical Specifications
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
- Hardware Type: 64 CPU cores (Intel Xeon Gold 6326), 256 GB memory, 4 NVIDIA A10 GPUs
- Hours used: 44 hours 34 minutes
- Cloud Provider: UCloud (SDU)
- Compute Region: Cloud services hosted at the University of Southern Denmark, Aarhus University, and Aalborg University
- Software: Python 3.12.8
Citation
BibTeX
[More Information Needed]
APA
[More Information Needed]
Model Card Authors
- Matias Appel (mkap@adm.aau.dk)
- Johan Heinsen (heinsen@dps.aau.dk)
Model Card Contact
CALDISS, AAU: www.caldiss.aau.dk