🚀 ScandiBERT
A Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text, currently ranking highest on the ScandEval leaderboard.
🚀 Quick Start
Note: The model was updated on 2022-09-27.
The model was trained on the data shown in the table below, with a batch size of 8.8k, for 72 epochs on 24 V100 GPUs. Training took about 2 weeks.
| Language | Data | Size |
|---|---|---|
| Icelandic | See IceBERT paper | 16 GB |
| Danish | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus | 42 GB |
| Swedish | Swedish Gigaword Corpus | 3.4 GB |
| Faroese | FC3 + Sosialurinn + Bible | 69 MB |
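To make the balance of the training data easier to see, the sizes in the table above can be turned into per-language shares. The following is a minimal sketch, not part of the original training code. It assumes decimal units (69 MB taken as 0.069 GB) and, for the rough compute estimate, that "about 2 weeks" means exactly 14 days. The names `CORPUS_GB`, `language_shares`, and `GPU_HOURS` are illustrative.

```python
# Corpus sizes copied from the table above, in GB.
# Assumption: decimal units, so 69 MB is taken as 0.069 GB.
CORPUS_GB = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
    "Faroese": 0.069,
}


def language_shares(sizes):
    """Return each language's fraction of the total corpus size."""
    total = sum(sizes.values())
    return {lang: size / total for lang, size in sizes.items()}


TOTAL_GB = sum(CORPUS_GB.values())
SHARES = language_shares(CORPUS_GB)

# Rough compute estimate: 24 GPUs running for 14 days of 24 hours each.
# Assumption: "about 2 weeks" is treated as exactly 14 days.
GPU_HOURS = 24 * 14 * 24

# Print the languages from largest to smallest share.
for lang, share in sorted(SHARES.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {share:7.2%}")
print(f"Total: {TOTAL_GB:.3f} GB, roughly {GPU_HOURS} V100 GPU-hours")
```

Under these assumptions, Norwegian makes up roughly 63% of the data by size, while Faroese accounts for about 0.1%.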
Note: An earlier, half-trained model was uploaded here but has since been removed and replaced by the updated model.
The ScandEval leaderboard, on which the model currently holds the top rank, is available at https://scandeval.github.io/pretrained/.
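As a starting point, the model can be queried for masked-token predictions through the Hugging Face `transformers` fill-mask pipeline. The sketch below is illustrative, not an official usage recipe. The model ID `vesteinn/ScandiBERT` is an assumption that this document does not state, and the helper names `mask_word` and `top_predictions` are hypothetical. The mask token is read from the tokenizer instead of being hard-coded.

```python
# Assumption: the checkpoint is published on the Hugging Face Hub under this ID.
MODEL_ID = "vesteinn/ScandiBERT"


def mask_word(sentence: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(sentence: str, word: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Mask `word` and return the model's top (token, score) candidates for it."""
    # Imported lazily so mask_word stays usable without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token rather than assuming "<mask>".
    masked = mask_word(sentence, word, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(masked, top_k=top_k)]


# Example call (downloads the model weights on first use):
# top_predictions("Það er vorhret á, napur vindur sem hvín.", "vindur")
```

Calling `top_predictions` requires `transformers` and PyTorch to be installed, plus network access for the first download. `mask_word` on its own has no dependencies.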
✨ Features
- Supports multiple Scandinavian languages including Icelandic, Danish, Norwegian, Swedish, and Faroese.
- Achieves the highest ranking on the ScandEval leaderboard.
📚 Documentation
If you find this model useful, please cite:
@inproceedings{snaebjarnarson-etal-2023-transfer,
title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
author = "Snæbjarnarson, Vésteinn and
Simonsen, Annika and
Glavaš, Goran and
Vulić, Ivan",
booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
month = "may 22--24",
year = "2023",
address = "Tórshavn, Faroe Islands",
publisher = {Link{\"o}ping University Electronic Press, Sweden},
}
📄 License
This model is released under the AGPL-3.0 license.
Additional Information
Supported Languages
- Icelandic
- Danish
- Swedish
- Norwegian
- Faroese
Widget Examples
- "Fina lilla, jag vill inte bliva stur."
- "Nu ved jeg, at du frygter og end ikke vil nægte mig din eneste søn.."
- "Það er vorhret á, napur vindur sem hvín."
- "Ja, Gud signi, mítt land."
- "Alle dyrene i må være venner."
Tags
- roberta
- icelandic
- norwegian
- faroese
- danish
- swedish
- masked-lm
- pytorch
Datasets
- vesteinn/FC3
- vesteinn/IC3
- mideind/icelandic-common-crawl-corpus-IC3
- NbAiLab/NCC
- DDSC/partial-danish-gigaword-no-twitter