# bert-base-romanian-uncased-v1
The BERT base, uncased model for Romanian, trained on a 15GB corpus, version v1.0.
## Quick Start

### How to use
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")

# Tokenize one sentence and run it through the model (batch of size 1)
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)
outputs = model(input_ids)

# Last hidden states: one vector per input token
last_hidden_states = outputs[0]
```
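If you need one vector per sentence rather than per-token hidden states, a common recipe (not prescribed by this model card, shown here only as a minimal sketch) is attention-masked mean pooling over the last hidden states. Continuing from the snippet above:

```python
# Minimal sketch: mean-pool token embeddings into sentence embeddings,
# ignoring padding positions via the attention mask.
batch = tokenizer(["Acesta este un test.", "Acesta este alt test."],
                  padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
summed = (out.last_hidden_state * mask).sum(dim=1)     # sum over real tokens only
sentence_embeddings = summed / mask.sum(dim=1)         # (batch, hidden_size)
```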
## ⚠️ Important Note
Remember to always sanitize your text! Replace the s and t cedilla-letters with the corresponding comma-letters:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

because the model was NOT trained on cedilla s and ts. If you don't, you will get decreased performance due to `<UNK>`s and an increased number of tokens per word.
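A small helper like the one below (the `sanitize` name is ours, purely illustrative) can be applied to every string before it reaches the tokenizer, reusing the tokenizer from the Quick Start snippet above:

```python
def sanitize(text: str) -> str:
    """Map cedilla ş/ţ (and uppercase forms) to the correct comma-below ș/ț letters."""
    return (text.replace("ţ", "ț").replace("ş", "ș")
                .replace("Ţ", "Ț").replace("Ş", "Ș"))

input_ids = torch.tensor(
    tokenizer.encode(sanitize("Aceasta este o propoziţie."), add_special_tokens=True)
).unsqueeze(0)
```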
## Documentation

### Evaluation
Evaluation is performed on Universal Dependencies Romanian RRT UPOS, XPOS and LAS, and on a NER task based on RONEC. Details, as well as more in-depth tests not shown here, are given on the dedicated evaluation page. The baseline is the Multilingual BERT model `bert-base-multilingual-(un)cased`, as, at the time of writing, it was the only available BERT model that works on Romanian.
| Model                          | UPOS  | XPOS  | NER   | LAS   |
|--------------------------------|-------|-------|-------|-------|
| bert-base-multilingual-uncased | 97.65 | 95.72 | 83.91 | 87.65 |
| bert-base-romanian-uncased-v1  | 98.18 | 96.84 | 85.26 | 89.61 |
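The exact fine-tuning setup behind these numbers is documented on the evaluation page; purely as an illustrative sketch (the label count below is the standard 17-tag UPOS set, not a verified detail of that setup), token-level tasks such as UPOS tagging are typically handled by putting a token-classification head on top of the base model:

```python
from transformers import AutoModelForTokenClassification

# Illustrative only: a randomly initialised classification head over the
# 17 Universal Dependencies UPOS tags; it must be fine-tuned (e.g. on the
# Romanian RRT treebank) before it produces meaningful predictions.
upos_model = AutoModelForTokenClassification.from_pretrained(
    "dumitrescustefan/bert-base-romanian-uncased-v1",
    num_labels=17,
)
```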
### Corpus
The model is trained on the following corpora (stats in the table below are after cleaning):
| Corpus    | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|-----------|----------:|----------:|----------:|----------:|
| OPUS      | 55.05     | 635.04    | 4.045     | 3.8       |
| OSCAR     | 33.56     | 1725.82   | 11.411    | 11        |
| Wikipedia | 1.54      | 60.47     | 0.411     | 0.4       |
| Total     | 90.15     | 2421.33   | 15.867    | 15.2      |
### Citation
If you use this model in a research paper, I'd kindly ask you to cite the following paper:
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
or, in bibtex:
```bibtex
@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan and
      Avram, Andrei-Marius and
      Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}
```
### Acknowledgements
- We'd like to thank Sampo Pyysalo from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
## License
This project is licensed under the MIT license.