bert-base-romanian-cased-v1
This is a base, cased BERT model for Romanian, trained on a 15GB corpus. The current version is v1.0.
Quick Start
Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

# the last hidden states: a tensor of shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs[0]
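If you need a single vector per sentence, one common approach is to mean-pool the token embeddings over the attention mask. This is a hedged sketch, not a pooling strategy prescribed by this model card; the variable names are our own:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# encode the sentence; the tokenizer also returns the attention mask
inputs = tokenizer("Acesta este un test.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# mean-pool the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)            # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (1, hidden_size)
sentence_embedding = summed / mask.sum(dim=1)            # (1, hidden_size)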
Important Note
Remember to always sanitize your text! Replace the cedilla letters ş and ţ with the comma-below letters ș and ț:
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
The model was NOT trained on cedilla ş and ţ. If you skip this step, performance will decrease due to <UNK> tokens and an increased number of tokens per word.
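For illustration, here is a minimal helper that applies this replacement before tokenizing with the tokenizer loaded above; the function name sanitize and the example sentence are our own, not part of the model card:

# hypothetical helper: map legacy cedilla diacritics to the comma-below
# forms the model was trained on (mapping taken from the note above)
def sanitize(text: str) -> str:
    return (text.replace("ţ", "ț").replace("ş", "ș")
                .replace("Ţ", "Ț").replace("Ş", "Ș"))

raw = "Aşadar, soluţia funcţionează."  # cedilla ş/ţ from a legacy encoding
print(tokenizer.tokenize(raw))             # likely [UNK]s / extra word pieces
print(tokenizer.tokenize(sanitize(raw)))   # expected subword split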
Documentation
Evaluation
Evaluation is performed on Universal Dependencies Romanian RRT UPOS, XPOS and LAS, and on a NER task based on RONEC. Details, as well as more in-depth tests not shown here, are given in the dedicated evaluation page.
The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, as at the time of writing it was the only available BERT model that worked on Romanian.
| Model                        | UPOS  | XPOS  | NER   | LAS   |
|------------------------------|-------|-------|-------|-------|
| bert-base-multilingual-cased | 97.87 | 96.16 | 84.13 | 88.04 |
| bert-base-romanian-cased-v1  | 98.00 | 96.46 | 85.88 | 89.69 |
Corpus
The model is trained on the following corpora (stats in the table below are after cleaning):
Model type: BERT base, cased, for Romanian.

| Corpus    | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|-----------|-----------|-----------|-----------|-----------|
| OPUS      | 55.05     | 635.04    | 4.045     | 3.8       |
| OSCAR     | 33.56     | 1725.82   | 11.411    | 11.0      |
| Wikipedia | 1.54      | 60.47     | 0.411     | 0.4       |
| Total     | 90.15     | 2421.33   | 15.867    | 15.2      |
Citation
If you use this model in a research paper, I'd kindly ask you to cite the following paper:
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324โ4328, Online. Association for Computational Linguistics.
or, in bibtex:
@inproceedings{dumitrescu-etal-2020-birth,
title = "The birth of {R}omanian {BERT}",
author = "Dumitrescu, Stefan and
Avram, Andrei-Marius and
Pyysalo, Sampo",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.findings-emnlp.387",
doi = "10.18653/v1/2020.findings-emnlp.387",
pages = "4324--4328",
}
Acknowledgements
- We'd like to thank Sampo Pyysalo from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
License
This project is licensed under the MIT license.