bert-base-romanian-cased-v1
This is a base, cased BERT model for Romanian, trained on a 15GB corpus. The current version is v1.0.
Quick Start
Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

# the last hidden states: a tensor of shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs[0]
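If you need a single vector per sentence, one common approach is to mean-pool the token embeddings over the attention mask. This is a hedged sketch, not a pooling strategy prescribed by this model card; the variable names are our own:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# encode the sentence; the tokenizer also returns the attention mask
inputs = tokenizer("Acesta este un test.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# mean-pool the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)            # (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # (1, hidden_size)
sentence_embedding = summed / mask.sum(dim=1)            # (1, hidden_size)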
Important Note
Remember to always sanitize your text! Replace the cedilla letters ş and ţ with the comma-below letters ș and ț:
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
The model was NOT trained on cedilla ş and ţ. If you skip this step, performance will decrease due to <UNK> tokens and an increased number of tokens per word.
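For illustration, here is a minimal helper that applies this replacement before tokenizing with the tokenizer loaded above; the function name sanitize and the example sentence are our own, not part of the model card:

# hypothetical helper: map legacy cedilla diacritics to the comma-below
# forms the model was trained on (mapping taken from the note above)
def sanitize(text: str) -> str:
    return (text.replace("ţ", "ț").replace("ş", "ș")
                .replace("Ţ", "Ț").replace("Ş", "Ș"))

raw = "Aşadar, soluţia funcţionează."  # cedilla ş/ţ from a legacy encoding
print(tokenizer.tokenize(raw))             # likely [UNK]s / extra word pieces
print(tokenizer.tokenize(sanitize(raw)))   # expected subword split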
Documentation
Evaluation
Evaluation is performed on Universal Dependencies Romanian RRT UPOS, XPOS and LAS, and on a NER task based on RONEC. Details, as well as more in-depth tests not shown here, are given in the dedicated evaluation page.
The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, as at the time of writing it was the only available BERT model that worked on Romanian.
| Model                        | UPOS  | XPOS  | NER   | LAS   |
|------------------------------|-------|-------|-------|-------|
| bert-base-multilingual-cased | 97.87 | 96.16 | 84.13 | 88.04 |
| bert-base-romanian-cased-v1  | 98.00 | 96.46 | 85.88 | 89.69 |
Corpus
The model is trained on the following corpora (stats in the table below are after cleaning):
Model type: BERT base, cased, for Romanian.

| Corpus    | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|-----------|-----------|-----------|-----------|-----------|
| OPUS      | 55.05     | 635.04    | 4.045     | 3.8       |
| OSCAR     | 33.56     | 1725.82   | 11.411    | 11.0      |
| Wikipedia | 1.54      | 60.47     | 0.411     | 0.4       |
| Total     | 90.15     | 2421.33   | 15.867    | 15.2      |
Citation
If you use this model in a research paper, I'd kindly ask you to cite the following paper:
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324โ4328, Online. Association for Computational Linguistics.
or, in bibtex:
@inproceedings{dumitrescu-etal-2020-birth,
title = "The birth of {R}omanian {BERT}",
author = "Dumitrescu, Stefan and
Avram, Andrei-Marius and
Pyysalo, Sampo",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.findings-emnlp.387",
doi = "10.18653/v1/2020.findings-emnlp.387",
pages = "4324--4328",
}
Acknowledgements
- We'd like to thank Sampo Pyysalo from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
License
This project is licensed under the MIT license.