RobeCzech Model Card
RobeCzech is a monolingual RoBERTa language representation model trained on Czech data. It can be used for Fill-Mask tasks and various downstream NLP tasks.
Quick Start
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
```
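A quick way to try the model is the fill-mask pipeline. The sketch below is illustrative: the Czech example sentence is arbitrary, and the top predictions are simply printed with their scores.

```python
from transformers import pipeline

# Illustrative fill-mask example; the sentence is arbitrary Czech text.
fill_mask = pipeline("fill-mask", model="ufal/robeczech-base")
masked_sentence = f"Praha je hlavní město {fill_mask.tokenizer.mask_token}."

for prediction in fill_mask(masked_sentence):
    print(prediction["token_str"], round(prediction["score"], 4))
```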
Features
- Monolingual Model: Specifically trained on Czech data, suitable for Czech language processing.
- Multiple NLP Tasks: Can be used for Fill-Mask tasks, as well as downstream tasks such as morphological tagging, lemmatization, dependency parsing, named entity recognition, and semantic parsing.
Installation
No model-specific installation is required; the model is loaded through the Hugging Face Transformers library (for example, `pip install transformers`).
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
```
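For lower-level control, the loaded model and tokenizer can be used directly to score candidates for a masked position. This is a minimal sketch continuing the snippet above; the example sentence and the number of candidates shown are arbitrary choices.

```python
import torch

# Score candidate tokens for a masked position (continues the snippet above).
text = f"Hlavním městem České republiky je {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and print the five most likely subword tokens.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```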
Documentation
Version History
- version 1.1: Released in January 2024. The tokenizer was modified by removing a hole in the token-ID mapping and mapping every token to a unique ID. The model parameters were mostly kept the same, but the embeddings were enlarged to match the updated tokenizer and the pooler was dropped. The weights of version 1.1 are not compatible with the configuration of version 1.0, and vice versa.
- version 1.0: Initial version released in May 2021, with some tokenization issues. A version 1.0 pretrained model, configuration, or tokenizer can be loaded with `from_pretrained("ufal/robeczech-base", revision="v1.0")`, as shown below.
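A minimal sketch of loading the original release via the revision argument mentioned above:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the original v1.0 weights and tokenizer instead of the default v1.1.
tokenizer_v1_0 = AutoTokenizer.from_pretrained("ufal/robeczech-base", revision="v1.0")
model_v1_0 = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base", revision="v1.0")
```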
Model Details
- Developed by: Institute of Formal and Applied Linguistics, Charles University, Prague (UFAL)
- Shared by: Hugging Face and LINDAT/CLARIAH-CZ
- Model type: Fill-Mask
- Language(s) (NLP): cs
- License: cc-by-nc-sa-4.0
- Model Architecture: RoBERTa
- Resources for more information: the RobeCzech paper (see the Citation section below).
Uses
- Direct Use: Fill-Mask tasks.
- Downstream Use: Morphological tagging and lemmatization, dependency parsing, named entity recognition, and semantic parsing (see the frozen-embedding sketch below).
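As a sketch of the frozen-embedding setting, the base encoder can serve as a feature extractor whose outputs feed a downstream tagger or parser. The example sentence is arbitrary, and the downstream model itself is not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Use RobeCzech as a frozen feature extractor; the sentence is illustrative.
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
encoder = AutoModel.from_pretrained("ufal/robeczech-base")
encoder.eval()

inputs = tokenizer("Ahoj, jak se máš?", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual embedding per subword: shape (batch, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```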
Bias, Risks, and Limitations
Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
Training Details
Training Data
The model was trained on a collection of publicly available texts: the SYN v4 corpus, the Czes corpus, documents from the Czech part of the W2C web corpus, and plain texts extracted from the Czech Wikipedia dump 20201020. In total, the corpora contain 4,917M tokens.
Training Procedure
- Preprocessing: The texts were tokenized into subwords with a byte-level BPE (BBPE) tokenizer whose vocabulary is limited to 52,000 items (see the tokenizer sketch after this list).
- Speeds, Sizes, Times: The training batch size was 8,192, and each training batch consisted of contiguously sampled sentences. The Adam optimizer was used to minimize the masked language-modeling objective.
- Software Used: The Fairseq implementation was used for training.
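The following snippet is a small, illustrative check of the tokenizer described above: it loads the published tokenizer, prints its vocabulary size, and tokenizes an arbitrary Czech sentence.

```python
from transformers import AutoTokenizer

# Inspect the byte-level BPE tokenizer; the sentence is arbitrary.
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
print(len(tokenizer))  # vocabulary size, limited to roughly 52,000 items
print(tokenizer.tokenize("Žluťoučký kůň úpěl ďábelské ódy."))
```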
Evaluation
Testing Data, Factors & Metrics
The model was evaluated on five NLP tasks: three of them using frozen contextualized word embeddings, and two using fine-tuning.
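As a hedged illustration of the fine-tuning setting, a token-classification head can be placed on top of the pretrained encoder, for example for NER or tagging. The label count below is a placeholder and the head is randomly initialized until trained; this is not the exact configuration used in the reported experiments.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder label set size; replace with the size of your own tag inventory.
NUM_LABELS = 9

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForTokenClassification.from_pretrained(
    "ufal/robeczech-base", num_labels=NUM_LABELS
)
# The classification head is newly initialized; fine-tune it on labeled Czech data,
# e.g. with the Hugging Face Trainer or a custom training loop.
```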
Results
| Model | Morphosynt PDT3.5 (POS / LAS) | Morphosynt UD2.3 (XPOS / LAS) | NER CNEC1.1 (nested / flat) | Semant. PTG (Avg / F1) |
|:------|:-----------------------------:|:-----------------------------:|:---------------------------:|:----------------------:|
| RobeCzech | 98.50 / 91.42 | 98.31 / 93.77 | 87.82 / 87.47 | 92.36 / 80.13 |
Environmental Impact
- Hardware Type: 8× NVIDIA Quadro P5000 GPUs
- Hours used: 2,190 (about 3 months)
Citation
@InProceedings{10.1007/978-3-030-83527-9_17,
author={Straka, Milan and N{\'a}plava, Jakub and Strakov{\'a}, Jana and Samuel, David},
editor={Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav},
title={{RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model}},
booktitle="Text, Speech, and Dialogue",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="197--209",
isbn="978-3-030-83527-9"
}
Technical Details
The model uses a RoBERTa architecture and is trained on a large corpus of Czech texts. The tokenizer is a byte-level BPE (BBPE) tokenizer with a vocabulary size of 52,000 items. The training batch size is 8,192, and the Adam optimizer is used to minimize the masked language-modeling objective.
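To make the objective concrete, the sketch below masks a single token by hand and computes the cross-entropy loss at that position only. The sentence and the masked position are arbitrary illustrations, not the actual training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

inputs = tokenizer("Praha je hlavní město České republiky.", return_tensors="pt")
labels = inputs["input_ids"].clone()

# Mask one (arbitrarily chosen) position and ignore all others in the loss.
mask_position = 4
inputs["input_ids"][0, mask_position] = tokenizer.mask_token_id
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100

with torch.no_grad():
    outputs = model(**inputs, labels=labels)
print(outputs.loss)  # masked language-modeling cross-entropy at the masked position
```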
License
The model is licensed under cc-by-nc-sa-4.0.