RobeCzech Model Card
RobeCzech is a monolingual RoBERTa language representation model trained on Czech data. It can be used for Fill-Mask tasks and various downstream NLP tasks.
Quick Start
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
```
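A quick way to try the model is the fill-mask pipeline. The sketch below is illustrative: the Czech example sentence is arbitrary, and the top predictions are simply printed with their scores.

```python
from transformers import pipeline

# Illustrative fill-mask example; the sentence is arbitrary Czech text.
fill_mask = pipeline("fill-mask", model="ufal/robeczech-base")
masked_sentence = f"Praha je hlavní město {fill_mask.tokenizer.mask_token}."

for prediction in fill_mask(masked_sentence):
    print(prediction["token_str"], round(prediction["score"], 4))
```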
Features
- Monolingual Model: Specifically trained on Czech data, suitable for Czech language processing.
- Multiple NLP Tasks: Can be used for Fill-Mask tasks, as well as downstream tasks such as morphological tagging, lemmatization, dependency parsing, named entity recognition, and semantic parsing.
Installation
No model-specific installation is required; the model is loaded through the Hugging Face Transformers library (for example, `pip install transformers`).
Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
```
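For lower-level control, the loaded model and tokenizer can be used directly to score candidates for a masked position. This is a minimal sketch continuing the snippet above; the example sentence and the number of candidates shown are arbitrary choices.

```python
import torch

# Score candidate tokens for a masked position (continues the snippet above).
text = f"Hlavním městem České republiky je {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and print the five most likely subword tokens.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```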
Documentation
Version History
- version 1.1: Released in January 2024. The tokenizer was modified by removing a hole in the token-ID mapping and mapping every token to a unique ID. The model parameters were mostly kept the same, but the embeddings were enlarged to match the updated tokenizer and the pooler was dropped. The weights of version 1.1 are not compatible with the configuration of version 1.0, and vice versa.
- version 1.0: Initial version released in May 2021, with some tokenization issues. A version 1.0 pretrained model, configuration, or tokenizer can be loaded with `from_pretrained("ufal/robeczech-base", revision="v1.0")`, as shown below.
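A minimal sketch of loading the original release via the revision argument mentioned above:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the original v1.0 weights and tokenizer instead of the default v1.1.
tokenizer_v1_0 = AutoTokenizer.from_pretrained("ufal/robeczech-base", revision="v1.0")
model_v1_0 = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base", revision="v1.0")
```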
Model Details
- Developed by: Institute of Formal and Applied Linguistics, Charles University, Prague (UFAL)
- Shared by: Hugging Face and LINDAT/CLARIAH-CZ
- Model type: Fill-Mask
- Language(s) (NLP): cs
- License: cc-by-nc-sa-4.0
- Model Architecture: RoBERTa
- Resources for more information: the RobeCzech paper (see the Citation section below).
Uses
- Direct Use: Fill-Mask tasks.
- Downstream Use: Morphological tagging and lemmatization, dependency parsing, named entity recognition, and semantic parsing (see the frozen-embedding sketch below).
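As a sketch of the frozen-embedding setting, the base encoder can serve as a feature extractor whose outputs feed a downstream tagger or parser. The example sentence is arbitrary, and the downstream model itself is not shown.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Use RobeCzech as a frozen feature extractor; the sentence is illustrative.
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
encoder = AutoModel.from_pretrained("ufal/robeczech-base")
encoder.eval()

inputs = tokenizer("Ahoj, jak se máš?", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual embedding per subword: shape (batch, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)
```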
Bias, Risks, and Limitations
Predictions generated by the model may include disturbing and harmful stereotypes across protected classes, identity characteristics, and sensitive, social, and occupational groups.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
Training Details
Training Data
The model was trained on a collection of publicly available texts: the SYN v4 corpus, the Czes corpus, documents from the Czech part of the W2C web corpus, and plain texts extracted from the Czech Wikipedia dump 20201020. In total, the corpora contain 4,917M tokens.
Training Procedure
- Preprocessing: The texts were tokenized into subwords with a byte-level BPE (BBPE) tokenizer whose vocabulary is limited to 52,000 items (see the tokenizer sketch after this list).
- Speeds, Sizes, Times: The training batch size was 8,192, and each training batch consisted of contiguously sampled sentences. The Adam optimizer was used to minimize the masked language-modeling objective.
- Software Used: The Fairseq implementation was used for training.
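The following snippet is a small, illustrative check of the tokenizer described above: it loads the published tokenizer, prints its vocabulary size, and tokenizes an arbitrary Czech sentence.

```python
from transformers import AutoTokenizer

# Inspect the byte-level BPE tokenizer; the sentence is arbitrary.
tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
print(len(tokenizer))  # vocabulary size, limited to roughly 52,000 items
print(tokenizer.tokenize("Žluťoučký kůň úpěl ďábelské ódy."))
```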
Evaluation
Testing Data, Factors & Metrics
The model was evaluated on five NLP tasks: three of them using frozen contextualized word embeddings, and two using fine-tuning.
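As a hedged illustration of the fine-tuning setting, a token-classification head can be placed on top of the pretrained encoder, for example for NER or tagging. The label count below is a placeholder and the head is randomly initialized until trained; this is not the exact configuration used in the reported experiments.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Placeholder label set size; replace with the size of your own tag inventory.
NUM_LABELS = 9

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForTokenClassification.from_pretrained(
    "ufal/robeczech-base", num_labels=NUM_LABELS
)
# The classification head is newly initialized; fine-tune it on labeled Czech data,
# e.g. with the Hugging Face Trainer or a custom training loop.
```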
Results
| Model | Morphosynt PDT3.5 (POS / LAS) | Morphosynt UD2.3 (XPOS / LAS) | NER CNEC1.1 (nested / flat) | Semant. PTG (Avg / F1) |
|:------|:-----------------------------:|:-----------------------------:|:---------------------------:|:----------------------:|
| RobeCzech | 98.50 / 91.42 | 98.31 / 93.77 | 87.82 / 87.47 | 92.36 / 80.13 |
Environmental Impact
- Hardware Type: 8× NVIDIA Quadro P5000 GPUs
- Hours used: 2,190 (about 3 months)
Citation
@InProceedings{10.1007/978-3-030-83527-9_17,
author={Straka, Milan and N{\'a}plava, Jakub and Strakov{\'a}, Jana and Samuel, David},
editor={Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav},
title={{RobeCzech: Czech RoBERTa, a Monolingual Contextualized Language Representation Model}},
booktitle="Text, Speech, and Dialogue",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="197--209",
isbn="978-3-030-83527-9"
}
Technical Details
The model uses a RoBERTa architecture and is trained on a large corpus of Czech texts. The tokenizer is a byte-level BPE (BBPE) tokenizer with a vocabulary size of 52,000 items. The training batch size is 8,192, and the Adam optimizer is used to minimize the masked language-modeling objective.
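To make the objective concrete, the sketch below masks a single token by hand and computes the cross-entropy loss at that position only. The sentence and the masked position are arbitrary illustrations, not the actual training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

inputs = tokenizer("Praha je hlavní město České republiky.", return_tensors="pt")
labels = inputs["input_ids"].clone()

# Mask one (arbitrarily chosen) position and ignore all others in the loss.
mask_position = 4
inputs["input_ids"][0, mask_position] = tokenizer.mask_token_id
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100

with torch.no_grad():
    outputs = model(**inputs, labels=labels)
print(outputs.loss)  # masked language-modeling cross-entropy at the masked position
```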
License
The model is licensed under cc-by-nc-sa-4.0.