roberta-el-news: Open-Source RoBERTa Model for Masked Language Modeling on Greek News Data

Roberta El News

Developed by cvcio

RoBERTa model pretrained on Greek news data, specializing in masked language modeling tasks

Large Language Model

Transformers

Open Source License: GPL-3.0

Tags: #Greek News Analysis #Masked Language Modeling #High-precision NER

Downloads 51

Release Time : 3/2/2022

Model Overview

This is a RoBERTa model pretrained on Greek news data with the masked language modeling (MLM) objective. It is suitable for Greek text processing tasks.

Model Features

Trained on Greek News Data

Pretrained on 8 million Greek news articles (approximately 160 million sentences) from 2016-2021

Preserves Diacritics

Retains all diacritical marks when processing Greek text

Case Insensitive

Model is insensitive to text casing

Efficient Tokenization

Uses a byte-level BPE tokenizer with a vocabulary of 50,265 tokens

Model Capabilities

Greek Text Understanding

Masked Language Prediction

Named Entity Recognition (with fine-tuning)

Use Cases

News Analysis

Political News Analysis

Analyzing key information in Greek political news

In the example provided, the model predicted a key term masked out of a political news report

Text Completion

News Text Completion

Predicting masked words in news texts

In the example provided, the top predictions were 'δημοσιοποίηση' (making public) and 'δημοσίευση' (publication, release)

🚀 RoBERTa Greek base model

This is a model pretrained on the Greek language using the Masked Language Modeling (MLM) objective with Hugging Face's Transformers library. It is case-insensitive and retains all Greek diacritics.

🚀 Quick Start

You can use this model directly with a pipeline for masked language modeling:

# Example sentence from the article below, which is not present in the train/eval set:
# https://www.news247.gr/politiki/misologa-maximoy-gia-tin-ekthesi-tsiodra-lytra-gia-ti-thnitotita-ektos-meth.9462425.html
from transformers import pipeline
# Load the fill-mask pipeline with the pretrained checkpoint
pipe = pipeline('fill-mask', model='cvcio/roberta-el-news')
# Predict the word hidden behind <mask>; the five most likely candidates are returned by default
pipe(
    'Η κυβέρνηση μουδιασμένη από τη <mask> της έκθεσης Τσιόδρα-Λύτρα, '
    'επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.'
)
# Output:
[
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη δημοσιοποίηση της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.5881184339523315, 'token': 20235, 'token_str': ' δημοσιοποίηση'
    }, 
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη δημοσίευση της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.05952141433954239, 'token': 9696, 'token_str': ' δημοσίευση'
    }, 
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη διαχείριση της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.029887061566114426, 'token': 4315, 'token_str': ' διαχείριση'
    }, 
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη διαρροή της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.022848669439554214, 'token': 24940, 'token_str': ' διαρροή'
    }, 
    {
        'sequence': 'Η κυβέρνηση μουδιασμένη από τη ματαίωση της έκθεσης Τσιόδρα-Λύτρα, επιχειρεί χωρίς να απαντά ουσιαστικά να ρίξει ευθύνες στον ΣΥΡΙΖΑ, που κυβερνούσε πριν... 2 χρόνια.', 
        'score': 0.01729060709476471, 'token': 46913, 'token_str': ' ματαίωση'
    }
]
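The list returned by the pipeline can be post-processed like any list of dictionaries. The sketch below is illustrative only: it hard-codes three of the predictions shown above so that it runs without downloading the model, and the helper name `top_candidates` and the 0.05 threshold are hypothetical choices, not part of the model card.

```python
# Three of the predictions shown above, copied by hand
# (the 'sequence' field is omitted for brevity).
predictions = [
    {"score": 0.5881184339523315, "token": 20235, "token_str": " δημοσιοποίηση"},
    {"score": 0.05952141433954239, "token": 9696, "token_str": " δημοσίευση"},
    {"score": 0.029887061566114426, "token": 4315, "token_str": " διαχείριση"},
]


def top_candidates(results, min_score=0.05):
    """Return (word, score) pairs with score >= min_score, best first.

    Hypothetical helper: 'token_str' carries a leading space in the
    output above, so it is stripped before being returned.
    """
    kept = [
        (r["token_str"].strip(), r["score"])
        for r in results
        if r["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


candidates = top_candidates(predictions)
for word, score in candidates:
    print(f"{word}: {score:.3f}")
```

With a live pipeline, the same helper could be applied directly to the value returned by `pipe(...)`, assuming it has the shape shown above.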

✨ Features

Pretrained on the Greek language with the Masked Language Modeling (MLM) objective.
Case-insensitive and retains Greek diacritics.

📚 Documentation

Training data

The model was pretrained on 8 million unique news articles (approximately 160M sentences, 33 GB of text), collected with MediaWatch from October 2016 to December 2021.

Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,265. The only preprocessing step was unescaping HTML entities to the corresponding Unicode characters (e.g. &amp; => &).
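The unescaping step can be illustrated with Python's standard library. This is a minimal sketch, assuming `html.unescape` behaves like the step described; the model card does not say which tool the authors used, and the wrapper name `unescape_html` and the sample strings are hypothetical.

```python
import html


def unescape_html(text: str) -> str:
    """Replace named and numeric HTML entities with their Unicode characters.

    Hypothetical wrapper around the standard-library html.unescape;
    the original preprocessing code is not published in the model card.
    """
    return html.unescape(text)


# Made-up sample string containing escaped characters.
raw = "Υγεία &amp; Πολιτική &quot;σήμερα&quot;"
clean = unescape_html(raw)
print(clean)

# Numeric character references are decoded as well.
greek = unescape_html("&#945;&#946;")
print(greek)
```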

Pretraining

The model was pretrained on an NVIDIA A10 GPU for 3 epochs (approximately 760K steps, 182 hours) with a batch size of 14 (x2 gradient accumulation steps = 28) and a sequence length of 512 tokens. The optimizer is Adam with a learning rate of 5e-5 and linear decay of the learning rate.
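To make the linear decay concrete, the sketch below computes the learning rate at a given step. It assumes the rate starts at 5e-5 and falls linearly to zero at the final step with no warmup phase; the model card does not mention warmup, so that part is an assumption. The function name `linear_decay_lr` is hypothetical, and the total step count is taken from the "Training results" table.

```python
def linear_decay_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Learning rate after `step` updates, decaying linearly from base_lr to 0.

    Assumes no warmup phase (not stated in the model card).
    """
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


TOTAL_STEPS = 765_414  # step count reported under "Training results"

# Print the learning rate at the start, midpoint, and end of training.
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>7}: lr = {linear_decay_lr(step, TOTAL_STEPS):.2e}")
```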

Training results

epochs	steps	train/train_loss	train/loss	eval/loss
3	765,414	0.3960	1.2356	0.9028

Evaluation results

The model was fine-tuned on the NER task using the elNER dataset and achieved the following results:

task	epochs	lr	batch	dataset	precision	recall	f1	accuracy
ner	5	1e-5	16/16	elNER4	0.8954	0.9280	0.9114	0.9872
ner	5	1e-4	16/16	elNER18	0.9069	0.9268	0.9168	0.9823
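As a sanity check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The snippet below is a small sketch using only the figures reported above; the function name `f1_score` is just an illustrative label.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "elNER4": (0.8954, 0.9280, 0.9114),
    "elNER18": (0.9069, 0.9268, 0.9168),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.4f}, reported F1 = {f1:.4f}")
```

Because the precision and recall values in the table are rounded to four decimals, the recomputed F1 can differ from the reported one in the last digit.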

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-5
train_batch_size: 14
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 28
optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 3.0
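The relationship between these values can be checked with a few lines of arithmetic. The sketch below assumes that each reported step is one optimizer update consuming one effective batch; the model card does not state this explicitly, so the derived totals are rough estimates only. The step count and training time come from the "Pretraining" and "Training results" sections.

```python
train_batch_size = 14
gradient_accumulation_steps = 2
total_steps = 765_414   # from "Training results"
num_epochs = 3
training_hours = 182    # from "Pretraining"

# Effective batch size: sequences contributing to one optimizer update.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Estimates under the assumption that one step == one optimizer update.
sequences_seen = total_steps * effective_batch_size
sequences_per_epoch = sequences_seen / num_epochs
steps_per_second = total_steps / (training_hours * 3600)

print(f"effective batch size: {effective_batch_size}")
print(f"sequences seen in total: {sequences_seen:,}")
print(f"sequences per epoch: {sequences_per_epoch:,.0f}")
print(f"steps per second: {steps_per_second:.2f}")
```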

Framework versions

Transformers 4.13.0
Pytorch 1.9.0+cu111
Datasets 1.16.1
Tokenizers 0.10.3

🔧 Technical Details

The model uses a byte-level version of Byte-Pair Encoding (BPE) for tokenization with a vocabulary size of 50,265. It was pretrained on Greek news data using an NVIDIA A10 GPU for 3 epochs with an effective batch size of 28, a learning rate of 5e-5 with linear decay, and a sequence length of 512 tokens.
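Byte-level BPE operates on the UTF-8 bytes of the text rather than on characters, which is why accented Greek letters never fall outside the vocabulary. The sketch below only illustrates that starting point with the standard library; it does not reproduce the model's learned merges, and the sample word is taken from the Quick Start sentence.

```python
import unicodedata

# Normalize to NFC so that the accented letter is a single code point.
word = unicodedata.normalize("NFC", "κυβέρνηση")
utf8_bytes = word.encode("utf-8")

# Each Greek letter, accented or not, occupies two bytes in UTF-8.
print(len(word), "characters ->", len(utf8_bytes), "bytes")
print(list(utf8_bytes[:4]))

# Every byte is a value from 0 to 255, so a byte-level base vocabulary
# of 256 symbols covers any input without an "unknown" token.
all_in_range = all(0 <= b <= 255 for b in utf8_bytes)

# The byte sequence decodes back to the original text, diacritics included.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == word)
```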

📄 License

This project is licensed under the GPL-3.0 license.
