🚀 SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek
SyllaBERTa is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts. It tokenizes at the syllable level and is specifically designed to handle tasks related to prosody, meter, and rhyme.
✨ Features
- Trained on Ancient Greek texts with syllable-level tokenization.
- Customized for prosody, meter, and rhyme tasks.
📦 Installation
The model ships as a standard Hugging Face checkpoint, so the only requirements are the `transformers` library and `torch` (`pip install transformers torch`); the custom tokenizer is loaded with `trust_remote_code=True`, as shown in the usage example below.
📚 Documentation
Model Summary
| Property | Details |
|---|---|
| Base architecture | RoBERTa (custom configuration) |
| Vocabulary size | 42,042 syllabic tokens |
| Hidden size | 768 |
| Number of layers | 12 |
| Attention heads | 12 |
| Intermediate size | 3,072 |
| Max sequence length | 514 |
| Pretraining objective | Masked Language Modeling (MLM) |
| Optimizer | AdamW |
| Loss function | Cross-entropy with 15% token masking probability |
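For orientation, these values correspond roughly to the following `RobertaConfig` (a sketch inferred from the table above, not the authoritative configuration shipped with the checkpoint):

```python
from transformers import RobertaConfig

# Sketch of the architecture summarized above; special-token IDs and other
# fields not listed in the table are left at their library defaults.
config = RobertaConfig(
    vocab_size=42_042,            # syllabic vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
)
```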
The tokenizer is a custom subclass of `PreTrainedTokenizer` that operates on syllables instead of words or characters. It:
- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.
Example tokenization:
Input:
Κατέβην χθὲς εἰς Πειραιᾶ
Tokens:
['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']
Note that adjacent words are fused at the syllable level: the final consonant of χθὲς attaches to the following εἰς, yielding the token 'σεἰσ'.
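As a quick illustration (assuming the checkpoint and `trust_remote_code=True` loading shown in the usage section below), the tokenization above can be reproduced directly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

tokens = tokenizer.tokenize("Κατέβην χθὲς εἰς Πειραιᾶ")
print(tokens)                                      # syllable tokens, e.g. ['κα', 'τέ', 'βην', ...]
print(tokenizer.convert_tokens_to_ids(tokens))     # each syllable maps to a single ID
print(tokenizer.convert_tokens_to_string(tokens))  # syllables rejoined with spaces
```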
💻 Usage Examples
Basic Usage
```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Tokenize into syllables and mask one syllable at random.
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)

with torch.no_grad():
    logits = model(**inputs).logits

# Show the five highest-scoring syllables for the masked position.
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
print("Top 5 predictions for masked token:")
for score, idx in zip(top.values.squeeze(0).tolist(), top.indices.squeeze(0).tolist()):
    print(f"{tokenizer.convert_ids_to_tokens(idx)} (score: {score:.2f})")
```
Since the masked syllable is chosen at random, the exact output varies; one run produced:
```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']
Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ
Top 5 predictions for masked token:
ραι (score: 23.12)
ρα (score: 14.69)
ραισ (score: 12.63)
σαι (score: 12.43)
ρη (score: 12.26)
```
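Given the model's stated focus on prosody and meter, a natural follow-up is scoring whole lines with the MLM. The sketch below is not part of the released code; it reuses `tokenizer` and `model` from the example above and computes a pseudo-log-likelihood by masking one syllable at a time (a common, if rough, MLM scoring heuristic):

```python
import torch

def pseudo_log_likelihood(text: str) -> float:
    """Sum the log-probability the model assigns to each syllable
    when that syllable alone is masked out."""
    syllables = tokenizer.tokenize(text)
    total = 0.0
    for i, syllable in enumerate(syllables):
        masked = syllables.copy()
        masked[i] = tokenizer.mask_token
        inputs = tokenizer(tokenizer.convert_tokens_to_string(masked), return_tensors="pt")
        inputs.pop("token_type_ids", None)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Log-probability of the original syllable at the masked position.
        mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
        log_probs = logits[0, mask_pos].log_softmax(dim=-1)
        total += log_probs[0, tokenizer.convert_tokens_to_ids(syllable)].item()
    return total

# Higher (less negative) scores mean the model finds the line more plausible.
print(pseudo_log_likelihood("Κατέβην χθὲς εἰς Πειραιᾶ"))
```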
📄 License
This project is under the MIT License.
👥 Authors
This work is part of ongoing research by Eric Cullhed (Uppsala University) and Albin Thörn Cleland (Lund University).
🙏 Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.