🚀 SyllaBERTa: A Syllable-Based RoBERTa for Ancient Greek
SyllaBERTa is an experimental Transformer-based masked language model (MLM) trained on Ancient Greek texts. It tokenizes at the syllable level and is specifically designed to handle tasks related to prosody, meter, and rhyme.
✨ Features
- Trained on Ancient Greek texts with syllable-level tokenization.
- Customized for prosody, meter, and rhyme tasks.
📦 Installation
The model ships as a standard Hugging Face checkpoint, so the only requirements are the `transformers` library and `torch` (`pip install transformers torch`); the custom tokenizer is loaded with `trust_remote_code=True`, as shown in the usage example below.
📚 Documentation
Model Summary
| Property | Details |
|---|---|
| Base architecture | RoBERTa (custom configuration) |
| Vocabulary size | 42,042 syllabic tokens |
| Hidden size | 768 |
| Number of layers | 12 |
| Attention heads | 12 |
| Intermediate size | 3,072 |
| Max sequence length | 514 |
| Pretraining objective | Masked Language Modeling (MLM) |
| Optimizer | AdamW |
| Loss function | Cross-entropy with 15% token masking probability |
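For orientation, these values correspond roughly to the following `RobertaConfig` (a sketch inferred from the table above, not the authoritative configuration shipped with the checkpoint):

```python
from transformers import RobertaConfig

# Sketch of the architecture summarized above; special-token IDs and other
# fields not listed in the table are left at their library defaults.
config = RobertaConfig(
    vocab_size=42_042,            # syllabic vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=514,
)
```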
The tokenizer is a custom subclass of `PreTrainedTokenizer` that operates on syllables instead of words or characters. It:
- Maps each syllable to an ID.
- Supports diphthong merging and Greek orthographic phenomena.
- Uses space-separated syllable tokens.
Example tokenization:
Input:
Κατέβην χθὲς εἰς Πειραιᾶ
Tokens:
['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ']
Note that adjacent words are fused at the syllable level: the final consonant of χθὲς attaches to the following εἰς, yielding the token 'σεἰσ'.
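As a quick illustration (assuming the checkpoint and `trust_remote_code=True` loading shown in the usage section below), the tokenization above can be reproduced directly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

tokens = tokenizer.tokenize("Κατέβην χθὲς εἰς Πειραιᾶ")
print(tokens)                                      # syllable tokens, e.g. ['κα', 'τέ', 'βην', ...]
print(tokenizer.convert_tokens_to_ids(tokens))     # each syllable maps to a single ID
print(tokenizer.convert_tokens_to_string(tokens))  # syllables rejoined with spaces
```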
💻 Usage Examples
Basic Usage
```python
import random

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("Ericu950/SyllaBERTa", trust_remote_code=True)

# Tokenize into syllables and mask one syllable at random.
text = "Κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος"
tokens = tokenizer.tokenize(text)
print("Original tokens:", tokens)

mask_position = random.randint(0, len(tokens) - 1)
tokens[mask_position] = tokenizer.mask_token
masked_text = tokenizer.convert_tokens_to_string(tokens)
print("Masked at position", mask_position)
print("Masked text:", masked_text)

inputs = tokenizer(masked_text, return_tensors="pt", padding=True, truncation=True)
inputs.pop("token_type_ids", None)

with torch.no_grad():
    logits = model(**inputs).logits

# Show the five highest-scoring syllables for the masked position.
mask_token_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top = logits[0, mask_token_index].topk(5, dim=-1)
print("Top 5 predictions for masked token:")
for score, idx in zip(top.values.squeeze(0).tolist(), top.indices.squeeze(0).tolist()):
    print(f"{tokenizer.convert_ids_to_tokens(idx)} (score: {score:.2f})")
```
Since the masked syllable is chosen at random, the exact output varies; one run produced:
```
Original tokens: ['κα', 'τέ', 'βην', 'χθὲ', 'σεἰσ', 'πει', 'ραι', 'ᾶ', 'με', 'τὰγ', 'λαύ', 'κω', 'νοσ', 'τοῦ', 'ἀ', 'ρίσ', 'τω', 'νοσ']
Masked at position 6
Masked text: κα τέ βην χθὲ σεἰσ πει [MASK] ᾶ με τὰγ λαύ κω νοσ τοῦ ἀ ρίσ τω νοσ
Top 5 predictions for masked token:
ραι (score: 23.12)
ρα (score: 14.69)
ραισ (score: 12.63)
σαι (score: 12.43)
ρη (score: 12.26)
```
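Given the model's stated focus on prosody and meter, a natural follow-up is scoring whole lines with the MLM. The sketch below is not part of the released code; it reuses `tokenizer` and `model` from the example above and computes a pseudo-log-likelihood by masking one syllable at a time (a common, if rough, MLM scoring heuristic):

```python
import torch

def pseudo_log_likelihood(text: str) -> float:
    """Sum the log-probability the model assigns to each syllable
    when that syllable alone is masked out."""
    syllables = tokenizer.tokenize(text)
    total = 0.0
    for i, syllable in enumerate(syllables):
        masked = syllables.copy()
        masked[i] = tokenizer.mask_token
        inputs = tokenizer(tokenizer.convert_tokens_to_string(masked), return_tensors="pt")
        inputs.pop("token_type_ids", None)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Log-probability of the original syllable at the masked position.
        mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
        log_probs = logits[0, mask_pos].log_softmax(dim=-1)
        total += log_probs[0, tokenizer.convert_tokens_to_ids(syllable)].item()
    return total

# Higher (less negative) scores mean the model finds the line more plausible.
print(pseudo_log_likelihood("Κατέβην χθὲς εἰς Πειραιᾶ"))
```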
📄 License
This project is under the MIT License.
👥 Authors
This work is part of ongoing research by Eric Cullhed (Uppsala University) and Albin Thörn Cleland (Lund University).
🙏 Acknowledgements
The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.