Meltemi: A Large Foundation Language Model for the Greek Language
Meltemi is the first Greek Large Language Model (LLM), trained by the Institute for Language and Speech Processing at Athena Research & Innovation Center. Built on Mistral-7B, it extends capabilities for Greek via continual pretraining on a large Greek text corpus. We offer Meltemi-7B-v1 and an instruction fine-tuned version, Meltemi-7B-Instruct-v1.
Quick Start
Newer Version Notice
This model has been superseded by a newer version (v1.5).
Model Introduction
We introduce Meltemi, leveraging Mistral-7B and enhancing it for Greek through pretraining on a large corpus of high-quality Greek texts.

Features
- Vocabulary Extension: Extended the Mistral-7B tokenizer with Greek tokens (see the tokenizer sketch after the corpus table below).
- Context Length: 8,192 tokens.
- Pretraining Extension: Extended the pretraining of Mistral-7B for Greek proficiency using a large corpus of about 40 billion tokens.
  - The corpus includes 28.5 billion monolingual Greek tokens from public resources, 10.5 billion monolingual English tokens, and 600 million tokens of Greek-English parallel data.
  - The corpus has been processed, filtered, and deduplicated for data quality.
| Sub-corpus | # Tokens | Percentage |
|------------|----------|------------|
| Greek      | 28,555,902,360 | 72.0% |
| English    | 10,478,414,033 | 26.4% |
| Parallel   | 633,816,023    | 1.6%  |
| **Total**  | 39,668,132,416 | 100%  |
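As a quick illustration of the extended tokenizer, the sketch below loads it with the Hugging Face transformers library and compares how a Greek sentence is tokenized against the original Mistral-7B tokenizer. The repository IDs and the comparison itself are assumptions for illustration, not part of this card.

```python
from transformers import AutoTokenizer

# Repository IDs assumed for illustration; adjust to the repositories you are using.
meltemi_tok = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-v1")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "Το μελτέμι είναι ένας ισχυρός βόρειος άνεμος του Αιγαίου."

# The extended vocabulary should encode Greek text into noticeably fewer tokens
# than the original Mistral-7B tokenizer.
print("Meltemi vocab size:", len(meltemi_tok))
print("Mistral vocab size:", len(mistral_tok))
print("Meltemi token count:", len(meltemi_tok(text)["input_ids"]))
print("Mistral token count:", len(mistral_tok(text)["input_ids"]))
```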
Usage Examples
Basic Usage
Please make sure that the BOS token is always included in the tokenized prompts. This might not be the default setting in all evaluation or fine-tuning frameworks.
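A minimal sketch of loading the model with transformers and verifying that the BOS token is prepended to the prompt; the repository ID, the Greek prompt, and the generation settings are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID assumed for illustration.
model_id = "ilsp/Meltemi-7B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Η πρωτεύουσα της Ελλάδας είναι"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Verify that the BOS token was prepended; some frameworks disable this by default.
assert inputs["input_ids"][0, 0].item() == tokenizer.bos_token_id, "BOS token missing from the prompt"

outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```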
Documentation
Evaluation
The evaluation suite includes 6 test sets integrated with [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness).
The evaluation is performed in a few-shot setting, consistent with the Open LLM leaderboard. Our training enhances performance across all Greek test sets by an average of +14.9% (a sketch of running the harness is given after the table below).
| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|---|---|---|---|---|---|---|---|
| Mistral 7B | 29.8% | 45.0% | 36.5% | 27.1% | 45.8% | 35.0% | 36.5% |
| Meltemi 7B | 41.0% | 63.6% | 61.6% | 43.2% | 52.1% | 47.0% | 51.4% |
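As a sketch of how such a few-shot evaluation can be run with the lm-eval-harness Python API (v0.4+), the example below uses a placeholder Greek task name and the assumed repository ID `ilsp/Meltemi-7B-v1`; the actual task identifiers and harness version used for the results above may differ.

```python
from lm_eval import evaluator

# "hellaswag_el" is a placeholder task name for the Greek HellaSwag test set;
# the identifier registered in the harness (or in a fork carrying the Greek tasks) may differ.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=ilsp/Meltemi-7B-v1,dtype=bfloat16",
    tasks=["hellaswag_el"],
    num_fewshot=10,  # matches the 10-shot setting reported for HellaSwag EL
)

print(results["results"])
```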
Ethical Considerations
This model has not been aligned with human preferences and may therefore generate misleading, harmful, or toxic content.
Acknowledgements
The ILSP team utilized Amazon's cloud computing services via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.
Citation
@misc{voukoutis2024meltemiopenlargelanguage,
title={Meltemi: The first open Large Language Model for Greek},
author={Leon Voukoutis and Dimitris Roussis and Georgios Paraskevopoulos and Sokratis Sofianopoulos and Prokopis Prokopidis and Vassilis Papavasileiou and Athanasios Katsamanis and Stelios Piperidis and Vassilis Katsouros},
year={2024},
eprint={2407.20743},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.20743},
}
License
This model is licensed under the Apache 2.0 license.