🚀 Meltemi Instruct Large Language Model for the Greek language
We present the Meltemi 7B Instruct v1.5 Large Language Model (LLM), a new and improved instruction fine-tuned version of Meltemi 7B v1.5. The model aims to provide high-quality language processing capabilities for the Greek language.

📚 Documentation
Model Information
- Vocabulary Extension: The Mistral 7B tokenizer is extended with Greek tokens, resulting in lower costs and faster inference (1.52 vs. 6.80 tokens/word for Greek); see the tokenizer comparison sketch after this list.
- Context Length: 8192 tokens.
- Fine-Tuning: Fine-tuning is performed with the Odds Ratio Preference Optimization (ORPO) algorithm using 97k preference examples (a minimal ORPO sketch also follows this list):
  - 89,730 Greek preference examples, mostly translated versions of high-quality datasets available on Hugging Face.
  - 7,342 English preference examples.
- Alignment Procedure: Our alignment procedure is based on the TRL (Transformer Reinforcement Learning) library and partially on the Hugging Face finetuning recipes.
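
As an illustration of the vocabulary extension above, the sketch below (not part of the model card) compares Greek tokenization efficiency between the base Mistral tokenizer and the extended Meltemi tokenizer; the Mistral checkpoint name and the sample sentence are assumptions, and the Mistral repository may require accepting its license on Hugging Face:

```python
from transformers import AutoTokenizer

# Arbitrary Greek sample sentence used only for illustration.
text = "Η επεξεργασία φυσικής γλώσσας για τα ελληνικά γίνεται πιο αποδοτική με εκτεταμένο λεξιλόγιο."
n_words = len(text.split())

# "mistralai/Mistral-7B-v0.1" is assumed here as the base checkpoint (it may be gated on Hugging Face).
for name in ["mistralai/Mistral-7B-v0.1", "ilsp/Meltemi-7B-Instruct-v1.5"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n_tokens / n_words:.2f} tokens/word")
```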
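
The fine-tuning and alignment bullets can be pictured with a minimal, hypothetical ORPO sketch using TRL's ORPOTrainer; the base checkpoint, hyperparameters, and the toy in-memory dataset are illustrative assumptions only, not the actual Meltemi recipe:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "ilsp/Meltemi-7B-v1.5"  # assumed base checkpoint, for illustration only
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy preference data in the prompt/chosen/rejected format expected by ORPOTrainer.
train_dataset = Dataset.from_dict({
    "prompt": ["Ποια είναι η πρωτεύουσα της Ελλάδας;"],
    "chosen": ["Η πρωτεύουσα της Ελλάδας είναι η Αθήνα."],
    "rejected": ["Δεν γνωρίζω."],
})

# Illustrative hyperparameters; not the values used to train Meltemi 7B Instruct v1.5.
args = ORPOConfig(
    output_dir="orpo-sketch",
    beta=0.1,
    max_length=1024,
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

# Depending on the TRL version, the tokenizer is passed as `tokenizer=` or `processing_class=`.
trainer = ORPOTrainer(model=model, args=args, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()
```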
Instruction format
The prompt format is the same as the Zephyr format and can be used through the tokenizer's chat template functionality, as illustrated below.
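For reference, a Zephyr-style conversation is rendered roughly like the sketch below; the placeholders are illustrative, and the exact special tokens come from the model's bundled chat template, so prefer the tokenizer's apply_chat_template method over hard-coding this layout:

```
<|system|>
{system message}</s>
<|user|>
{user message}</s>
<|assistant|>
```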
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # use "cpu" if no GPU is available

model = AutoModelForCausalLM.from_pretrained("ilsp/Meltemi-7B-Instruct-v1.5")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-Instruct-v1.5")
model.to(device)

# Greek system prompt ("You are Meltemi, a language model for the Greek language; be helpful,
# concise, careful, polite, impartial, honest and respectful towards the user") and a first
# user question ("Tell me whether you are conscious.").
messages = [
    {"role": "system", "content": "Είσαι το Μελτέμι, ένα γλωσσικό μοντέλο για την ελληνική γλώσσα. Είσαι ιδιαίτερα βοηθητικό προς την χρήστρια ή τον χρήστη και δίνεις σύντομες αλλά επαρκώς περιεκτικές απαντήσεις. Απάντα με προσοχή, ευγένεια, αμεροληψία, ειλικρίνεια και σεβασμό προς την χρήστρια ή τον χρήστη."},
    {"role": "user", "content": "Πες μου αν έχεις συνείδηση."},
]

# Render the conversation with the Zephyr-style chat template and generate a reply.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(input_prompt["input_ids"], max_new_tokens=256, do_sample=True)

# Decode only the newly generated tokens so the assistant's reply can be fed back cleanly.
response = tokenizer.batch_decode(outputs[:, input_prompt["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(response)

# Continue the dialogue: append the assistant's reply and a follow-up user question
# ("Do you believe that people should fear artificial intelligence?").
messages.extend([
    {"role": "assistant", "content": response},
    {"role": "user", "content": "Πιστεύεις πως οι άνθρωποι πρέπει να φοβούνται την τεχνητή νοημοσύνη;"},
])

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(input_prompt["input_ids"], max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs[:, input_prompt["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```
Please make sure that the BOS token is always included in the tokenized prompts. This might not be the default setting in all evaluation or fine-tuning frameworks; a quick check is sketched below.
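A minimal sketch of such a check (the sample string is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-Instruct-v1.5")

# Encode an arbitrary prompt and confirm that the first token id is the BOS token id.
ids = tokenizer("Καλημέρα!")["input_ids"]
assert ids[0] == tokenizer.bos_token_id, "BOS missing: enable add_special_tokens / add_bos_token"
```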
Evaluation
The evaluation suite includes 6 test sets and is implemented based on a fork of the lighteval framework.
Our evaluation suite consists of:
- Four machine-translated versions of established English benchmarks for language understanding and reasoning.
- An existing benchmark for question answering in Greek.
- A novel benchmark for medical question answering.
The evaluation is performed in a few-shot setting, consistent with the Open LLM Leaderboard.
The performance improvement of Meltemi 7B Instruct v1.5 is shown in the following table:
| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mistral 7B | 29.8% | 45.0% | 36.5% | 27.1% | 45.8% | 35.0% | 36.5% |
| Meltemi 7B Instruct v1 | 36.1% | 56.0% | 59.0% | 44.4% | 51.1% | 34.1% | 46.8% |
| Meltemi 7B Instruct v1.5 | 48.0% | 75.5% | 63.7% | 40.8% | 53.8% | 45.9% | 54.6% |
Ethical Considerations
This model has been aligned with human preferences, but might generate misleading, harmful, and toxic content.
Acknowledgements
The ILSP team utilized Amazon’s cloud computing services, which were made available via GRNET under the OCRE Cloud framework, providing Amazon Web Services for the Greek Academic and Research Community.
Citation
```bibtex
@misc{voukoutis2024meltemiopenlargelanguage,
  title={Meltemi: The first open Large Language Model for Greek},
  author={Leon Voukoutis and Dimitris Roussis and Georgios Paraskevopoulos and Sokratis Sofianopoulos and Prokopis Prokopidis and Vassilis Papavasileiou and Athanasios Katsamanis and Stelios Piperidis and Vassilis Katsouros},
  year={2024},
  eprint={2407.20743},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.20743},
}
```
License
The model is released under the Apache 2.0 license.