🚀 Meltemi Instruct Large Language Model for the Greek language
We present the Meltemi 7B Instruct v1.5 Large Language Model (LLM), a new and improved instruction fine-tuned version of Meltemi 7B v1.5. The model aims to provide high-quality language processing capabilities for the Greek language.

📚 Documentation
Model Information
- Vocabulary Extension: The Mistral 7B tokenizer is extended with Greek tokens, resulting in lower costs and faster inference (1.52 vs. 6.80 tokens/word for Greek); see the tokenizer comparison sketch after this list.
- Context Length: 8192 tokens.
- Fine-Tuning: Fine-tuning is performed with the Odds Ratio Preference Optimization (ORPO) algorithm using 97k preference examples (a minimal ORPO sketch also follows this list):
  - 89,730 Greek preference examples, mostly translated versions of high-quality datasets available on Hugging Face.
  - 7,342 English preference examples.
- Alignment Procedure: Our alignment procedure is based on the TRL (Transformer Reinforcement Learning) library and partially on the Hugging Face finetuning recipes.
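
As an illustration of the vocabulary extension above, the sketch below (not part of the model card) compares Greek tokenization efficiency between the base Mistral tokenizer and the extended Meltemi tokenizer; the Mistral checkpoint name and the sample sentence are assumptions, and the Mistral repository may require accepting its license on Hugging Face:

```python
from transformers import AutoTokenizer

# Arbitrary Greek sample sentence used only for illustration.
text = "Η επεξεργασία φυσικής γλώσσας για τα ελληνικά γίνεται πιο αποδοτική με εκτεταμένο λεξιλόγιο."
n_words = len(text.split())

# "mistralai/Mistral-7B-v0.1" is assumed here as the base checkpoint (it may be gated on Hugging Face).
for name in ["mistralai/Mistral-7B-v0.1", "ilsp/Meltemi-7B-Instruct-v1.5"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{name}: {n_tokens / n_words:.2f} tokens/word")
```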
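
The fine-tuning and alignment bullets can be pictured with a minimal, hypothetical ORPO sketch using TRL's ORPOTrainer; the base checkpoint, hyperparameters, and the toy in-memory dataset are illustrative assumptions only, not the actual Meltemi recipe:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "ilsp/Meltemi-7B-v1.5"  # assumed base checkpoint, for illustration only
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy preference data in the prompt/chosen/rejected format expected by ORPOTrainer.
train_dataset = Dataset.from_dict({
    "prompt": ["Ποια είναι η πρωτεύουσα της Ελλάδας;"],
    "chosen": ["Η πρωτεύουσα της Ελλάδας είναι η Αθήνα."],
    "rejected": ["Δεν γνωρίζω."],
})

# Illustrative hyperparameters; not the values used to train Meltemi 7B Instruct v1.5.
args = ORPOConfig(
    output_dir="orpo-sketch",
    beta=0.1,
    max_length=1024,
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

# Depending on the TRL version, the tokenizer is passed as `tokenizer=` or `processing_class=`.
trainer = ORPOTrainer(model=model, args=args, train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()
```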
Instruction format
The prompt format is the same as the Zephyr format and can be used through the tokenizer's chat template functionality, as illustrated below.
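For reference, a Zephyr-style conversation is rendered roughly like the sketch below; the placeholders are illustrative, and the exact special tokens come from the model's bundled chat template, so prefer the tokenizer's apply_chat_template method over hard-coding this layout:

```
<|system|>
{system message}</s>
<|user|>
{user message}</s>
<|assistant|>
```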
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # use "cpu" if no GPU is available

model = AutoModelForCausalLM.from_pretrained("ilsp/Meltemi-7B-Instruct-v1.5")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-Instruct-v1.5")
model.to(device)

# Greek system prompt ("You are Meltemi, a language model for the Greek language; be helpful,
# concise, careful, polite, impartial, honest and respectful towards the user") and a first
# user question ("Tell me whether you are conscious.").
messages = [
    {"role": "system", "content": "Είσαι το Μελτέμι, ένα γλωσσικό μοντέλο για την ελληνική γλώσσα. Είσαι ιδιαίτερα βοηθητικό προς την χρήστρια ή τον χρήστη και δίνεις σύντομες αλλά επαρκώς περιεκτικές απαντήσεις. Απάντα με προσοχή, ευγένεια, αμεροληψία, ειλικρίνεια και σεβασμό προς την χρήστρια ή τον χρήστη."},
    {"role": "user", "content": "Πες μου αν έχεις συνείδηση."},
]

# Render the conversation with the Zephyr-style chat template and generate a reply.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(input_prompt["input_ids"], max_new_tokens=256, do_sample=True)

# Decode only the newly generated tokens so the assistant's reply can be fed back cleanly.
response = tokenizer.batch_decode(outputs[:, input_prompt["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(response)

# Continue the dialogue: append the assistant's reply and a follow-up user question
# ("Do you believe that people should fear artificial intelligence?").
messages.extend([
    {"role": "assistant", "content": response},
    {"role": "user", "content": "Πιστεύεις πως οι άνθρωποι πρέπει να φοβούνται την τεχνητή νοημοσύνη;"},
])

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
input_prompt = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(input_prompt["input_ids"], max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs[:, input_prompt["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```
Please make sure that the BOS token is always included in the tokenized prompts. This might not be the default setting in all evaluation or fine-tuning frameworks; a quick check is sketched below.
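A minimal sketch of such a check (the sample string is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-Instruct-v1.5")

# Encode an arbitrary prompt and confirm that the first token id is the BOS token id.
ids = tokenizer("Καλημέρα!")["input_ids"]
assert ids[0] == tokenizer.bos_token_id, "BOS missing: enable add_special_tokens / add_bos_token"
```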
Evaluation
The evaluation suite includes 6 test sets and is implemented based on a fork of the lighteval framework.
Our evaluation suite consists of:
- Four machine-translated versions of established English benchmarks for language understanding and reasoning.
- An existing benchmark for question answering in Greek.
- A novel benchmark for medical question answering.
The evaluation is performed in a few-shot setting, consistent with the Open LLM Leaderboard.
The performance improvement of Meltemi 7B Instruct v1.5 is shown in the following table:
| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mistral 7B | 29.8% | 45.0% | 36.5% | 27.1% | 45.8% | 35.0% | 36.5% |
| Meltemi 7B Instruct v1 | 36.1% | 56.0% | 59.0% | 44.4% | 51.1% | 34.1% | 46.8% |
| Meltemi 7B Instruct v1.5 | 48.0% | 75.5% | 63.7% | 40.8% | 53.8% | 45.9% | 54.6% |
Ethical Considerations
This model has been aligned with human preferences, but might generate misleading, harmful, and toxic content.
Acknowledgements
The ILSP team utilized Amazon’s cloud computing services, which were made available via GRNET under the OCRE Cloud framework, providing Amazon Web Services for the Greek Academic and Research Community.
Citation
```bibtex
@misc{voukoutis2024meltemiopenlargelanguage,
  title={Meltemi: The first open Large Language Model for Greek},
  author={Leon Voukoutis and Dimitris Roussis and Georgios Paraskevopoulos and Sokratis Sofianopoulos and Prokopis Prokopidis and Vassilis Papavasileiou and Athanasios Katsamanis and Stelios Piperidis and Vassilis Katsouros},
  year={2024},
  eprint={2407.20743},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.20743},
}
```
License
The model is released under the Apache 2.0 license.