Meltemi: A Large Foundation Language Model for the Greek Language
Meltemi is the first Greek Large Language Model (LLM), trained by the Institute for Language and Speech Processing at Athena Research & Innovation Center. Built on Mistral-7B, it extends capabilities for Greek via continual pretraining on a large Greek text corpus. We offer Meltemi-7B-v1 and an instruction fine-tuned version, Meltemi-7B-Instruct-v1.
Quick Start
Newer Version Notice
This model has been superseded by a newer version (v1.5).
Model Introduction
We introduce Meltemi, leveraging Mistral-7B and enhancing it for Greek through pretraining on a large corpus of high-quality Greek texts.

Features
- Vocabulary Extension: Extended the Mistral-7B tokenizer with Greek tokens (see the tokenizer sketch after the corpus table below).
- Context Length: 8,192 tokens.
- Pretraining Extension: Extended the pretraining of Mistral-7B for Greek proficiency using a large corpus of about 40 billion tokens.
  - The corpus includes 28.5 billion monolingual Greek tokens from public resources, 10.5 billion monolingual English tokens, and 600 million tokens of Greek-English parallel data.
  - The corpus has been processed, filtered, and deduplicated for data quality.
| Sub-corpus | # Tokens | Percentage |
|------------|----------|------------|
| Greek      | 28,555,902,360 | 72.0% |
| English    | 10,478,414,033 | 26.4% |
| Parallel   | 633,816,023    | 1.6%  |
| **Total**  | 39,668,132,416 | 100%  |
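As a quick illustration of the extended tokenizer, the sketch below loads it with the Hugging Face transformers library and compares how a Greek sentence is tokenized against the original Mistral-7B tokenizer. The repository IDs and the comparison itself are assumptions for illustration, not part of this card.

```python
from transformers import AutoTokenizer

# Repository IDs assumed for illustration; adjust to the repositories you are using.
meltemi_tok = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-v1")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "Το μελτέμι είναι ένας ισχυρός βόρειος άνεμος του Αιγαίου."

# The extended vocabulary should encode Greek text into noticeably fewer tokens
# than the original Mistral-7B tokenizer.
print("Meltemi vocab size:", len(meltemi_tok))
print("Mistral vocab size:", len(mistral_tok))
print("Meltemi token count:", len(meltemi_tok(text)["input_ids"]))
print("Mistral token count:", len(mistral_tok(text)["input_ids"]))
```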
Usage Examples
Basic Usage
Please make sure that the BOS token is always included in the tokenized prompts. This might not be the default setting in all evaluation or fine-tuning frameworks.
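A minimal sketch of loading the model with transformers and verifying that the BOS token is prepended to the prompt; the repository ID, the Greek prompt, and the generation settings are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository ID assumed for illustration.
model_id = "ilsp/Meltemi-7B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Η πρωτεύουσα της Ελλάδας είναι"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Verify that the BOS token was prepended; some frameworks disable this by default.
assert inputs["input_ids"][0, 0].item() == tokenizer.bos_token_id, "BOS token missing from the prompt"

outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```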
Documentation
Evaluation
The evaluation suite includes 6 test sets integrated with [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness).
The evaluation is performed in a few-shot setting, consistent with the Open LLM leaderboard. Our training enhances performance across all Greek test sets by an average of +14.9% (a sketch of running the harness is given after the table below).
| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|---|---|---|---|---|---|---|---|
| Mistral 7B | 29.8% | 45.0% | 36.5% | 27.1% | 45.8% | 35.0% | 36.5% |
| Meltemi 7B | 41.0% | 63.6% | 61.6% | 43.2% | 52.1% | 47.0% | 51.4% |
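As a sketch of how such a few-shot evaluation can be run with the lm-eval-harness Python API (v0.4+), the example below uses a placeholder Greek task name and the assumed repository ID `ilsp/Meltemi-7B-v1`; the actual task identifiers and harness version used for the results above may differ.

```python
from lm_eval import evaluator

# "hellaswag_el" is a placeholder task name for the Greek HellaSwag test set;
# the identifier registered in the harness (or in a fork carrying the Greek tasks) may differ.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=ilsp/Meltemi-7B-v1,dtype=bfloat16",
    tasks=["hellaswag_el"],
    num_fewshot=10,  # matches the 10-shot setting reported for HellaSwag EL
)

print(results["results"])
```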
Ethical Considerations
This model has not been aligned with human preferences and may therefore generate misleading, harmful, or toxic content.
Acknowledgements
The ILSP team utilized Amazon's cloud computing services via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.
Citation
@misc{voukoutis2024meltemiopenlargelanguage,
title={Meltemi: The first open Large Language Model for Greek},
author={Leon Voukoutis and Dimitris Roussis and Georgios Paraskevopoulos and Sokratis Sofianopoulos and Prokopis Prokopidis and Vassilis Papavasileiou and Athanasios Katsamanis and Stelios Piperidis and Vassilis Katsouros},
year={2024},
eprint={2407.20743},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.20743},
}
License
This model is licensed under the Apache 2.0 license.