🚀 Llama-Krikri-8B-Base: A Large Foundation Language Model for the Greek Language
Following the release of Meltemi-7B on 26 March 2024, Llama-Krikri-8B-Base extends the capabilities of Llama-3.1-8B to the Greek language.
🚀 Quick Start
With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Base")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
model.to(device)

# Prompt: "A kri-kri differs from a llama because"
input_text = tokenizer("Ένα κρικρί διαφέρει από ένα λάμα επειδή", return_tensors='pt').to(device)
outputs = model.generate(input_text['input_ids'], max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])
```
With an OpenAI-compatible server via vLLM
```bash
vllm serve ilsp/Llama-Krikri-8B-Base \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```
The server can then be queried from Python using the OpenAI client:
```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# Prompt: "Training large language models involves"
response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
)
print(response.choices[0].text)
✨ Features
- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens (see the tokenizer comparison sketch after this list)
- 128k context length (approximately 80,000 Greek words)
- Extended pretraining of Llama-3.1-8B with added proficiency for the Greek language, utilizing a large training corpus
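To illustrate the vocabulary extension, the minimal sketch below compares how many tokens the Krikri tokenizer and the original Llama-3.1 tokenizer need for the same Greek sentence. The sample sentence and the use of the gated `meta-llama/Llama-3.1-8B` checkpoint are assumptions for this example; access to the latter must be requested on Hugging Face.

```python
from transformers import AutoTokenizer

# Example Greek sentence: "Natural language processing for Greek"
greek_text = "Η επεξεργασία φυσικής γλώσσας για τα ελληνικά"

# Krikri's tokenizer extends the Llama-3.1 vocabulary with Greek tokens,
# so it should encode Greek text into noticeably fewer tokens.
krikri_tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # gated repo

print("Krikri tokens:   ", len(krikri_tokenizer.encode(greek_text)))
print("Llama-3.1 tokens:", len(llama_tokenizer.encode(greek_text)))
```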
📦 Installation
The README does not provide explicit installation steps. However, the usage examples imply that you need the transformers and openai libraries, along with vLLM if you want to use the OpenAI-compatible server. You can install them with pip:

```bash
pip install transformers openai vllm
```
📚 Documentation
Model Information
- Vocabulary: The Llama-3.1 tokenizer is extended with Greek tokens.
- Context Length: It has a 128k context length, approximately equivalent to 80,000 Greek words.
- Training Corpus:
- It includes 56.7 billion monolingual Greek tokens, 21 billion monolingual English tokens, 5.5 billion Greek-English parallel data tokens, and 7.8 billion math and code tokens.
- The total corpus size is 91 billion tokens, and chosen subsets were upsampled to 110 billion tokens.
| Property | Details |
|----------|---------|
| Model Type | Llama-Krikri-8B-Base, a large foundation language model for Greek |
| Training Data | A corpus with 56.7 billion Greek tokens, 21 billion English tokens, 5.5 billion parallel data tokens, and 7.8 billion math/code tokens |
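As a quick sanity check of the advertised 128k context length, the model configuration can be inspected. This is a minimal sketch assuming the standard Llama config field; Llama-3.1-derived models are expected to report 131072 (128k) positions.

```python
from transformers import AutoConfig

# Inspect the maximum context length declared in the model config.
config = AutoConfig.from_pretrained("ilsp/Llama-Krikri-8B-Base")
print(config.max_position_embeddings)  # expected: 131072 (i.e. 128k) for Llama-3.1-based models
```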
Evaluation
- Greek Benchmarks: Llama-Krikri-8B-Base shows a +10.8% average improvement over Llama-3.1-8B.
- English Benchmarks: It improves average performance across all English test sets by +0.8%.
Greek Benchmarks
| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|---|---|---|---|---|---|---|---|
| Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% |
| Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% |
| Llama-Krikri-8B | 53.8% | 82.7% | 64.6% | 49.4% | 54.2% | 52.0% | 59.5% |
English Benchmarks
| | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
|---|---|---|---|---|---|---|---|
| Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% |
| Llama-3.1-8B | 74.6% | 71.5% | 82.0% | 58.5% | 44.2% | 66.2% | 66.2% |
| Llama-Krikri-8B | 72.6% | 79.8% | 80.7% | 57.8% | 44.8% | 65.1% | 67.0% |
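The reported average improvements can be sanity-checked with a quick arithmetic comparison of the Average columns from the two tables above (values copied from the tables; this only reproduces the +10.8% and +0.8% figures):

```python
# Average scores from the benchmark tables above.
greek_avg = {"Llama-3.1-8B": 48.7, "Llama-Krikri-8B": 59.5}
english_avg = {"Llama-3.1-8B": 66.2, "Llama-Krikri-8B": 67.0}

print(f"Greek average improvement:   +{greek_avg['Llama-Krikri-8B'] - greek_avg['Llama-3.1-8B']:.1f}%")    # +10.8%
print(f"English average improvement: +{english_avg['Llama-Krikri-8B'] - english_avg['Llama-3.1-8B']:.1f}%")  # +0.8%
```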
Ethical Considerations
⚠️ Important Note
This model has not been aligned with human preferences, and therefore might generate misleading, harmful, and toxic content.
Acknowledgements
The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the OCRE Cloud framework, providing Amazon Web Services for the Greek Academic and Research Community.
📄 License
The model is released under the llama3.1 license.