🚀 Llama-Krikri-8B-Base: A Large Foundation Language Model for the Greek Language
Following the release of Meltemi-7B on 26 March 2024, Llama-Krikri-8B-Base extends the capabilities of Llama-3.1-8B to the Greek language.
🚀 Quick Start
With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Base")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
model.to(device)

# Prompt: "A kri-kri differs from a llama because"
input_text = tokenizer("Ένα κρικρί διαφέρει από ένα λάμα επειδή", return_tensors='pt').to(device)
outputs = model.generate(input_text['input_ids'], max_new_tokens=256, do_sample=True)
print(tokenizer.batch_decode(outputs)[0])
```
With an OpenAI-compatible server via vLLM
```bash
vllm serve ilsp/Llama-Krikri-8B-Base \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```
The server can then be queried from Python using the OpenAI client:
```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

# Prompt: "Training large language models involves"
response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
)
print(response.choices[0].text)
✨ Features
- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens (see the tokenizer comparison sketch after this list)
- 128k context length (approximately 80,000 Greek words)
- Extended pretraining of Llama-3.1-8B with added proficiency for the Greek language, utilizing a large training corpus
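To illustrate the vocabulary extension, the minimal sketch below compares how many tokens the Krikri tokenizer and the original Llama-3.1 tokenizer need for the same Greek sentence. The sample sentence and the use of the gated `meta-llama/Llama-3.1-8B` checkpoint are assumptions for this example; access to the latter must be requested on Hugging Face.

```python
from transformers import AutoTokenizer

# Example Greek sentence: "Natural language processing for Greek"
greek_text = "Η επεξεργασία φυσικής γλώσσας για τα ελληνικά"

# Krikri's tokenizer extends the Llama-3.1 vocabulary with Greek tokens,
# so it should encode Greek text into noticeably fewer tokens.
krikri_tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
llama_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # gated repo

print("Krikri tokens:   ", len(krikri_tokenizer.encode(greek_text)))
print("Llama-3.1 tokens:", len(llama_tokenizer.encode(greek_text)))
```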
📦 Installation
The README does not provide explicit installation steps. However, the usage examples imply that you need the transformers and openai libraries, along with vLLM if you want to use the OpenAI-compatible server. You can install them with pip:

```bash
pip install transformers openai vllm
```
📚 Documentation
Model Information
- Vocabulary: The Llama-3.1 tokenizer is extended with Greek tokens.
- Context Length: It has a 128k context length, approximately equivalent to 80,000 Greek words.
- Training Corpus:
- It includes 56.7 billion monolingual Greek tokens, 21 billion monolingual English tokens, 5.5 billion Greek-English parallel data tokens, and 7.8 billion math and code tokens.
- The total corpus size is 91 billion tokens, and chosen subsets were upsampled to 110 billion tokens.
| Property | Details |
|----------|---------|
| Model Type | Llama-Krikri-8B-Base, a large foundation language model for Greek |
| Training Data | A corpus with 56.7 billion Greek tokens, 21 billion English tokens, 5.5 billion parallel data tokens, and 7.8 billion math/code tokens |
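As a quick sanity check of the advertised 128k context length, the model configuration can be inspected. This is a minimal sketch assuming the standard Llama config field; Llama-3.1-derived models are expected to report 131072 (128k) positions.

```python
from transformers import AutoConfig

# Inspect the maximum context length declared in the model config.
config = AutoConfig.from_pretrained("ilsp/Llama-Krikri-8B-Base")
print(config.max_position_embeddings)  # expected: 131072 (i.e. 128k) for Llama-3.1-based models
```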
Evaluation
- Greek Benchmarks: Llama-Krikri-8B-Base shows a +10.8% average improvement over Llama-3.1-8B.
- English Benchmarks: It improves average performance across all English test sets by +0.8%.
Greek Benchmarks
| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|---|---|---|---|---|---|---|---|
| Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% |
| Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% |
| Llama-Krikri-8B | 53.8% | 82.7% | 64.6% | 49.4% | 54.2% | 52.0% | 59.5% |
English Benchmarks
| | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
|---|---|---|---|---|---|---|---|
| Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% |
| Llama-3.1-8B | 74.6% | 71.5% | 82.0% | 58.5% | 44.2% | 66.2% | 66.2% |
| Llama-Krikri-8B | 72.6% | 79.8% | 80.7% | 57.8% | 44.8% | 65.1% | 67.0% |
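The reported average improvements can be sanity-checked with a quick arithmetic comparison of the Average columns from the two tables above (values copied from the tables; this only reproduces the +10.8% and +0.8% figures):

```python
# Average scores from the benchmark tables above.
greek_avg = {"Llama-3.1-8B": 48.7, "Llama-Krikri-8B": 59.5}
english_avg = {"Llama-3.1-8B": 66.2, "Llama-Krikri-8B": 67.0}

print(f"Greek average improvement:   +{greek_avg['Llama-Krikri-8B'] - greek_avg['Llama-3.1-8B']:.1f}%")    # +10.8%
print(f"English average improvement: +{english_avg['Llama-Krikri-8B'] - english_avg['Llama-3.1-8B']:.1f}%")  # +0.8%
```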
Ethical Considerations
⚠️ Important Note
This model has not been aligned with human preferences, and therefore might generate misleading, harmful, and toxic content.
Acknowledgements
The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the OCRE Cloud framework, providing Amazon Web Services for the Greek Academic and Research Community.
📄 License
The model is released under the llama3.1 license.