🚀 MKLLM-7B-Instruct
MKLLM-7B is an open-source Large Language Model tailored for the Macedonian language. It is built upon the outstanding Mistral-7B-v0.1 model through continued pretraining on a blend of Macedonian and English text.
Training used a corpus of approximately 300M tokens, repeated over 2 epochs. Although this may seem small compared to similar projects, the resulting model demonstrates remarkable proficiency in understanding and processing Macedonian.
This is the instruction-tuned version of MKLLM-7B. It was produced by taking the MKLLM-7B base model and performing full instruction tuning with axolotl, using the chatml conversation format.
We evaluated the model against Meta's Llama3-8B-Instruct and Mistral's Mistral-7B-Instruct-v0.3 on a set of benchmarks translated into Macedonian. MKLLM-7B-Instruct outperforms both leading models in its category.
Notably, these benchmarks mainly measure understanding and do not assess generation quality or fluency. We believe the performance gap is even larger in those areas, as MKLLM-7B-Instruct generates much more coherent Macedonian text.
The benchmarking was carried out using: https://github.com/N13T/mk-llm-eval

🚀 Quick Start
To leverage the instruction training, your prompt should adhere to the chatml format:
```
<|im_start|>system
Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција. Асистентот дава корисни, детални и љубезни одговори на прашањата на корисникот.<|im_end|>
<|im_start|>user
Која планета е позната како 'Црвената Планета'?<|im_end|>
<|im_start|>assistant
Марс<|im_end|>
```
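For illustration, the chatml wrapper can also be assembled by hand. This is a minimal sketch (`build_chatml` is a hypothetical helper, not part of the model's tooling); in practice, prefer the tokenizer's built-in chat template described below:

```python
def build_chatml(messages, add_generation_prompt=True):
    """Join role-tagged messages into a single chatml-formatted prompt string."""
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave an open assistant turn so the model continues as the assistant
        prompt += "<|im_start|>assistant\n"
    return prompt

print(build_chatml([{"role": "user", "content": "Која планета е позната како 'Црвената Планета'?"}]))
```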
This prompt is available as a chat template, which means you can format messages with the `tokenizer.apply_chat_template()` method:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (replace with the model's Hub repository id)
model_id = "MKLLM-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

messages = [
    {"role": "system", "content": "Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција. Асистентот дава корисни, детални и љубезни одговори на прашањата на корисникот."},
    {"role": "user", "content": "Која планета е позната како 'Црвената Планета'?"}
]

gen_input = tokenizer.apply_chat_template(messages,
                                          tokenize=True,
                                          return_dict=True,
                                          return_tensors="pt",
                                          add_generation_prompt=True).to("cuda")

with torch.no_grad():
    generated_ids = model.generate(**gen_input,
                                   max_new_tokens=150,
                                   do_sample=True,
                                   temperature=0.1,
                                   repetition_penalty=1.1)

# Decode only the newly generated tokens (everything past the prompt)
print(tokenizer.decode(generated_ids[0][gen_input["input_ids"].shape[1]:], skip_special_tokens=False))
```
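The final decode step keeps only the newly generated tokens by slicing past the prompt length, since `model.generate` returns the prompt ids followed by the continuation. With plain lists standing in for token-id tensors (toy, made-up ids), the indexing works like this:

```python
# Toy stand-ins: the prompt occupies the first four positions of the output row
prompt_ids = [101, 7, 42, 9]                # what apply_chat_template produced
generated_row = [101, 7, 42, 9, 55, 13, 2]  # generate echoes the prompt, then appends new tokens

# Same indexing as generated_ids[0][gen_input["input_ids"].shape[1]:]
new_tokens = generated_row[len(prompt_ids):]
print(new_tokens)  # → [55, 13, 2]
```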
✨ Features
- Open-source: MKLLM-7B-Instruct is an open-source model, allowing for community contributions and transparency.
- Macedonian Focus: Specifically designed for the Macedonian language, enabling better understanding and processing of Macedonian text.
- Instruction Tuned: The model has undergone instruction training using axolotl with the chatml format, enhancing its ability to follow instructions.
- Benchmark Performance: Outperforms leading models in its category on a set of Macedonian benchmarks.
📄 License
This project is licensed under the CC BY-NC-SA 4.0 license.
⚠️ Important Note
MKLLM-7B-Instruct may hallucinate and produce factually incorrect output, particularly on Macedonian topics, owing to the relatively small training corpus.
