🚀 MKLLM-7B-Instruct
MKLLM-7B is an open-source Large Language Model tailored for the Macedonian language. It is built upon the outstanding Mistral-7B-v0.1 model through continued pretraining on a blend of Macedonian and English text.
Training used a corpus of approximately 300M tokens, repeated over 2 epochs. Although this may seem small compared to similar projects, the resulting model demonstrates remarkable proficiency in understanding and processing Macedonian.
This is the instruction-tuned version of MKLLM-7B. It was produced by taking the MKLLM-7B base model and performing full instruction tuning with axolotl, using the chatml conversation format.
We evaluated the model against Meta's Llama3-8B-Instruct and Mistral's Mistral-7B-Instruct-v0.3 on a set of benchmarks translated into Macedonian. MKLLM-7B-Instruct outperforms both leading models in its category.
Notably, these benchmarks mainly measure understanding and do not assess generation quality or fluency. We believe the performance gap is even larger in those areas, as MKLLM-7B-Instruct generates much more coherent Macedonian text.
The benchmarking was carried out using: https://github.com/N13T/mk-llm-eval

🚀 Quick Start
To leverage the instruction training, your prompt should adhere to the chatml format:
```
<|im_start|>system
Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција. Асистентот дава корисни, детални и љубезни одговори на прашањата на корисникот.<|im_end|>
<|im_start|>user
Која планета е позната како 'Црвената Планета'?<|im_end|>
<|im_start|>assistant
Марс<|im_end|>
```
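For illustration, the chatml wrapper can also be assembled by hand. This is a minimal sketch (`build_chatml` is a hypothetical helper, not part of the model's tooling); in practice, prefer the tokenizer's built-in chat template described below:

```python
def build_chatml(messages, add_generation_prompt=True):
    """Join role-tagged messages into a single chatml-formatted prompt string."""
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave an open assistant turn so the model continues as the assistant
        prompt += "<|im_start|>assistant\n"
    return prompt

print(build_chatml([{"role": "user", "content": "Која планета е позната како 'Црвената Планета'?"}]))
```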
This prompt is available as a chat template, which means you can format messages with the `tokenizer.apply_chat_template()` method:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (replace with the model's Hub repository id)
model_id = "MKLLM-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

messages = [
    {"role": "system", "content": "Разговор помеѓу љубопитен корисник и асистент со вештачка интелигенција. Асистентот дава корисни, детални и љубезни одговори на прашањата на корисникот."},
    {"role": "user", "content": "Која планета е позната како 'Црвената Планета'?"}
]

gen_input = tokenizer.apply_chat_template(messages,
                                          tokenize=True,
                                          return_dict=True,
                                          return_tensors="pt",
                                          add_generation_prompt=True).to("cuda")

with torch.no_grad():
    generated_ids = model.generate(**gen_input,
                                   max_new_tokens=150,
                                   do_sample=True,
                                   temperature=0.1,
                                   repetition_penalty=1.1)

# Decode only the newly generated tokens (everything past the prompt)
print(tokenizer.decode(generated_ids[0][gen_input["input_ids"].shape[1]:], skip_special_tokens=False))
```
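The final decode step keeps only the newly generated tokens by slicing past the prompt length, since `model.generate` returns the prompt ids followed by the continuation. With plain lists standing in for token-id tensors (toy, made-up ids), the indexing works like this:

```python
# Toy stand-ins: the prompt occupies the first four positions of the output row
prompt_ids = [101, 7, 42, 9]                # what apply_chat_template produced
generated_row = [101, 7, 42, 9, 55, 13, 2]  # generate echoes the prompt, then appends new tokens

# Same indexing as generated_ids[0][gen_input["input_ids"].shape[1]:]
new_tokens = generated_row[len(prompt_ids):]
print(new_tokens)  # → [55, 13, 2]
```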
✨ Features
- Open-source: MKLLM-7B-Instruct is an open-source model, allowing for community contributions and transparency.
- Macedonian Focus: Specifically designed for the Macedonian language, enabling better understanding and processing of Macedonian text.
- Instruction Tuned: The model has undergone instruction training using axolotl with the chatml format, enhancing its ability to follow instructions.
- Benchmark Performance: Outperforms leading models in its category on a set of Macedonian benchmarks.
📄 License
This project is licensed under the CC BY-NC-SA 4.0 license.
⚠️ Important Note
MKLLM-7B-Instruct may hallucinate and produce factually incorrect output, particularly on Macedonian topics, owing to the relatively small training corpus.
