🚀 INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1
INSAIT presents MamayLM-Gemma-2-9B-IT-v0.1, a high-performing Ukrainian language model based on Google's Gemma 2 models, free to use and licensed under the Gemma terms.
🚀 Quick Start
Installation
First, install the latest version of the transformers library:
pip install -U 'transformers[torch]'
Loading the Model
Then, load the model in transformers:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package; omit on unsupported hardware
    device_map="auto",
)
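Once the model is loaded, you can run a quick sanity check. This is a minimal sketch; the prompt is illustrative, and the recommended generation parameters are covered in the usage examples below.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1",
    use_default_system_prompt=False,
)

# Format a single user turn with the Gemma 2 chat template and generate.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Хто такий Козак Мамай?"}],  # "Who is Cossack Mamay?"
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))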
✨ Features
- Multilingual Capability: The model supports both Ukrainian and English, achieving excellent performance in both languages.
- Outstanding Performance: It outperforms much larger models such as Alibaba's Qwen 2.5 72B and Meta's Llama 3.1 70B on Ukrainian benchmarks.
- Instruction Fine-Tuning: Leveraging instruction fine-tuning, it can better understand and follow user instructions.
📦 Installation
As described in the Quick Start section, you first need to install the transformers library:
pip install -U 'transformers[torch]'
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
Advanced Usage
from transformers import GenerationConfig

generation_params = GenerationConfig(
    max_new_tokens=2048,
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],  # <eos> and <end_of_turn> in the Gemma 2 vocabulary
    do_sample=True,
)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1",
    use_default_system_prompt=False,
)

messages = [
    {"role": "user", "content": "Хто такий Козак Мамай?"},  # "Who is Cossack Mamay?"
]

# The chat template already adds <bos>; move the inputs to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **input_ids,
    generation_config=generation_params,
)
print(tokenizer.decode(outputs[0]))
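For interactive use, output can also be streamed token by token. A small sketch using transformers' TextStreamer, reusing the model, tokenizer, input_ids, and generation_params defined above:

from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated,
# omitting the prompt and special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **input_ids,
    generation_config=generation_params,
    streamer=streamer,
)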
📚 Documentation
Model Description
The model is built on top of Google's Gemma 2 9B open models. It was continuously pre-trained on a large pre-filtered dataset (75B tokens of Ukrainian and English data in total) using data mixing and model merging, which allows it to gain outstanding Ukrainian cultural and linguistic capabilities while retaining its English performance.
During pre-training, various datasets were used, including Ukrainian web crawl data (FineWeb2), freely available datasets such as Wikipedia, specialized Ukrainian datasets, and machine translations of popular English datasets. The model was then instruction-fine-tuned on a newly constructed Ukrainian instruction dataset created from machine translations of the current best English datasets and from specialized Ukrainian datasets prepared by the Ukrainian community.
For more information, see our blog post (English, Ukrainian).
Benchmarks and Results

The model is evaluated on a set of standard English benchmarks, their Ukrainian translations, and Ukrainian-specific benchmarks:
- Winogrande challenge: testing world knowledge and understanding
- Hellaswag: testing sentence completion
- ARC Easy/Challenge: testing logical reasoning
- TriviaQA: testing trivia knowledge
- GSM-8K: solving grade-school mathematics word problems
- MMLU: testing knowledge on a multitude of topics
- IFEval: testing instruction-following skills
- ZNO: testing knowledge of the Ukrainian high school curriculum in Ukrainian language & literature, history, mathematics and geography
The results show that the model can outperform much larger models in Ukrainian benchmarks and retains excellent English performance.
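The English results above can in principle be reproduced with EleutherAI's lm-evaluation-harness. A minimal sketch, using the harness's standard English task names; the Ukrainian-translated benchmarks and ZNO are not assumed to be available in the public harness:

# Evaluation sketch with lm-evaluation-harness (pip install lm-eval).
# Only standard English tasks are shown; the Ukrainian-translated
# benchmarks and ZNO are not part of the public harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1,dtype=bfloat16",
    tasks=["winogrande", "hellaswag", "arc_easy", "arc_challenge", "gsm8k", "mmlu"],
    batch_size=8,
)
print(results["results"])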
Recommended Parameters
For optimal performance, we recommend the following parameters for text generation:
from transformers import GenerationConfig

generation_params = GenerationConfig(
    max_new_tokens=2048,
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    eos_token_id=[1, 107],  # <eos> and <end_of_turn> in the Gemma 2 vocabulary
    do_sample=True,
)
In principle, higher temperatures should also work adequately.
Instruction Format
To leverage instruction fine-tuning, your prompt should begin with a beginning-of-sequence token <bos> and be formatted in the Gemma 2 chat template; <bos> should only be the first token in a chat sequence. For example:
<bos><start_of_turn>user
Хто такий Козак Мамай?<end_of_turn>
<start_of_turn>model
This format is also available as a chat template via the apply_chat_template() method.
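A quick way to confirm the formatting is to render the template to a string rather than token ids. This sketch reuses the tokenizer loaded in the usage examples above and should reproduce the format shown:

# Render the chat template to a plain string instead of token ids.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Хто такий Козак Мамай?"}],  # "Who is Cossack Mamay?"
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # should start with <bos><start_of_turn>user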
Use with vLLM
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1",
    use_default_system_prompt=False,
)

sampling_params = SamplingParams(
    max_tokens=2048,
    temperature=0.1,
    top_k=25,
    top_p=1,
    repetition_penalty=1.1,
    stop_token_ids=[1, 107],  # <eos> and <end_of_turn>
)

llm = LLM(
    model="INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1",
    dtype="bfloat16",
    enforce_eager=True,
)

messages = [
    {"role": "user", "content": "Хто такий Козак Мамай?"},  # "Who is Cossack Mamay?"
]

# Apply the Gemma 2 chat template, then tokenize without adding
# special tokens again (the template already includes <bos>).
formatted_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
input_ids = tokenizer(
    formatted_prompt,
    add_special_tokens=False,
).input_ids

prompt = TokensPrompt(prompt_token_ids=input_ids)
output = llm.generate(prompt, sampling_params)
generated_text = output[0].outputs[0].text
print(generated_text)
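vLLM can also batch several prompts in a single call. A brief sketch, reusing the tokenizer, llm, and sampling_params defined above; the second question is illustrative:

# Batch several chats in one call; vLLM schedules them together.
questions = [
    "Хто такий Козак Мамай?",  # "Who is Cossack Mamay?"
    "Що таке писанка?",        # "What is a pysanka?"
]
prompts = []
for q in questions:
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    token_ids = tokenizer(text, add_special_tokens=False).input_ids
    prompts.append(TokensPrompt(prompt_token_ids=token_ids))
for out in llm.generate(prompts, sampling_params):
    print(out.outputs[0].text)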
Use with GGML / llama.cpp
The model and instructions for usage in GGUF format are available at INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1-GGUF.
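As one option for local inference, the GGUF files can be loaded with the llama-cpp-python bindings. A minimal sketch; the quantization filename pattern is an assumption, so check the GGUF repository for the files actually published:

# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The filename pattern below is a hypothetical quantization choice.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1-GGUF",
    filename="*Q4_K_M.gguf",  # hypothetical; pick a file from the repo
    n_ctx=4096,
)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Хто такий Козак Мамай?"}],
    max_tokens=2048,
    temperature=0.1,
)
print(response["choices"][0]["message"]["content"])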
Community Feedback
We welcome feedback from the community to help improve MamayLM. If you have suggestions, encounter any issues, or have ideas for improvements, please:
- Share your experience using the model through Hugging Face's community discussion feature, or
- Contact us at contact@insait.ai
Your real-world usage and insights are valuable in helping us optimize the model's performance and behaviour for various use cases.
📄 License
MamayLM is distributed under the Gemma Terms of Use.