🚀 INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0
INSAIT presents BgGPT-Gemma-2-27B-IT-v1.0, a cutting-edge Bulgarian language model based on Google's Gemma 2. It's free to use and performs well in both Bulgarian and English.
🚀 Quick Start
Installation
First, install the latest version of the transformers library:
pip install -U 'transformers[torch]'
Loading the Model
Then load the model with the transformers library:
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0",
torch_dtype=torch.bfloat16,
attn_implementation="eager",
device_map="auto",
)
✨ Features
- Multilingual Proficiency: BgGPT-Gemma-2-27B-IT-v1.0 achieves outstanding performance in both Bulgarian and English.
- Free to Use: The model is free to use under the Gemma Terms of Use.
- State-of-the-Art Performance: It outperforms much larger models on Bulgarian benchmarks while retaining the excellent English performance inherited from the original Google Gemma 2 models.
📦 Installation
The installation steps are as follows:
pip install -U 'transformers[torch]'
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0",
torch_dtype=torch.bfloat16,
attn_implementation="eager",
device_map="auto",
)
Advanced Usage
Recommended Parameters
from transformers import GenerationConfig
generation_params = GenerationConfig(
max_new_tokens=2048,
temperature=0.1,
top_k=25,
top_p=1,
repetition_penalty=1.1,
eos_token_id=[1,107],
do_sample=True
)
Instruction Format
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0",
use_default_system_prompt=False,
)
messages = [
{"role": "user", "content": "Кога е основан Софийският университет?"},
]
input_ids = tokenizer.apply_chat_template(
messages,
return_tensors="pt",
add_generation_prompt=True,
return_dict=True
)
outputs = model.generate(
**input_ids,
generation_config=generation_params
)
print(tokenizer.decode(outputs[0]))
Use with vLLM
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
"INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0",
use_default_system_prompt=False,
)
sampling_params = SamplingParams(
max_tokens=2048,
temperature=0.1,
top_k=25,
top_p=1,
repetition_penalty=1.1,
stop_token_ids=[1, 107],
)
llm = LLM(
model="INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0",
dtype="bfloat16",
enforce_eager=True
)
messages = [
{"role": "user", "content": "Кога е основан Софийският университет?"},
]
formatted_prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
input_ids = tokenizer(
formatted_prompt,
add_special_tokens=False
).input_ids
prompt = TokensPrompt(prompt_token_ids=input_ids)
output = llm.generate(
prompt,
sampling_params
)
generated_text = output[0].outputs[0].text
print(generated_text)
📚 Documentation
Model Description
The model is built on top of Google's Gemma 2 27B open models. It was continuously pre-trained on around 100 billion tokens (85 billion in Bulgarian) using the Branch-and-Merge strategy INSAIT presented at EMNLP'24, which allows the model to gain outstanding Bulgarian cultural and linguistic capabilities while retaining its English performance. During pre-training, various datasets were used, including Bulgarian web crawl data, freely available datasets such as Wikipedia, a range of specialized Bulgarian datasets sourced by the INSAIT Institute, and machine translations of popular English datasets. The model was then instruction-fine-tuned on a newly constructed Bulgarian instruction dataset created using real-world conversations. For more information, see the blog post.
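To give a sense of the merge step in a branch-and-merge style pipeline, below is a minimal conceptual sketch in which separately trained branch checkpoints are combined by weight averaging. This is an illustration only, not INSAIT's actual training code; the real branching schedule, data splits, and merge weights follow the EMNLP'24 paper.
import torch

def merge_branches(branch_state_dicts, weights=None):
    # Average parameter tensors from several branch checkpoints.
    # Illustrative only: the published method defines how branches are
    # trained and how often they are merged back into a single model.
    if weights is None:
        weights = [1.0 / len(branch_state_dicts)] * len(branch_state_dicts)
    merged = {}
    for name in branch_state_dicts[0]:
        merged[name] = sum(
            w * sd[name].float() for w, sd in zip(weights, branch_state_dicts)
        )
    return merged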
Benchmarks and Results

The model is evaluated on a set of standard English benchmarks, their Bulgarian translations, and Bulgarian-specific benchmarks, all provided at https://github.com/insait-institute/lm-evaluation-harness-bg. These benchmarks test logical reasoning, mathematics, knowledge, language understanding, and other skills. The results show the excellent abilities of both the 9B and 27B models in Bulgarian, allowing them to outperform much larger models, including Alibaba's Qwen 2.5 72B and Meta's Llama 3.1 70B. Both BgGPT 9B and BgGPT 27B significantly improve upon the previous version of BgGPT based on Mistral-7B, and they retain the excellent English performance inherited from the original Google Gemma 2 models.
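If you want to reproduce the evaluation, a minimal sketch using the harness's Python entry point is shown below, assuming the Bulgarian fork keeps the upstream lm_eval API; the task names are placeholders, so use the task list documented in the repository.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0,dtype=bfloat16",
    tasks=["hellaswag_bg", "winogrande_bg"],  # placeholder task names
)
print(results["results"])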
Chat Preference

The BgGPT 27B model's chat performance is evaluated on thousands of real-world Bulgarian conversations spanning around 100 topics. The results show that the model significantly surpasses the smaller variants of commercial models in Bulgarian chat quality and is on par with the best commercial models, as judged by GPT-4o itself.
Use with GGML / llama.cpp
The model and instructions for usage in GGUF format are available at INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0-GGUF.
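As a starting point, here is a hedged sketch using llama-cpp-python; the quantization filename pattern is an assumption, so check the GGUF repository above for the actual files and recommended settings.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="INSAIT-Institute/BgGPT-Gemma-2-27B-IT-v1.0-GGUF",
    filename="*Q4_K_M.gguf",  # assumed quantization; pick a file from the repo
    n_ctx=4096,
)
response = llm.create_chat_completion(
    # "When was Sofia University founded?"
    messages=[{"role": "user", "content": "Кога е основан Софийският университет?"}],
    temperature=0.1,
    max_tokens=2048,
)
print(response["choices"][0]["message"]["content"])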
Community Feedback
The community's feedback is welcome to help improve BgGPT. You can share your experience using the model through Hugging Face's community discussion feature or contact the team at bggpt@insait.ai.
🔧 Technical Details
The model uses the Branch-and-Merge strategy presented at EMNLP'24 during pre-training. It is pre-trained on around 100 billion tokens (85 billion in Bulgarian) and then instruction-fine-tuned on a Bulgarian instruction dataset created from real-world conversations.
📄 License
BgGPT is distributed under the Gemma Terms of Use.
⚠️ Important Note
Models based on Gemma 2, such as BgGPT-Gemma-2-27B-IT-v1.0, do not support flash attention; using it results in degraded performance. Use attn_implementation="eager", as shown in the examples above.
💡 Usage Tip
For optimal results, use the recommended generation parameters above; they have been extensively tested by the developers.