🚀 Model Card for GaMS-27B-Instruct
GaMS-2B, GaMS-9B, and GaMS-27B are new, improved, and larger models in the GaMS (Generative Model for Slovene) family. These models are based on Google's Gemma 2 family and have been continually pre-trained on Slovene, English, and some Croatian, Serbian, and Bosnian corpora.
This is the SFT version of the GaMS-27B model.
🚀 Quick Start
The model can be run through the `pipeline` API. Here's how you can get started:
from transformers import pipeline
model_id = "cjvt/GaMS-27B-Instruct"
pline = pipeline(
"text-generation",
model=model_id,
device_map="cuda" # replace with "mps" to run on a Mac device
)
# Example of response generation
message = [{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
# Example of conversation chain
new_message = response[0]["generated_text"]
new_message.append({"role": "user", "content": "Lahko bolj podrobno opišeš ta dogodek?"})
response = pline(new_message, max_new_tokens=1024)
print("Model's response:", response[0]["generated_text"][-1]["content"])
✨ Features
- Multilingual Support: Supports Slovene, English (primary), and Croatian, Bosnian, and Serbian (secondary). It may also work for other languages supported by Gemma 2.
- Based on a Strong Foundation: Built on Google's Gemma 2 family, with continual pre-training on diverse corpora.
- SFT Version: This is the supervised fine-tuned (SFT) version of the GaMS-27B model, tuned to follow instructions and hold conversations.
📦 Installation
No specific installation steps are provided in the original README. The code examples in this card use the Hugging Face transformers library; install it if you haven't already:
pip install transformers
💻 Usage Examples
Basic Usage
from transformers import pipeline
model_id = "cjvt/GaMS-27B-Instruct"
pline = pipeline(
"text-generation",
model=model_id,
device_map="cuda" # replace with "mps" to run on a Mac device
)
# Example of response generation
message = [{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
Advanced Usage
# For multi-GPU inference
from transformers import pipeline
model_id = "cjvt/GaMS-27B-Instruct"
pline = pipeline(
"text-generation",
model=model_id,
device_map="auto"
)
# Example of response generation
message = [{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
# Example of conversation chain
new_message = response[0]["generated_text"]
new_message.append({"role": "user", "content": "Lahko bolj podrobno opišeš ta dogodek?"})
response = pline(new_message, max_new_tokens=1024)
print("Model's response:", response[0]["generated_text"][-1]["content"])
📚 Documentation
Basic information
Property | Details |
---|---|
Developed by | A team of researchers at the University of Ljubljana, Faculty of Computer and Information Science. Team members: Domen Vreš, Iztok Lebar Bajec, Tjaša Arčon, Gašper Jelovčan, and Marko Robnik-Šikonja. |
Languages | Slovene, English (primary), Croatian, Bosnian and Serbian (secondary). The model might also work for other languages supported by Gemma 2, even though it was not continually pretrained on them. |
Base model | cjvt/GaMS-27B |
License | Gemma |
Acknowledgment
The model was developed within the PoVeJMo research program (Adaptive Natural Language Processing with Large Language Models), particularly within the research project titled SloLLaMai -- Open-access computationally efficient models for Slovenian. The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU. The authors also acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P6-0411 -- Language Resources and Technologies for Slovene).
We thank everyone who worked on data collection and preparation, enabling us to train our model. Special thanks go to Nikola Ljubešić, Taja Kuzman, Tjaša Arčon, Jaka Čibej, Simon Krek, Tomaž Erjavec, Iztok Kosem and Tomaž Savodnik.
Data
CPT Data
The model was continually pre-trained in two stages. In the first stage, parallel English-Slovene (and in some cases Croatian) corpora were used to align the languages. In the second stage, the model was trained on separate English, Slovene, Croatian, Bosnian, and Serbian corpora.
Parallel alignment corpora
Corpus | Alignment level | # Tokens | Percentage |
---|---|---|---|
KAS Abstracts | Document level | 31 M | 1.65 % |
DGT | Separate documents | 697 M | 36.56 % |
MaCoCu Parallel | Separate documents | 430 M | 22.53 % |
CC-News | Paragraph level | 749 M | 39.25 % |
Total | | 1.91 B | |
Explanation of each alignment level:
- Document level: Parallel documents were concatenated into a single document
- Separate documents: Parallel documents were not explicitly aligned
- Paragraph level: Paragraphs of parallel documents were concatenated (the first paragraph of the Slovene/English document was followed by the first paragraph in the other language, which was then followed by the second paragraph in the first language, and so on); see the sketch after this list
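To make the paragraph-level scheme concrete, here is a small illustrative sketch (not the actual preprocessing code) of how paragraphs from two parallel documents would be interleaved into a single training document:

```python
from itertools import chain, zip_longest

def interleave_paragraphs(doc_a: list[str], doc_b: list[str]) -> str:
    """Alternate paragraphs of two parallel documents into one document."""
    pairs = zip_longest(doc_a, doc_b, fillvalue="")
    paragraphs = [p for p in chain.from_iterable(pairs) if p]
    return "\n\n".join(paragraphs)

# Hypothetical two-paragraph parallel documents.
sl = ["Prvi odstavek v slovenščini.", "Drugi odstavek v slovenščini."]
en = ["First paragraph in English.", "Second paragraph in English."]
print(interleave_paragraphs(sl, en))
```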
Second stage corpora
Corpus | Language | # Tokens | Percentage |
---|---|---|---|
KAS | Slovene | 2.77 B | 20.34 % |
MetaFida* | Slovene | 4.66 B | 34.18 % |
Wikipedia-En (Date: January 23rd 2025) | English | 5.45 B | 39.99 % |
Wikipedia-Sl (Date: January 1st 2025) | Slovene | 0.16 B | 1.19 % |
Wikipedia-Hr (Date: January 1st 2025) | Croatian | 0.15 B | 1.13 % |
Wikipedia-Bs (Date: January 1st 2025) | Bosnian | 0.07 B | 0.50 % |
Wikipedia-Sr-Latin* | Serbian | 0.36 B | 2.68 % |
Total | | 13.62 B | |
Remarks:
- The following corpora were excluded from MetaFida: dgt15_sl, classlawiki_sl, tweet_sl, janes_tweet, janes_forum, janes_news
- Serbian Wikipedia was converted from Cyrillic to Latin (a conversion sketch follows these remarks)
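The original README does not state which tool performed the Cyrillic-to-Latin conversion; a minimal, purely illustrative sketch with the `cyrtranslit` package (our assumption, not the authors' tooling) would be:

```python
# pip install cyrtranslit
import cyrtranslit

# Convert Serbian Cyrillic text to the Latin script.
text_cyrillic = "Добродошли у Википедију."
text_latin = cyrtranslit.to_latin(text_cyrillic, "sr")
print(text_latin)  # Dobrodošli u Vikipediju.
```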
SFT Data
Our SFT training data consisted of approximately 25,000 training and 1,500 validation examples. The dataset was a mixture of the following datasets:
- GaMS-Instruct-GEN 1.0
- GaMS-Instruct-DH 1.0: 3,000 randomly selected examples were taken from this dataset
- GaMS-Instruct-MED 1.0: 3,000 randomly selected examples were taken from this dataset
- Parallel corpus EN-SL RSDO4 2.0: additional filtering was applied to this corpus. First, we ran FastText language identification using [NeMo Curator](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentification.html) and kept only the examples where the source was detected as English and the target as Slovene. Next, we ran the [COMET](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl) model to evaluate the translations and kept only the examples with COMET scores higher than 0.945 (approximately 8,000 examples); an illustrative filtering sketch follows this list.
- Aya Dataset: only the English and Serbian examples were taken from this dataset. Serbian examples were converted from Cyrillic to Latin.
- Math competitions: we took PDFs of Slovene national math competitions held between 2001 and 2010, extracted the text with [olmOCR](https://huggingface.co/allenai/olmOCR-7B-0225-preview), and manually corrected it. This gave us around 150 solved math problems.
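The language-identification and COMET filtering described for the RSDO4 corpus can be sketched as follows. This is only an illustration, not the authors' NeMo Curator pipeline: it assumes the `fasttext` package with the public `lid.176.bin` model and the `unbabel-comet` package, and reuses the 0.945 threshold stated above:

```python
import fasttext
from comet import download_model, load_from_checkpoint

# FastText language identification (lid.176.bin is the public fastText LID model).
lid = fasttext.load_model("lid.176.bin")

def detected_lang(text: str) -> str:
    labels, _ = lid.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

# Reference-free COMET quality-estimation model named in the list above.
comet = load_from_checkpoint(download_model("Unbabel/wmt23-cometkiwi-da-xxl"))

pairs = [  # hypothetical sentence pair; real input would be the full corpus
    {"src": "The weather is nice today.", "mt": "Danes je lepo vreme."},
]

# Keep pairs whose source is detected as English and target as Slovene ...
candidates = [p for p in pairs if detected_lang(p["src"]) == "en" and detected_lang(p["mt"]) == "sl"]
# ... and whose COMET score exceeds 0.945 (use gpus=0 if no GPU is available).
scores = comet.predict(candidates, batch_size=8, gpus=1).scores
filtered = [p for p, s in zip(candidates, scores) if s > 0.945]
print(f"Kept {len(filtered)} of {len(pairs)} pairs")
```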
Training
The model was trained on the Booster partition of Leonardo HPC.
CPT
We continually pretrained the model using the NVIDIA NeMo 2.0 framework. The model was trained in BF16-Mixed precision using tensor parallelism across 8 GPUs, sequence parallelism, and activation recomputation. Training ran on 32 nodes, each containing 4 A100 64 GB GPUs. The parallel alignment stage took approximately 8 hours and the second stage approximately 110 hours.
The model was trained using a cosine learning rate scheduler with linear warmup and the following hyperparameters (an illustrative sketch of the schedule follows the two lists below).
Parallel alignment:
- warmup steps: 150
- minimal learning rate: 5e-6
- maximal learning rate: 2e-5
- constant steps: 0
- batch size: 512 (4 million tokens)
Second stage:
- warmup steps: 500
- minimal learning rate: 5e-6
- maximal learning rate: 5e-5
- constant steps: 100
- batch size: 512 (4 million tokens)
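For illustration, the sketch below reproduces such a schedule in plain Python. It is not the NeMo configuration itself, and placing the constant phase immediately after warmup at the maximal rate is an assumption about the scheduler's exact behavior:

```python
import math

def learning_rate(step: int, total_steps: int, warmup_steps: int,
                  constant_steps: int, max_lr: float, min_lr: float) -> float:
    """Linear warmup, optional constant phase, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + constant_steps:
        return max_lr  # assumption: constant phase held at the maximal rate
    decay_steps = max(total_steps - warmup_steps - constant_steps, 1)
    progress = min((step - warmup_steps - constant_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Second-stage values from the list above; total_steps is a hypothetical placeholder.
print(learning_rate(step=600, total_steps=3000, warmup_steps=500,
                    constant_steps=100, max_lr=5e-5, min_lr=5e-6))
```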
SFT
In contrast to the 2B and 9B models, the 27B model was supervised fine-tuned using the NeMo framework, which enables easier scaling. The model was trained in BF16 precision, using tensor parallelism to split it across 4 GPUs, sequence parallelism, and activation recomputation. Training ran on 8 nodes, each with 4 A100 64 GB GPUs.
The model was tuned using a cosine learning rate scheduler with linear warmup and the following hyperparameters:
- number of epochs: training ran for 5 epochs, but the best-performing model according to validation loss was obtained after the second epoch, so we kept that checkpoint
- batch size: 128
- warmup steps: 150
- minimal learning rate: 1e-8
- maximal learning rate: 5e-6
- constant steps: 0
Evaluation
The models were evaluated using the Slovene SuperGLUE collection of classification tasks on SloBench. The Instruct version of the model was also evaluated on translation from English to Slovene and from Slovene to English. Additionally, we evaluated our models on [Slovenian-LLM-Eval](https://huggingface.co/datasets/cjvt/slovenian-llm-eval).
Code for evaluation:
- SloBench tasks
- [Slovenian-LLM-Eval](https://github.com/SloLama/slovenian-llm-eval)
Slovenian-LLM-Eval results
A comparison between the GaMS models, the base Gemma 2 models, and SlovenianGPT (an open-source model for Slovene based on Mistral 7B) is shown in the figure below. All models were evaluated in a 0-shot scenario.
SloBench Results
GaMS 2B, 9B, 27B, and 27B-Instruct models were evaluated in a 3-shot scenario, except for MultiRC and the translation tasks, where 0-shot was used. GaMS-2B-Instruct and GaMS-9B-Instruct were evaluated in a 0-shot scenario on all tasks. We used guided decoding to ensure the correct format of the responses.
Slovene SuperGLUE
Rank | Title | Average | BoolQ Accuracy | CB Accuracy | CB F1 Score | CB Average | COPA Accuracy | MultiRC EM | MultiRC F1a Score | MultiRC Average | RTE Accuracy | WSC Accuracy |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | GaMS-27B | 0.7601 | 0.8333 | 0.6440 | 0.5864 | 0.6152 | 0.9540 | 0.3904 | 0.7504 | 0.5704 | 0.7931 | 0.7945 |
2 | PrešernGPT 0.1 | 0.7568 | 0.8333 | 0.8520 | 0.5868 | 0.7194 | 0.9740 | 0.4985 | 0.8061 | 0.6523 | 0.8276 | 0.5342 |
3 | Gemma 2 27B | 0.7546 | 0.8333 | 0.6680 | 0.5972 | 0.6326 | 0.9140 | 0.4174 | 0.7295 | 0.5735 | 0.8276 | 0.7466 |
4 | GaMS-9B | 0.7309 | 0.7000 | 0.8400 | 0.7955 | 0.8178 | 0.9000 | 0.3243 | 0.6551 | 0.4897 | 0.7931 | 0.6849 |
5 | GaMS-27B-Instruct | 0.7038 | 0.8333 | 0.7200 | 0.6322 | 0.6761 | 0.9400 | 0.0511 | 0.5813 | 0.3162 | 0.7931 | 0.6644 |
6 | GaMS-9B-Instruct (0-shot) | 0.6997 | 0.8000 | 0.7960 | 0.7128 | 0.7544 | 0.8140 | 0.0721 | 0.6174 | 0.3447 | 0.7931 | 0.6918 |
7 | Gemma 2 9B | 0.6980 | 0.8333 | 0.8240 | 0.5683 | 0.6962 | 0.8700 | 0.2282 | 0.5310 | 0.3796 | 0.7241 | 0.6849 |
9 | CroSloEngual BERT | 0.6078 | 0.7333 | 0.7920 | 0.7437 | 0.7679 | 0.5720 | 0.0931 | 0.5241 | 0.3086 | 0.6552 | 0.6096 |
12 | SlovenianGPT-Chat | 0.5078 | 0.7333 | 0.3920 | 0.3829 | 0.3874 | 0.6840 | 0.2432 | 0.4944 | 0.3688 | 0.5172 | 0.3562 |
13 | Gemma 2 2B | 0.4876 | 0.6333 | 0.4520 | 0.2123 | 0.3321 | 0.5180 | 0.1471 | 0.4419 | 0.2945 | 0.5862 | 0.5616 |
14 | GaMS-2B | 0.4790 | 0.5667 | 0.6080 | 0.4880 | 0.5480 | 0.5240 | 0.0631 | 0.5234 | 0.2932 | 0.5517 | 0.3904 |
15 | GaMS-2B-Instruct (0-shot) | 0.4608 | 0.6667 | 0.5120 | 0.2611 | 0.3866 | 0.5000 | 0.0420 | 0.5377 | 0.2899 | 0.5517 | 0.3699 |
16 | GaMS-1B | 0.4604 | 0.5000 | 0.6200 | 0.4565 | 0.5382 | 0.4920 | 0.1351 | 0.2675 | 0.2013 | 0.4828 | 0.5479 |
17 | GaMS-1B-Chat | 0.4570 | 0.8000 | 0.4880 | 0.3023 | 0.3951 | 0.4840 | 0.1081 | 0.2428 | 0.1755 | 0.5172 | 0.3692 |
English to Slovene translation (first 11 models on the benchmark)
Rank | Title | BERT score | BLEU (avg) | METEOR (avg) | CHRF (avg) | BLEU (corpus) | CHRF (corpus) |
---|---|---|---|---|---|---|---|
1 | DeepL Translator | 0.8812 | 0.3153 | 0.5902 | 0.6205 | 0.3599 | 0.6205 |
2 | gemini-1.5-pro | 0.8791 | 0.3124 | 0.5895 | 0.6176 | 0.3569 | 0.6176 |
3 | Sonnet 3.5 | 0.8789 | 0.3059 | 0.5783 | 0.6204 | 0.3442 | 0.6204 |
4 | gpt-4o | 0.8784 | 0.2958 | 0.5811 | 0.6138 | 0.3379 | 0.6138 |
5 | EuroLLM-9B-Instruct | 0.8741 | 0.2927 | 0.5792 | 0.6055 | 0.3273 | 0.6055 |
6 | GaMS-27B-Instruct | 0.8734 | 0.2866 | 0.5688 | 0.5986 | 0.3246 | 0.5986 |
7 | seamless-m4t-v2-large | 0.8731 | 0.2780 | 0.5599 | 0.5947 | 0.3085 | 0.5947 |
8 | GaMS-9B-Instruct | 0.8713 | 0.2773 | 0.5616 | 0.5928 | 0.3209 | 0.5928 |
9 | Zlatorog | 0.8706 | 0.2834 | 0.5633 | 0.6014 | 0.2903 | 0.6014 |
10 | RSDO-DS4-NMT 1.2.2 | 0.8705 | 0.2794 | 0.5634 | 0.5956 | 0.3226 | 0.5956 |
10 | META LLAMA 3.1 405B | 0.8705 | 0.2637 | 0.5497 | 0.5930 | 0.3063 | 0.5930 |
12 | RSDO-DS4-NMT 1.2 | 0.8698 | 0.2781 | 0.5602 | 0.5970 | 0.3177 | 0.5970 |
Slovene to English translation (first 10 models on the benchmark)
Rank | Title | BERT score | BLEU (avg) | METEOR (avg) | CHRF (avg) | BLEU (corpus) | CHRF (corpus) |
---|---|---|---|---|---|---|---|
1 | gpt-4o | 0.9496 | 0.3161 | 0.6655 | 0.6297 | 0.3496 | 0.6297 |
2 | gemini-1.5-pro | 0.9489 | 0.3117 | 0.6560 | 0.6237 | 0.3502 | 0.6237 |
3 | gpt-4o-mini | 0.9466 | 0.2976 | 0.6493 | 0.6197 | 0.3328 | 0.6197 |
4 | GaMS-27B-Instruct | 0.9455 | 0.2836 | 0.6270 | 0.5972 | 0.3200 | 0.5972 |
5 | GaMS-9B-Instruct | 0.9454 | 0.2821 | 0.6275 | 0.6018 | 0.3141 | 0.6018 |
6 | ChatGPTv1 | 0.9449 | 0.2852 | 0.6415 | 0.6096 | 0.3171 | 0.6096 |
7 | RSDO-DS4-NMT 1.2.4 | 0.9434 | 0.2839 | 0.6227 | 0.5967 | 0.3290 | 0.5967 |
8 | RSDO-DS4-NMT 1.2.6 | 0.9433 | 0.2832 | 0.6207 | 0.5944 | 0.3295 | 0.5944 |
9 | RSDO-DS4-NMT 1.2.2 | 0.9431 | 0.2785 | 0.6184 | 0.5933 | 0.3240 | 0.5933 |
9 | RSDO-DS4-NMT 1.2 | 0.9431 | 0.2805 | 0.6201 | 0.5941 | 0.3231 | 0.5941 |
11 | eTranslation SLEN | 0.9414 | 0.2729 | 0.6175 | 0.5913 | 0.3119 | 0.5913 |
🔧 Technical Details
Model Architecture
The model is based on Google's Gemma 2 family and has been continually pre-trained on diverse corpora. Training proceeded in multiple stages and used parallelism techniques (tensor and sequence parallelism, activation recomputation) for efficient large-scale training.
Training Process
- CPT: The model was trained in two stages using the NVIDIA NeMo 2.0 framework. Tensor parallelism, sequence parallelism, and activation recomputation were used.
- SFT: The 27B model was supervised fine-tuned using the NeMo framework, with the hyperparameters listed in the Training section above.
📄 License
The model is licensed under the Gemma license.
Usage and Limitations (taken from Gemma 2)
Intended Usage
Open Large Language Models (LLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.
- Content Creation and Communication
- Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
- Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
- Research and Education
- Natural Language Processing (NLP) Research: These models can serve as a foundation for researchers to experiment with NLP techniques, develop algorithms, and contribute to the advancement of the field.
- Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
- Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
Limitations
- Training Data
- The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
- The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
- LLMs are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
- A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
- Natural language is inherently complex. LLMs might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
- LLMs generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
- LLMs rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
Ethical Considerations and Risks
The development of large language models (LLMs) raises several ethical concerns. In creating an open model, we have carefully considered these risks.

