🚀 Model Card for GaMS-27B-Instruct
GaMS-2B, GaMS-9B, and GaMS-27B are new, improved, and larger models in the GaMS (Generative Model for Slovene) family. These models are based on Google's Gemma 2 family and have been continually pre-trained on Slovene, English, and some Croatian, Serbian, and Bosnian corpora.
This is the SFT version of the GaMS-27B model.
🚀 Quick Start
The model can be run through the `pipeline` API. Here's how you can get started:
from transformers import pipeline
model_id = "cjvt/GaMS-27B-Instruct"
pline = pipeline(
"text-generation",
model=model_id,
device_map="cuda" # replace with "mps" to run on a Mac device
)
# Example of response generation
message = [{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
# Example of conversation chain
new_message = response[0]["generated_text"]
new_message.append({"role": "user", "content": "Lahko bolj podrobno opišeš ta dogodek?"})
response = pline(new_message, max_new_tokens=1024)
print("Model's response:", response[0]["generated_text"][-1]["content"])
✨ Features
- Multilingual Support: Supports Slovene, English (primary), and Croatian, Bosnian, and Serbian (secondary). It may also work for other languages supported by Gemma 2.
- Based on a Strong Foundation: Built on Google's Gemma 2 family, with continual pre-training on diverse corpora.
- SFT Version: This is the supervised fine-tuned (SFT) version of the GaMS-27B model, tuned to follow instructions and hold conversations.
📦 Installation
No specific installation steps are provided in the original README. The code examples in this card use the Hugging Face transformers library; install it if you haven't already:
pip install transformers
💻 Usage Examples
Basic Usage
from transformers import pipeline
model_id = "cjvt/GaMS-27B-Instruct"
pline = pipeline(
"text-generation",
model=model_id,
device_map="cuda" # replace with "mps" to run on a Mac device
)
# Example of response generation
message = [{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
Advanced Usage
# For multi-GPU inference
from transformers import pipeline
model_id = "cjvt/GaMS-27B-Instruct"
pline = pipeline(
"text-generation",
model=model_id,
device_map="auto"
)
# Example of response generation
message = [{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
# Example of conversation chain
new_message = response[0]["generated_text"]
new_message.append({"role": "user", "content": "Lahko bolj podrobno opišeš ta dogodek?"})
response = pline(new_message, max_new_tokens=1024)
print("Model's response:", response[0]["generated_text"][-1]["content"])
📚 Documentation
Basic information
Property | Details |
---|---|
Developed by | A team of researchers at the University of Ljubljana, Faculty of Computer and Information Science. Team members: Domen Vreš, Iztok Lebar Bajec, Tjaša Arčon, Gašper Jelovčan, and Marko Robnik-Šikonja. |
Languages | Slovene, English (primary), Croatian, Bosnian and Serbian (secondary). The model might also work for other languages supported by Gemma 2, even though it was not continually pretrained on them. |
Base model | cjvt/GaMS-27B |
License | Gemma |
Acknowledgment
The model was developed within the PoVeJMo research program (Adaptive Natural Language Processing with Large Language Models), particularly within the research project titled SloLLaMai -- Open-access computationally efficient models for Slovenian. The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU. The authors also acknowledge the financial support from the Slovenian Research and Innovation Agency (research core funding No. P6-0411 -- Language Resources and Technologies for Slovene).
We thank everyone who worked on data collection and preparation, enabling us to train our model. Special thanks go to Nikola Ljubešić, Taja Kuzman, Tjaša Arčon, Jaka Čibej, Simon Krek, Tomaž Erjavec, Iztok Kosem and Tomaž Savodnik.
Data
CPT Data
The model was continually pre-trained in two stages. In the first stage, parallel English-Slovene (and in some cases Croatian) corpora were used to align the languages. In the second stage, the model was trained on separate English, Slovene, Croatian, Bosnian, and Serbian corpora.
Parallel alignment corpora
Corpus | Alignment level | # Tokens | Percentage |
---|---|---|---|
KAS Abstracts | Document level | 31 M | 1.65 % |
DGT | Separate documents | 697 M | 36.56 % |
MaCoCu Parallel | Separate documents | 430 M | 22.53 % |
CC-News | Paragraph level | 749 M | 39.25 % |
Total | | 1.91 B | |
Explanation of each alignment level:
- Document level: Parallel documents were concatenated into a single document
- Separate documents: Parallel documents were not explicitly aligned
- Paragraph level: Paragraphs of parallel documents were concatenated (the first paragraph of the Slovene/English document was followed by the first paragraph in the other language, which was then followed by the second paragraph in the first language, and so on); see the sketch after this list
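To make the paragraph-level scheme concrete, here is a small illustrative sketch (not the actual preprocessing code) of how paragraphs from two parallel documents would be interleaved into a single training document:

```python
from itertools import chain, zip_longest

def interleave_paragraphs(doc_a: list[str], doc_b: list[str]) -> str:
    """Alternate paragraphs of two parallel documents into one document."""
    pairs = zip_longest(doc_a, doc_b, fillvalue="")
    paragraphs = [p for p in chain.from_iterable(pairs) if p]
    return "\n\n".join(paragraphs)

# Hypothetical two-paragraph parallel documents.
sl = ["Prvi odstavek v slovenščini.", "Drugi odstavek v slovenščini."]
en = ["First paragraph in English.", "Second paragraph in English."]
print(interleave_paragraphs(sl, en))
```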
Second stage corpora
Corpus | Language | # Tokens | Percentage |
---|---|---|---|
KAS | Slovene | 2.77 B | 20.34 % |
MetaFida* | Slovene | 4.66 B | 34.18 % |
Wikipedia-En (Date: January 23rd 2025) | English | 5.45 B | 39.99 % |
Wikipedia-Sl (Date: January 1st 2025) | Slovene | 0.16 B | 1.19 % |
Wikipedia-Hr (Date: January 1st 2025) | Croatian | 0.15 B | 1.13 % |
Wikipedia-Bs (Date: January 1st 2025) | Bosnian | 0.07 B | 0.50 % |
Wikipedia-Sr-Latin* | Serbian | 0.36 B | 2.68 % |
Total | | 13.62 B | |
Remarks:
- The following corpora were excluded from MetaFida: dgt15_sl, classlawiki_sl, tweet_sl, janes_tweet, janes_forum, janes_news
- Serbian Wikipedia was converted from Cyrillic to Latin (a conversion sketch follows these remarks)
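The original README does not state which tool performed the Cyrillic-to-Latin conversion; a minimal, purely illustrative sketch with the `cyrtranslit` package (our assumption, not the authors' tooling) would be:

```python
# pip install cyrtranslit
import cyrtranslit

# Convert Serbian Cyrillic text to the Latin script.
text_cyrillic = "Добродошли у Википедију."
text_latin = cyrtranslit.to_latin(text_cyrillic, "sr")
print(text_latin)  # Dobrodošli u Vikipediju.
```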
SFT Data
Our SFT training data consisted of approximately 25,000 training and 1,500 validation examples. The dataset was a mixture of the following datasets:
- GaMS-Instruct-GEN 1.0
- GaMS-Instruct-DH 1.0: 3,000 randomly selected examples were taken from this dataset
- GaMS-Instruct-MED 1.0: 3,000 randomly selected examples were taken from this dataset
- Parallel corpus EN-SL RSDO4 2.0: additional filtering was applied to this corpus. First, we ran FastText language identification using [NeMo Curator](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentification.html) and kept only the examples where the source was detected as English and the target as Slovene. Next, we ran the [COMET](https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xxl) model to evaluate the translations and kept only the examples with COMET scores higher than 0.945 (approximately 8,000 examples); an illustrative filtering sketch follows this list.
- Aya Dataset: only the English and Serbian examples were taken from this dataset. Serbian examples were converted from Cyrillic to Latin.
- Math competitions: we took PDFs of Slovene national math competitions held between 2001 and 2010, extracted the text with [olmOCR](https://huggingface.co/allenai/olmOCR-7B-0225-preview), and manually corrected it. This gave us around 150 solved math problems.
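The language-identification and COMET filtering described for the RSDO4 corpus can be sketched as follows. This is only an illustration, not the authors' NeMo Curator pipeline: it assumes the `fasttext` package with the public `lid.176.bin` model and the `unbabel-comet` package, and reuses the 0.945 threshold stated above:

```python
import fasttext
from comet import download_model, load_from_checkpoint

# FastText language identification (lid.176.bin is the public fastText LID model).
lid = fasttext.load_model("lid.176.bin")

def detected_lang(text: str) -> str:
    labels, _ = lid.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")

# Reference-free COMET quality-estimation model named in the list above.
comet = load_from_checkpoint(download_model("Unbabel/wmt23-cometkiwi-da-xxl"))

pairs = [  # hypothetical sentence pair; real input would be the full corpus
    {"src": "The weather is nice today.", "mt": "Danes je lepo vreme."},
]

# Keep pairs whose source is detected as English and target as Slovene ...
candidates = [p for p in pairs if detected_lang(p["src"]) == "en" and detected_lang(p["mt"]) == "sl"]
# ... and whose COMET score exceeds 0.945 (use gpus=0 if no GPU is available).
scores = comet.predict(candidates, batch_size=8, gpus=1).scores
filtered = [p for p, s in zip(candidates, scores) if s > 0.945]
print(f"Kept {len(filtered)} of {len(pairs)} pairs")
```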
Training
The model was trained on the Booster partition of Leonardo HPC.
CPT
We continually pretrained the model using the NVIDIA NeMo 2.0 framework. The model was trained in BF16-Mixed precision using tensor parallelism across 8 GPUs, sequence parallelism, and activation recomputation. Training ran on 32 nodes, each containing 4 A100 64 GB GPUs. The parallel alignment stage took approximately 8 hours and the second stage approximately 110 hours.
The model was trained using a cosine learning rate scheduler with linear warmup and the following hyperparameters (an illustrative sketch of the schedule follows the two lists below).
Parallel alignment:
- warmup steps: 150
- minimal learning rate: 5e-6
- maximal learning rate: 2e-5
- constant steps: 0
- batch size: 512 (4 million tokens)
Second stage:
- warmup steps: 500
- minimal learning rate: 5e-6
- maximal learning rate: 5e-5
- constant steps: 100
- batch size: 512 (4 million tokens)
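For illustration, the sketch below reproduces such a schedule in plain Python. It is not the NeMo configuration itself, and placing the constant phase immediately after warmup at the maximal rate is an assumption about the scheduler's exact behavior:

```python
import math

def learning_rate(step: int, total_steps: int, warmup_steps: int,
                  constant_steps: int, max_lr: float, min_lr: float) -> float:
    """Linear warmup, optional constant phase, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + constant_steps:
        return max_lr  # assumption: constant phase held at the maximal rate
    decay_steps = max(total_steps - warmup_steps - constant_steps, 1)
    progress = min((step - warmup_steps - constant_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Second-stage values from the list above; total_steps is a hypothetical placeholder.
print(learning_rate(step=600, total_steps=3000, warmup_steps=500,
                    constant_steps=100, max_lr=5e-5, min_lr=5e-6))
```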
SFT
In contrast to the 2B and 9B models, the 27B model was supervised fine-tuned using the NeMo framework, which enables easier scaling. The model was trained in BF16 precision, using tensor parallelism to split it across 4 GPUs, sequence parallelism, and activation recomputation. Training ran on 8 nodes, each with 4 A100 64 GB GPUs.
The model was tuned using a cosine learning rate scheduler with linear warmup and the following hyperparameters:
- number of epochs: training ran for 5 epochs, but the best-performing model according to validation loss was obtained after the second epoch, so we kept that checkpoint
- batch size: 128
- warmup steps: 150
- minimal learning rate: 1e-8
- maximal learning rate: 5e-6
- constant steps: 0
Evaluation
The models were evaluated using the Slovene SuperGLUE collection of classification tasks on SloBench. The Instruct version of the model was also evaluated on translation from English to Slovene and from Slovene to English. Additionally, we evaluated our models on [Slovenian-LLM-Eval](https://huggingface.co/datasets/cjvt/slovenian-llm-eval).
Code for evaluation:
- SloBench tasks
- [Slovenian-LLM-Eval](https://github.com/SloLama/slovenian-llm-eval)
Slovenian-LLM-Eval results
A comparison between the GaMS models, the base Gemma 2 models, and SlovenianGPT (an open-source model for Slovene based on Mistral 7B) is shown in the figure below. All models were evaluated in a 0-shot scenario.
SloBench Results
GaMS 2B, 9B, 27B, and 27B-Instruct models were evaluated in a 3-shot scenario, except for MultiRC and the translation tasks, where 0-shot was used. GaMS-2B-Instruct and GaMS-9B-Instruct were evaluated in a 0-shot scenario on all tasks. We used guided decoding to ensure the correct format of the responses.
Slovene SuperGLUE
Rank | Title | Average | BoolQ Accuracy | CB Accuracy | CB F1 Score | CB Average | COPA Accuracy | MultiRC EM | MultiRC F1a Score | MultiRC Average | RTE Accuracy | WSC Accuracy |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | GaMS-27B | 0.7601 | 0.8333 | 0.6440 | 0.5864 | 0.6152 | 0.9540 | 0.3904 | 0.7504 | 0.5704 | 0.7931 | 0.7945 |
2 | PrešernGPT 0.1 | 0.7568 | 0.8333 | 0.8520 | 0.5868 | 0.7194 | 0.9740 | 0.4985 | 0.8061 | 0.6523 | 0.8276 | 0.5342 |
3 | Gemma 2 27B | 0.7546 | 0.8333 | 0.6680 | 0.5972 | 0.6326 | 0.9140 | 0.4174 | 0.7295 | 0.5735 | 0.8276 | 0.7466 |
4 | GaMS-9B | 0.7309 | 0.7000 | 0.8400 | 0.7955 | 0.8178 | 0.9000 | 0.3243 | 0.6551 | 0.4897 | 0.7931 | 0.6849 |
5 | GaMS-27B-Instruct | 0.7038 | 0.8333 | 0.7200 | 0.6322 | 0.6761 | 0.9400 | 0.0511 | 0.5813 | 0.3162 | 0.7931 | 0.6644 |
6 | GaMS-9B-Instruct (0-shot) | 0.6997 | 0.8000 | 0.7960 | 0.7128 | 0.7544 | 0.8140 | 0.0721 | 0.6174 | 0.3447 | 0.7931 | 0.6918 |
7 | Gemma 2 9B | 0.6980 | 0.8333 | 0.8240 | 0.5683 | 0.6962 | 0.8700 | 0.2282 | 0.5310 | 0.3796 | 0.7241 | 0.6849 |
9 | CroSloEngual BERT | 0.6078 | 0.7333 | 0.7920 | 0.7437 | 0.7679 | 0.5720 | 0.0931 | 0.5241 | 0.3086 | 0.6552 | 0.6096 |
12 | SlovenianGPT-Chat | 0.5078 | 0.7333 | 0.3920 | 0.3829 | 0.3874 | 0.6840 | 0.2432 | 0.4944 | 0.3688 | 0.5172 | 0.3562 |
13 | Gemma 2 2B | 0.4876 | 0.6333 | 0.4520 | 0.2123 | 0.3321 | 0.5180 | 0.1471 | 0.4419 | 0.2945 | 0.5862 | 0.5616 |
14 | GaMS-2B | 0.4790 | 0.5667 | 0.6080 | 0.4880 | 0.5480 | 0.5240 | 0.0631 | 0.5234 | 0.2932 | 0.5517 | 0.3904 |
15 | GaMS-2B-Instruct (0-shot) | 0.4608 | 0.6667 | 0.5120 | 0.2611 | 0.3866 | 0.5000 | 0.0420 | 0.5377 | 0.2899 | 0.5517 | 0.3699 |
16 | GaMS-1B | 0.4604 | 0.5000 | 0.6200 | 0.4565 | 0.5382 | 0.4920 | 0.1351 | 0.2675 | 0.2013 | 0.4828 | 0.5479 |
17 | GaMS-1B-Chat | 0.4570 | 0.8000 | 0.4880 | 0.3023 | 0.3951 | 0.4840 | 0.1081 | 0.2428 | 0.1755 | 0.5172 | 0.3692 |
English to Slovene translation (first 11 models on the benchmark)
Rank | Title | BERT score | BLEU (avg) | METEOR (avg) | CHRF (avg) | BLEU (corpus) | CHRF (corpus) |
---|---|---|---|---|---|---|---|
1 | DeepL Translator | 0.8812 | 0.3153 | 0.5902 | 0.6205 | 0.3599 | 0.6205 |
2 | gemini-1.5-pro | 0.8791 | 0.3124 | 0.5895 | 0.6176 | 0.3569 | 0.6176 |
3 | Sonnet 3.5 | 0.8789 | 0.3059 | 0.5783 | 0.6204 | 0.3442 | 0.6204 |
4 | gpt-4o | 0.8784 | 0.2958 | 0.5811 | 0.6138 | 0.3379 | 0.6138 |
5 | EuroLLM-9B-Instruct | 0.8741 | 0.2927 | 0.5792 | 0.6055 | 0.3273 | 0.6055 |
6 | GaMS-27B-Instruct | 0.8734 | 0.2866 | 0.5688 | 0.5986 | 0.3246 | 0.5986 |
7 | seamless-m4t-v2-large | 0.8731 | 0.2780 | 0.5599 | 0.5947 | 0.3085 | 0.5947 |
8 | GaMS-9B-Instruct | 0.8713 | 0.2773 | 0.5616 | 0.5928 | 0.3209 | 0.5928 |
9 | Zlatorog | 0.8706 | 0.2834 | 0.5633 | 0.6014 | 0.2903 | 0.6014 |
10 | RSDO-DS4-NMT 1.2.2 | 0.8705 | 0.2794 | 0.5634 | 0.5956 | 0.3226 | 0.5956 |
10 | META LLAMA 3.1 405B | 0.8705 | 0.2637 | 0.5497 | 0.5930 | 0.3063 | 0.5930 |
12 | RSDO-DS4-NMT 1.2 | 0.8698 | 0.2781 | 0.5602 | 0.5970 | 0.3177 | 0.5970 |
Slovene to English translation (first 10 models on the benchmark)
Rank | Title | BERT score | BLEU (avg) | METEOR (avg) | CHRF (avg) | BLEU (corpus) | CHRF (corpus) |
---|---|---|---|---|---|---|---|
1 | gpt-4o | 0.9496 | 0.3161 | 0.6655 | 0.6297 | 0.3496 | 0.6297 |
2 | gemini-1.5-pro | 0.9489 | 0.3117 | 0.6560 | 0.6237 | 0.3502 | 0.6237 |
3 | gpt-4o-mini | 0.9466 | 0.2976 | 0.6493 | 0.6197 | 0.3328 | 0.6197 |
4 | GaMS-27B-Instruct | 0.9455 | 0.2836 | 0.6270 | 0.5972 | 0.3200 | 0.5972 |
5 | GaMS-9B-Instruct | 0.9454 | 0.2821 | 0.6275 | 0.6018 | 0.3141 | 0.6018 |
6 | ChatGPTv1 | 0.9449 | 0.2852 | 0.6415 | 0.6096 | 0.3171 | 0.6096 |
7 | RSDO-DS4-NMT 1.2.4 | 0.9434 | 0.2839 | 0.6227 | 0.5967 | 0.3290 | 0.5967 |
8 | RSDO-DS4-NMT 1.2.6 | 0.9433 | 0.2832 | 0.6207 | 0.5944 | 0.3295 | 0.5944 |
9 | RSDO-DS4-NMT 1.2.2 | 0.9431 | 0.2785 | 0.6184 | 0.5933 | 0.3240 | 0.5933 |
9 | RSDO-DS4-NMT 1.2 | 0.9431 | 0.2805 | 0.6201 | 0.5941 | 0.3231 | 0.5941 |
11 | eTranslation SLEN | 0.9414 | 0.2729 | 0.6175 | 0.5913 | 0.3119 | 0.5913 |
🔧 Technical Details
Model Architecture
The model is based on Google's Gemma 2 family and has been continually pre-trained on diverse corpora. Training proceeded in multiple stages and used parallelism techniques (tensor and sequence parallelism, activation recomputation) for efficient large-scale training.
Training Process
- CPT: The model was trained in two stages using the NVIDIA NeMo 2.0 framework. Tensor parallelism, sequence parallelism, and activation recomputation were used.
- SFT: The 27B model was supervised fine-tuned using the NeMo framework, with the hyperparameters listed in the Training section above.
📄 License
The model is licensed under the Gemma license.
Usage and Limitations (taken from Gemma 2)
Intended Usage
Open Large Language Models (LLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.
- Content Creation and Communication
- Text Generation: These models can be used to generate creative text formats such as poems, scripts, code, marketing copy, and email drafts.
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications.
- Text Summarization: Generate concise summaries of a text corpus, research papers, or reports.
- Research and Education
- Natural Language Processing (NLP) Research: These models can serve as a foundation for researchers to experiment with NLP techniques, develop algorithms, and contribute to the advancement of the field.
- Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice.
- Knowledge Exploration: Assist researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
Limitations
- Training Data
- The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
- The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
- LLMs are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
- A model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
- Natural language is inherently complex. LLMs might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
- LLMs generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense
- LLMs rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.
Ethical Considerations and Risks
The development of large language models (LLMs) raises several ethical concerns. In creating an open model, we have carefully considered these risks.

