
🚀 MERAK-7B-V2 GGML
These GGML format model files are designed for CPU + GPU inference.
This README is adapted from TheBloke. These files are in GGML format for MERAK-7B-V2.
GGML files support CPU + GPU inference using llama.cpp and are compatible with numerous libraries and UIs, including:
- KoboldCpp: A robust GGML web UI with built-in full GPU acceleration, ideal for storytelling.
- LoLLMS Web UI: A great web UI that enables GPU acceleration via the c_transformers backend.
- LM Studio: A fully featured local GUI. It supports full GPU acceleration on macOS and also works on Windows without GPU acceleration.
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui): The most popular web UI, though it requires additional steps to enable GPU acceleration via the llama.cpp backend.
- ctransformers: A Python library with LangChain support and an OpenAI-compatible AI server.
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python): A Python library with an OpenAI-compatible API server (see the sketch after this list).
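For example, loading one of the GGML files below with llama-cpp-python might look like the following. This is a minimal sketch, assuming an older llama-cpp-python release that still reads GGMLv3 files (newer releases expect GGUF) and a hypothetical local path to the downloaded .bin file:

from llama_cpp import Llama

# Hypothetical local path to one of the quantized files listed below.
llm = Llama(
    model_path="./Merak-7B-v2.ggmlv3.q4_K_M.bin",
    n_ctx=2048,       # context window
    n_gpu_layers=32,  # layers to offload to the GPU; use 0 for CPU-only inference
)

# Merak-7B-v2 uses the <|prompt|> ... <|answer|> template.
prompt = "<|prompt|>Siapa penulis naskah proklamasi kemerdekaan Indonesia?\n<|answer|>"
output = llm(prompt, max_tokens=200, temperature=0.3, repeat_penalty=1.2)
print(output["choices"][0]["text"].strip())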
✨ Features
- Multi-language support: Supports languages such as Indonesian (id) and English (en).
- Diverse quantization methods: Offers both the original llama.cpp quant methods and the new k-quant methods.
- Efficient performance: Leveraging QLoRA, it can run with 16 GB VRAM.
📦 Installation
Please ensure you have installed the CUDA driver, Python 3.10, and PyTorch 2 on your system. Then install the following libraries in the terminal:
pip install bitsandbytes==0.39.1
pip install transformers==4.31.0
pip install peft==0.4.0
pip install accelerate==0.20.3
pip install einops==0.6.1 scipy sentencepiece datasets
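To verify the environment before loading the model, a quick check like the following can be run (a minimal sketch; it only confirms that the libraries import and that PyTorch can see a CUDA device):

import torch
import transformers
import peft
import bitsandbytes  # imported only to confirm it installs correctly

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)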
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer
from peft import PeftModel, PeftConfig

model_id = "Ichsan2895/Merak-7B-v2"
config = AutoConfig.from_pretrained(model_id)

# 4-bit NF4 quantization keeps VRAM usage low.
BNB_CONFIG = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16,
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_quant_type="nf4")

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=BNB_CONFIG,
                                             device_map="auto",
                                             trust_remote_code=True)
tokenizer = LlamaTokenizer.from_pretrained(model_id)

def generate_response(question: str) -> str:
    # Merak-7B-v2 uses the <|prompt|> ... <|answer|> prompt template.
    prompt = f"<|prompt|>{question}\n<|answer|>".strip()
    encoding = tokenizer(prompt, return_tensors='pt').to("cuda")
    with torch.inference_mode():
        outputs = model.generate(input_ids=encoding.input_ids,
                                 attention_mask=encoding.attention_mask,
                                 eos_token_id=tokenizer.pad_token_id,
                                 do_sample=False,
                                 num_beams=2,
                                 temperature=0.3,
                                 repetition_penalty=1.2,
                                 max_length=200)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Return only the text after the <|answer|> marker.
    assistant_start = "<|answer|>"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start):].strip()

# "Who wrote the text of Indonesia's proclamation of independence?"
prompt = "Siapa penulis naskah proklamasi kemerdekaan Indonesia?"
print(generate_response(prompt))
Advanced Usage
For better answers, skip the BitsAndBytes 4-bit quantization and load the model at full precision, though this requires more VRAM.
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer
from peft import PeftModel, PeftConfig

model_id = "Ichsan2895/Merak-7B-v2"
config = AutoConfig.from_pretrained(model_id)

# No quantization config: the model is loaded at full precision.
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map="auto",
                                             trust_remote_code=True)
tokenizer = LlamaTokenizer.from_pretrained(model_id)

def generate_response(question: str) -> str:
    prompt = f"<|prompt|>{question}\n<|answer|>".strip()
    encoding = tokenizer(prompt, return_tensors='pt').to("cuda")
    with torch.inference_mode():
        outputs = model.generate(input_ids=encoding.input_ids,
                                 attention_mask=encoding.attention_mask,
                                 eos_token_id=tokenizer.pad_token_id,
                                 do_sample=False,
                                 num_beams=2,
                                 temperature=0.3,
                                 repetition_penalty=1.2,
                                 max_length=200)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    assistant_start = "<|answer|>"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start):].strip()

prompt = "Siapa penulis naskah proklamasi kemerdekaan Indonesia?"
print(generate_response(prompt))
📚 Documentation
Compatibility
Original llama.cpp quant methods: q4_0, q4_1, q5_0, q5_1, q8_0
These methods are guaranteed to be compatible with any UIs, tools, and libraries released since late May. However, they may be phased out soon, as they are largely replaced by the new k-quant methods.
New k-quant methods: q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K
These new quantization methods are compatible with llama.cpp as of June 6th, commit 2d43387. They are also compatible with recent releases of text-generation-webui, KoboldCpp, llama-cpp-python, ctransformers, rustformers, and most other tools. For compatibility with other tools and libraries, please refer to their documentation.
Explanation of the new k-quant methods
The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This effectively uses 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This results in 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw (see the quick arithmetic check after this list).
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. It has the same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks have 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
- GGML_TYPE_Q8_K - "type-0" 8-bit quantization. It is only used for quantizing intermediate results. The difference from the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
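As a quick arithmetic check of the 4.5 bpw figure for GGML_TYPE_Q4_K (a sketch assuming the llama.cpp k-quant layout, in which each super-block also stores one fp16 scale and one fp16 min on top of the per-block 6-bit scales and mins):

# 8 blocks of 32 weights per super-block
weights = 8 * 32                      # 256 weights
quant_bits = weights * 4              # 4-bit quants
block_scale_min_bits = 8 * (6 + 6)    # 6-bit scale + 6-bit min per block
superblock_fp16_bits = 2 * 16         # fp16 super-block scale and min (assumption)

print((quant_bits + block_scale_min_bits + superblock_fp16_bits) / weights)  # -> 4.5 bpw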
Refer to the Provided Files table below to see which files use which methods and how.
Provided files
Name | Quant method | Bits | Use case |
---|---|---|---|
Merak-7B-v2.ggmlv3.q2_K.bin | q2_K | 2 | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
Merak-7B-v2.ggmlv3.q3_K_L.bin | q3_K_L | 3 | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
Merak-7B-v2.ggmlv3.q3_K_M.bin | q3_K_M | 3 | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
Merak-7B-v2.ggmlv3.q3_K_S.bin | q3_K_S | 3 | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
Merak-7B-v2.ggmlv3.q4_0.bin | q4_0 | 4 | Original quant method, 4-bit. |
Merak-7B-v2.ggmlv3.q4_1.bin | q4_1 | 4 | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models. |
Merak-7B-v2.ggmlv3.q4_K_M.bin | q4_K_M | 4 | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K. |
Merak-7B-v2.ggmlv3.q4_K_S.bin | q4_K_S | 4 | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
Merak-7B-v2.ggmlv3.q5_0.bin | q5_0 | 5 | Original quant method, 5-bit. Higher accuracy, higher resource usage, and slower inference. |
Merak-7B-v2.ggmlv3.q5_1.bin | q5_1 | 5 | Original quant method, 5-bit. Even higher accuracy and resource usage, and slower inference. |
Merak-7B-v2.ggmlv3.q5_K_M.bin | q5_K_M | 5 | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K. |
Merak-7B-v2.ggmlv3.q5_K_S.bin | q5_K_S | 5 | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
Merak-7B-v2.ggmlv3.q6_K.bin | q6_K | 6 | New k-quant method. Uses GGML_TYPE_Q8_K for all tensors - 6-bit quantization. |
Merak-7B-v2.ggmlv3.q8_0.bin | q8_0 | 8 | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
How to run in text-generation-webui
Further instructions can be found here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).
Original model card: 6TH PROTOTYPE OF MERAK-7B-V2!
Merak-7B is a Large Language Model for the Indonesian language. This model is based on Meta's Llama-2-7B-Chat-HF and fine-tuned on some pre-cleaned Indonesian Wikipedia articles.
Leveraging QLoRA (QLoRA: Efficient Finetuning of Quantized LLMs), Merak-7B can run with 16 GB VRAM.
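The fine-tuning code itself is not included in this card, but a minimal QLoRA-style setup with the libraries pinned above might look like the sketch below. The LoRA hyperparameters (r, alpha, target modules) are illustrative assumptions, not the values used to train Merak-7B:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit NF4, as in the QLoRA paper.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16,
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             quantization_config=bnb_config,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; the hyperparameters here are illustrative.
lora_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, bias="none",
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable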
Licensed under Creative Commons Attribution-ShareAlike-NonCommercial (CC-BY-SA-NC 4.0), Merak-7B empowers AI enthusiasts and researchers alike.
Big thanks to all my friends and communities that helped build our first model. Feel free to ask me about the model and share the news on your social media.
CHANGELOG
- v1: The first Merak-7B model. We selected and cleaned about 200k ID Wikipedia articles.
- v2: A finetuned version of the first Merak-7B model. We finetuned it again with the same ID Wikipedia articles, except we changed the prompt style in the questions.
CITATION
@article{touvron2023llama2,
  author  = {Touvron, Hugo and others},
  title   = {Llama 2: Open Foundation and Fine-Tuned Chat Models},
  journal = {arXiv preprint arXiv:2307.09288},
  year    = {2023}
}
@ONLINE{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}
@inproceedings{wolf-etal-2020-transformers,
title = "Transformers: State-of-the-Art Natural Language Processing",
author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = oct,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
pages = "38--45"
}
@article{dettmers2023qlora,
title = {QLoRA: Efficient Finetuning of Quantized LLMs},
author = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
journal = {arXiv preprint arXiv:2305.14314},
year = {2023}
}
📄 License
The model is licensed under Creative Commons Attribution-ShareAlike-NonCommercial (CC-BY-SA-NC 4.0).
📋 Information Table
Property | Details |
---|---|
Model Type | llama |
Training Data | wikipedia |
License | llama2 |
Pipeline Tag | text-generation |
Tags | facebook, meta, pytorch, llama, llama-2 |
Language | id, en |
Inference | false |

