
🚀 MERAK-7B-V2 GGML
These GGML format model files are designed for CPU + GPU inference.
This README is adapted from TheBloke. These files are in GGML format for MERAK-7B-V2.
GGML files support CPU + GPU inference using llama.cpp and are compatible with numerous libraries and UIs, including:
- KoboldCpp: A robust GGML web UI with built-in full GPU acceleration, ideal for storytelling.
- LoLLMS Web UI: A great web UI that enables GPU acceleration via the c_transformers backend.
- LM Studio: A fully featured local GUI. It supports full GPU acceleration on macOS and also works on Windows without GPU acceleration.
- [text-generation-webui](https://github.com/oobabooga/text-generation-webui): The most popular web UI, though it requires additional steps to enable GPU acceleration via the llama.cpp backend.
- ctransformers: A Python library with LangChain support and an OpenAI-compatible AI server.
- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python): A Python library with an OpenAI-compatible API server (see the sketch after this list).
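For example, loading one of the GGML files below with llama-cpp-python might look like the following. This is a minimal sketch, assuming an older llama-cpp-python release that still reads GGMLv3 files (newer releases expect GGUF) and a hypothetical local path to the downloaded .bin file:

from llama_cpp import Llama

# Hypothetical local path to one of the quantized files listed below.
llm = Llama(
    model_path="./Merak-7B-v2.ggmlv3.q4_K_M.bin",
    n_ctx=2048,       # context window
    n_gpu_layers=32,  # layers to offload to the GPU; use 0 for CPU-only inference
)

# Merak-7B-v2 uses the <|prompt|> ... <|answer|> template.
prompt = "<|prompt|>Siapa penulis naskah proklamasi kemerdekaan Indonesia?\n<|answer|>"
output = llm(prompt, max_tokens=200, temperature=0.3, repeat_penalty=1.2)
print(output["choices"][0]["text"].strip())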
✨ Features
- Multi-language support: Supports languages such as Indonesian (id) and English (en).
- Diverse quantization methods: Offers both the original llama.cpp quant methods and the new k-quant methods.
- Efficient performance: Leveraging QLoRA, it can run with 16 GB VRAM.
📦 Installation
Please ensure you have installed the CUDA driver, Python 3.10, and PyTorch 2 on your system. Then install the following libraries in the terminal:
pip install bitsandbytes==0.39.1
pip install transformers==4.31.0
pip install peft==0.4.0
pip install accelerate==0.20.3
pip install einops==0.6.1 scipy sentencepiece datasets
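To verify the environment before loading the model, a quick check like the following can be run (a minimal sketch; it only confirms that the libraries import and that PyTorch can see a CUDA device):

import torch
import transformers
import peft
import bitsandbytes  # imported only to confirm it installs correctly

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)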
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer
from peft import PeftModel, PeftConfig

model_id = "Ichsan2895/Merak-7B-v2"
config = AutoConfig.from_pretrained(model_id)

# 4-bit NF4 quantization keeps VRAM usage low.
BNB_CONFIG = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16,
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_quant_type="nf4")

model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=BNB_CONFIG,
                                             device_map="auto",
                                             trust_remote_code=True)
tokenizer = LlamaTokenizer.from_pretrained(model_id)

def generate_response(question: str) -> str:
    # Merak-7B-v2 uses the <|prompt|> ... <|answer|> prompt template.
    prompt = f"<|prompt|>{question}\n<|answer|>".strip()
    encoding = tokenizer(prompt, return_tensors='pt').to("cuda")
    with torch.inference_mode():
        outputs = model.generate(input_ids=encoding.input_ids,
                                 attention_mask=encoding.attention_mask,
                                 eos_token_id=tokenizer.pad_token_id,
                                 do_sample=False,
                                 num_beams=2,
                                 temperature=0.3,
                                 repetition_penalty=1.2,
                                 max_length=200)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Return only the text after the <|answer|> marker.
    assistant_start = "<|answer|>"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start):].strip()

# "Who wrote the text of Indonesia's proclamation of independence?"
prompt = "Siapa penulis naskah proklamasi kemerdekaan Indonesia?"
print(generate_response(prompt))
Advanced Usage
For better answers, skip the BitsAndBytes 4-bit quantization and load the model at full precision, though this requires more VRAM.
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, LlamaTokenizer
from peft import PeftModel, PeftConfig

model_id = "Ichsan2895/Merak-7B-v2"
config = AutoConfig.from_pretrained(model_id)

# No quantization config: the model is loaded at full precision.
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             device_map="auto",
                                             trust_remote_code=True)
tokenizer = LlamaTokenizer.from_pretrained(model_id)

def generate_response(question: str) -> str:
    prompt = f"<|prompt|>{question}\n<|answer|>".strip()
    encoding = tokenizer(prompt, return_tensors='pt').to("cuda")
    with torch.inference_mode():
        outputs = model.generate(input_ids=encoding.input_ids,
                                 attention_mask=encoding.attention_mask,
                                 eos_token_id=tokenizer.pad_token_id,
                                 do_sample=False,
                                 num_beams=2,
                                 temperature=0.3,
                                 repetition_penalty=1.2,
                                 max_length=200)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    assistant_start = "<|answer|>"
    response_start = response.find(assistant_start)
    return response[response_start + len(assistant_start):].strip()

prompt = "Siapa penulis naskah proklamasi kemerdekaan Indonesia?"
print(generate_response(prompt))
📚 Documentation
Compatibility
Original llama.cpp quant methods: q4_0, q4_1, q5_0, q5_1, q8_0
These methods are guaranteed to be compatible with any UIs, tools, and libraries released since late May. However, they may be phased out soon, as they are largely replaced by the new k-quant methods.
New k-quant methods: q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K
These new quantization methods are compatible with llama.cpp as of June 6th, commit 2d43387. They are also compatible with recent releases of text-generation-webui, KoboldCpp, llama-cpp-python, ctransformers, rustformers, and most other tools. For compatibility with other tools and libraries, please refer to their documentation.
Explanation of the new k-quant methods
The new methods available are:
- GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This effectively uses 2.5625 bits per weight (bpw).
- GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This results in 3.4375 bpw.
- GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw (see the quick arithmetic check after this list).
- GGML_TYPE_Q5_K - "type-1" 5-bit quantization. It has the same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
- GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks have 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
- GGML_TYPE_Q8_K - "type-0" 8-bit quantization. It is only used for quantizing intermediate results. The difference from the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
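As a quick arithmetic check of the 4.5 bpw figure for GGML_TYPE_Q4_K (a sketch assuming the llama.cpp k-quant layout, in which each super-block also stores one fp16 scale and one fp16 min on top of the per-block 6-bit scales and mins):

# 8 blocks of 32 weights per super-block
weights = 8 * 32                      # 256 weights
quant_bits = weights * 4              # 4-bit quants
block_scale_min_bits = 8 * (6 + 6)    # 6-bit scale + 6-bit min per block
superblock_fp16_bits = 2 * 16         # fp16 super-block scale and min (assumption)

print((quant_bits + block_scale_min_bits + superblock_fp16_bits) / weights)  # -> 4.5 bpw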
Refer to the Provided Files table below to see which files use which methods and how.
Provided files
Name | Quant method | Bits | Use case |
---|---|---|---|
Merak-7B-v2.ggmlv3.q2_K.bin | q2_K | 2 | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
Merak-7B-v2.ggmlv3.q3_K_L.bin | q3_K_L | 3 | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
Merak-7B-v2.ggmlv3.q3_K_M.bin | q3_K_M | 3 | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
Merak-7B-v2.ggmlv3.q3_K_S.bin | q3_K_S | 3 | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
Merak-7B-v2.ggmlv3.q4_0.bin | q4_0 | 4 | Original quant method, 4-bit. |
Merak-7B-v2.ggmlv3.q4_1.bin | q4_1 | 4 | Original quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than q5 models. |
Merak-7B-v2.ggmlv3.q4_K_M.bin | q4_K_M | 4 | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K. |
Merak-7B-v2.ggmlv3.q4_K_S.bin | q4_K_S | 4 | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
Merak-7B-v2.ggmlv3.q5_0.bin | q5_0 | 5 | Original quant method, 5-bit. Higher accuracy, higher resource usage, and slower inference. |
Merak-7B-v2.ggmlv3.q5_1.bin | q5_1 | 5 | Original quant method, 5-bit. Even higher accuracy and resource usage, and slower inference. |
Merak-7B-v2.ggmlv3.q5_K_M.bin | q5_K_M | 5 | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K. |
Merak-7B-v2.ggmlv3.q5_K_S.bin | q5_K_S | 5 | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
Merak-7B-v2.ggmlv3.q6_K.bin | q6_K | 6 | New k-quant method. Uses GGML_TYPE_Q8_K for all tensors - 6-bit quantization. |
Merak-7B-v2.ggmlv3.q8_0.bin | q8_0 | 8 | Original quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |
How to run in text-generation-webui
Further instructions can be found here: [text-generation-webui/docs/llama.cpp-models.md](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md).
Original model card: 6TH PROTOTYPE OF MERAK-7B-V2!
Merak-7B is a Large Language Model for the Indonesian language. This model is based on Meta's Llama-2-7B-Chat-HF and fine-tuned on some pre-cleaned Indonesian Wikipedia articles.
Leveraging QLoRA (QLoRA: Efficient Finetuning of Quantized LLMs), Merak-7B can run with 16 GB VRAM.
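The fine-tuning code itself is not included in this card, but a minimal QLoRA-style setup with the libraries pinned above might look like the sketch below. The LoRA hyperparameters (r, alpha, target modules) are illustrative assumptions, not the values used to train Merak-7B:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit NF4, as in the QLoRA paper.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16,
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                             quantization_config=bnb_config,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; the hyperparameters here are illustrative.
lora_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, bias="none",
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable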
Licensed under Creative Commons Attribution-ShareAlike-NonCommercial (CC-BY-SA-NC 4.0), Merak-7B empowers AI enthusiasts and researchers alike.
Big thanks to all my friends and communities that helped build our first model. Feel free to ask me about the model and share the news on your social media.
CHANGELOG
- v1: The first Merak-7B model. We selected and cleaned about 200k ID Wikipedia articles.
- v2: A finetuned version of the first Merak-7B model. We finetuned it again with the same ID Wikipedia articles, except we changed the prompt style in the questions.
CITATION
@article{touvron2023llama2,
  author  = {Touvron, Hugo and others},
  title   = {Llama 2: Open Foundation and Fine-Tuned Chat Models},
  journal = {arXiv preprint arXiv:2307.09288},
  year    = {2023}
}
@ONLINE{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}
@inproceedings{wolf-etal-2020-transformers,
title = "Transformers: State-of-the-Art Natural Language Processing",
author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = oct,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
pages = "38--45"
}
@article{dettmers2023qlora,
title = {QLoRA: Efficient Finetuning of Quantized LLMs},
author = {Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
journal = {arXiv preprint arXiv:2305.14314},
year = {2023}
}
📄 License
The model is licensed under Creative Commons Attribution-ShareAlike-NonCommercial (CC-BY-SA-NC 4.0).
📋 Information Table
Property | Details |
---|---|
Model Type | llama |
Training Data | wikipedia |
License | llama2 |
Pipeline Tag | text-generation |
Tags | facebook, meta, pytorch, llama, llama-2 |
Language | id, en |
Inference | false |

