Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
The DictaLM-2.0-Instruct Large Language Model (LLM) is an instruction-tuned version of the [DictaLM-2.0](https://huggingface.co/dicta-il/dictalm2.0) generative model, fine-tuned on a variety of conversation datasets. It is part of an effort to adapt large language models to Hebrew, with an enhanced vocabulary and instruction-following capabilities.
For full details of this model, please read our [release blog post](https://dicta.org.il/dicta-lm) or the technical report.
This is the instruct-tuned full-precision model for chat. You can try the model out in a [live demo](https://huggingface.co/spaces/dicta-il/dictalm2.0-instruct-demo).
The full collection of base/instruct and unquantized/quantized versions of DictaLM-2.0 is available [here](https://huggingface.co/collections/dicta-il/dicta-lm-20-collection-661bbda397df671e4a430c27).
⨠Features
Instruction Format
To leverage instruction fine-tuning, surround your prompt with `[INST]` and `[/INST]` tokens. The very first instruction should begin with a beginning-of-sentence id; subsequent instructions should not. The assistant's generation is terminated by the end-of-sentence token id.
For example (the Hebrew turns ask about a favorite condiment and then about mayonnaise recipes):

```python
text = """<s>[INST] איזה רוטב אהוב עליך? [/INST]
טוב, אני די מחבב כמה טיפות מיץ לימון סחוט טרי. זה מוסיף בדיוק את הכמות הנכונה של טעם חמצמץ לכל מה שאני מבשל במטבח!</s>[INST] האם יש לך מתכונים למיונז? [/INST]"""
```
This format is available as a chat template via the `apply_chat_template()` method.
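The format rules above can be sketched as a small, dependency-free function. This is an illustration only: `build_prompt` is a hypothetical helper, not part of the model's tooling, and its exact whitespace (one space inside the `[INST]` markers, a newline before each assistant reply) is an assumption read off the example string. The tokenizer's chat template remains the authoritative implementation.

```python
# Literal special-token strings as they appear in the example above.
BOS, EOS = "<s>", "</s>"


def build_prompt(messages):
    """Render alternating user/assistant messages into the [INST] format.

    Hypothetical helper: whitespace handling is an assumption based on
    the example string, not a specification of the chat template.
    """
    parts = [BOS]  # the beginning-of-sentence marker appears only once
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            # Each assistant reply is closed by the end-of-sentence marker.
            parts.append(f"\n{message['content']}{EOS}")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Q1"},
    {"role": "assistant", "content": "A1"},
    {"role": "user", "content": "Q2"},
]
prompt = build_prompt(conversation)
print(repr(prompt))
```

A string built by hand may tokenize differently from the output of the chat template, so this sketch is best used for inspecting the format rather than for feeding the model.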
Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda"  # the model is loaded onto the GPU

model = AutoModelForCausalLM.from_pretrained("dicta-il/dictalm2.0-instruct", torch_dtype=torch.bfloat16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictalm2.0-instruct")

# Hebrew conversation; in English:
#   user:      "What is your favorite condiment?"
#   assistant: "Well, I quite like a few drops of freshly squeezed lemon juice. ..."
#   user:      "Do you have mayonnaise recipes?"
messages = [
    {"role": "user", "content": "איזה רוטב אהוב עליך?"},
    {"role": "assistant", "content": "טוב, אני די מחבב כמה טיפות מיץ לימון סחוט טרי. זה מוסיף בדיוק את הכמות הנכונה של טעם חמצמץ לכל מה שאני מבשל במטבח!"},
    {"role": "user", "content": "האם יש לך מתכונים למיונז?"}
]

# Render the conversation with the chat template and move the token ids to the GPU
encoded = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
# Sample up to 50 new tokens
generated_ids = model.generate(encoded, max_new_tokens=50, do_sample=True)
# Decode the full sequence (prompt plus generated reply)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
```
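For a causal language model, `generate` returns the prompt token ids followed by the newly generated ones, so `decoded[0]` above contains the whole conversation rather than just the reply. A minimal sketch of keeping only the reply follows; `continuation_only` is a hypothetical helper, and it is shown on made-up token ids so that it runs without downloading the model.

```python
def continuation_only(generated_ids, prompt_length):
    """Drop the first `prompt_length` token ids from every generated sequence."""
    return [sequence[prompt_length:] for sequence in generated_ids]


# Toy stand-ins for real token ids (the values are made up for illustration).
prompt_ids = [1, 733, 16289]
generated = [prompt_ids + [415, 2]]
reply_ids = continuation_only(generated, len(prompt_ids))
print(reply_ids)  # [[415, 2]]

# Possible use with the names from the Basic Usage snippet:
#   reply_ids = continuation_only(generated_ids.tolist(), encoded.shape[-1])
#   print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
```

Passing `skip_special_tokens=True` when decoding additionally removes markers such as the end-of-sentence token from the printed reply.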
Technical Details
Model Architecture
DictaLM-2.0-Instruct follows the [Zephyr-7B-beta](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) recipe for fine-tuning an instruct model, with an extended instruct dataset for Hebrew.
Documentation
Limitations
The DictaLM 2.0 Instruct model is a demonstration that the base model can be fine - tuned to achieve compelling performance.
It does not have any moderation mechanisms. We're looking forward to engaging with the community on ways to
make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.
License
This model is licensed under the Apache-2.0 license.
Citation
If you use this model, please cite:
```bibtex
@misc{shmidman2024adaptingllmshebrewunveiling,
      title={Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities},
      author={Shaltiel Shmidman and Avi Shmidman and Amir DN Cohen and Moshe Koppel},
      year={2024},
      eprint={2407.07080},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.07080},
}
```