Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
The DictaLM-2.0 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters, trained to specialize in Hebrew text. This project adapts large language models to Hebrew, offering an enhanced vocabulary and instruction capabilities.

For full details of this model, please read our release blog post or the technical report. This is the full-precision base model. You can view and access the full collection of base/instruct, unquantized/quantized versions of DictaLM-2.0 here.
Quick Start
DictaLM-2.0 is a powerful tool for Hebrew text generation. You can start using it right away with the code examples below.
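A minimal end-to-end call looks like the sketch below; the short Hebrew prompt (meaning "The capital of Israel is") is an illustrative assumption, not taken from the model card:

from transformers import pipeline
import torch

# Minimal quick start: load the base model and complete a short Hebrew prompt
generator = pipeline('text-generation', 'dicta-il/dictalm2.0', torch_dtype=torch.bfloat16, device_map='cuda')
print(generator('בירת ישראל היא', max_new_tokens=5))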
Features
- Specialized in Hebrew text generation.
- Based on the Mistral-7B-v0.1 model with an extended tokenizer for Hebrew.
- Continued pretraining on a large corpus of naturally occurring text (190B tokens, 50% Hebrew and 50% English).
Usage Examples
Basic Usage
from transformers import pipeline
import torch

# Load DictaLM-2.0 as a text-generation pipeline in bfloat16 on the GPU
model = pipeline('text-generation', 'dicta-il/dictalm2.0', torch_dtype=torch.bfloat16, device_map='cuda')
prompt = """
עבר: הלכתי
עתיד: אלך
עבר: שמרתי
עתיד: אשמור
עבר: שמעתי
עתיד: אשמע
עבר: הבנתי
עתיד:
"""
print(model(prompt.strip(), do_sample=False, max_new_tokens=8, stop_sequence='\n'))
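The call above uses greedy decoding (do_sample=False). The inference parameters listed at the end of this card include temperature: 0.7; a sampled variant of the same call, shown here as a sketch reusing the model and prompt defined above, would be:

# Sampled decoding with the temperature listed under this card's inference parameters
print(model(prompt.strip(), do_sample=True, temperature=0.7, max_new_tokens=8, stop_sequence='\n'))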
Advanced Usage
Pre-quantized 4-bit models are already available using the GPTQ and AWQ methods: DictaLM-2.0-AWQ and DictaLM-2.0-GPTQ.
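The pre-quantized checkpoints load through transformers like any other model. The sketch below assumes the repository id dicta-il/dictalm2.0-AWQ and that the autoawq package is installed; check the collection linked above for the exact repo names:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for the pre-quantized AWQ checkpoint; requires the autoawq package
quantized_repo = 'dicta-il/dictalm2.0-AWQ'
model_awq = AutoModelForCausalLM.from_pretrained(quantized_repo, device_map='cuda')
tokenizer_awq = AutoTokenizer.from_pretrained(quantized_repo)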
For dynamic quantization on the fly, here is sample code that loads the model onto the GPU in 4-bit precision using the bitsandbytes package:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model in 4-bit precision on the GPU (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm2.0', torch_dtype=torch.bfloat16, device_map='cuda', load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictalm2.0')
prompt = """
עבר: הלכתי
עתיד: אלך
עבר: שמרתי
עתיד: אשמור
עבר: שמעתי
עתיד: אשמע
עבר: הבנתי
עתיד:
"""
encoded = tokenizer(prompt.strip(), return_tensors='pt').to(model.device)
print(tokenizer.batch_decode(model.generate(**encoded, do_sample=False, max_new_tokens=4)))
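Recent transformers releases prefer passing the 4-bit options through a BitsAndBytesConfig instead of the bare load_in_4bit argument. An equivalent sketch, assuming a transformers version that ships BitsAndBytesConfig and an installed bitsandbytes:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Equivalent 4-bit loading expressed through an explicit quantization config
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm2.0', quantization_config=quant_config, device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictalm2.0')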
Technical Details
DictaLM-2.0 is based on the Mistral-7B-v0.1 model with the following changes:
- An extended tokenizer with 1,000 injected tokens specifically for Hebrew, improving the compression rate from 5.78 tokens/word to 2.76 tokens/word (see the tokenizer sketch after this list).
- Continued pretraining on over 190B tokens of naturally occurring text, 50% Hebrew and 50% English.
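A rough way to observe the effect of the extended vocabulary is to tokenize a Hebrew sentence and divide tokens by words. The sentence below is an illustrative assumption, and the result is a spot check, not the corpus-level figure quoted above:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('dicta-il/dictalm2.0')

# Illustrative Hebrew sentence: "The model was trained on text in Hebrew and English"
text = 'המודל אומן על טקסט בעברית ובאנגלית'
n_tokens = len(tok(text)['input_ids'])
n_words = len(text.split())
print(f'{n_tokens} tokens / {n_words} words = {n_tokens / n_words:.2f} tokens per word')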
License
This project is licensed under the Apache-2.0 license.
Documentation
Notice
DictaLM 2.0 is a pretrained base model and therefore does not have any moderation mechanisms.
Citation
If you use this model, please cite:
@misc{shmidman2024adaptingllmshebrewunveiling,
title={Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities},
author={Shaltiel Shmidman and Avi Shmidman and Amir DN Cohen and Moshe Koppel},
year={2024},
eprint={2407.07080},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.07080},
}
| Property | Details |
|----------|---------|
| Model Type | Pretrained generative text model |
| Training Data | Over 190B tokens of naturally occurring text (50% Hebrew and 50% English) |
| Pipeline Tag | text-generation |
| Tags | pretrained |
| Inference Parameters | temperature: 0.7 |
| License | Apache-2.0 |
| Languages Supported | en, he |