🚀 Jais Family Model Card
The Jais family of models is a comprehensive series of bilingual English-Arabic large language models (LLMs). These models are optimized for Arabic while maintaining strong English capabilities. This release includes two types of foundation models: models pre-trained from scratch (`jais-family-*`) and models pre-trained adaptively from Llama-2 (`jais-adapted-*`). In total, 20 models across 8 sizes, with parameters ranging from 590M to 70B, are introduced, trained on up to 1.6T tokens of Arabic, English, and code data. All pre-trained models in this series are instruction fine-tuned (`*-chat`) for dialog using a curated mix of Arabic and English instruction data.
✨ Features
- Bilingual Excellence: Optimized for Arabic with strong English capabilities.
- Diverse Model Sizes: 20 models across 8 sizes, from 590M to 70B parameters.
- Extensive Training Data: Trained on up to 1.6T tokens of Arabic, English, and code data.
- Instruction Fine-Tuning: All pre-trained models are instruction fine-tuned for dialog.
📦 Installation
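The usage example below only requires PyTorch and Hugging Face `transformers` (for example, `pip install torch transformers`), and the checkpoints are loaded with `trust_remote_code=True`. This card does not pin package versions, so recent releases of both libraries are assumed.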
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/jais-family-2p7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)


def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move it to the same device as the model inputs.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    # Decode the full generated sequence (prompt + completion).
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response


# Arabic prompt: "The capital of the United Arab Emirates is ..."
text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))
```
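For the instruction-tuned `*-chat` checkpoints, prompts should follow the chat format expected by the model. The sketch below is a hedged example that relies on `tokenizer.apply_chat_template`; whether a given Jais chat checkpoint ships a built-in chat template is an assumption here, so fall back to the prompt format documented on the corresponding chat model card if it does not.

```python
# Hedged sketch for the *-chat checkpoints. The model id below is one of the
# chat variants listed in the tables further down; whether its tokenizer ships
# a built-in chat template is an assumption, not something stated in this card.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

chat_model_path = "inceptionai/jais-family-2p7b-chat"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(chat_model_path)
model = AutoModelForCausalLM.from_pretrained(chat_model_path, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "What is the capital of the UAE?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)
# Print only the newly generated tokens, skipping the prompt portion.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```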
📚 Documentation
Jais Family Details
| Property | Details |
|----------|---------|
| Developed by | Inception, Cerebras Systems |
| Language(s) (NLP) | Arabic (MSA) and English |
| Input | Text only data |
| Output | Model generates text |
| Model Sizes | 590M, 1.3B, 2.7B, 6.7B, 7B, 13B, 30B, 70B |
| Demo | Access the live demo here |
| License | Apache 2.0 |
Pre-trained Models
| Pre-trained Model | Fine-tuned Model | Size (Parameters) | Context length (Tokens) |
|---|---|---|---|
| [jais-family-30b-16k](https://huggingface.co/inceptionai/jais-family-30b-16k) | [Jais-family-30b-16k-chat](https://huggingface.co/inceptionai/jais-family-30b-16k-chat) | 30B | 16,384 |
| [jais-family-30b-8k](https://huggingface.co/inceptionai/jais-family-30b-8k) | [Jais-family-30b-8k-chat](https://huggingface.co/inceptionai/jais-family-30b-8k-chat) | 30B | 8,192 |
| [jais-family-13b](https://huggingface.co/inceptionai/jais-family-13b) | [Jais-family-13b-chat](https://huggingface.co/inceptionai/jais-family-13b-chat) | 13B | 2,048 |
| [jais-family-6p7b](https://huggingface.co/inceptionai/jais-family-6p7b) | [Jais-family-6p7b-chat](https://huggingface.co/inceptionai/jais-family-6p7b-chat) | 6.7B | 2,048 |
| [jais-family-2p7b](https://huggingface.co/inceptionai/jais-family-2p7b) | [Jais-family-2p7b-chat](https://huggingface.co/inceptionai/jais-family-2p7b-chat) | 2.7B | 2,048 |
| [jais-family-1p3b](https://huggingface.co/inceptionai/jais-family-1p3b) | [Jais-family-1p3b-chat](https://huggingface.co/inceptionai/jais-family-1p3b-chat) | 1.3B | 2,048 |
| [jais-family-590m](https://huggingface.co/inceptionai/jais-family-590m) | [Jais-family-590m-chat](https://huggingface.co/inceptionai/jais-family-590m-chat) | 590M | 2,048 |
Adapted Pre-trained Models
| Adapted pre-trained Model | Fine-tuned Model | Size (Parameters) | Context length (Tokens) |
|---|---|---|---|
| [jais-adapted-70b](https://huggingface.co/inceptionai/jais-adapted-70b) | [Jais-adapted-70b-chat](https://huggingface.co/inceptionai/jais-adapted-70b-chat) | 70B | 4,096 |
| [jais-adapted-13b](https://huggingface.co/inceptionai/jais-adapted-13b) | [Jais-adapted-13b-chat](https://huggingface.co/inceptionai/jais-adapted-13b-chat) | 13B | 4,096 |
| [jais-adapted-7b](https://huggingface.co/inceptionai/jais-adapted-7b) | [Jais-adapted-7b-chat](https://huggingface.co/inceptionai/jais-adapted-7b-chat) | 7B | 4,096 |
Model Architecture
All models in this family are auto-regressive language models that use a transformer-based, decoder-only architecture (GPT-3). Jais models (`jais-family-*`) are trained from scratch, incorporating the SwiGLU non-linear activation function and ALiBi position encoding. Jais adapted models (`jais-adapted-*`) are built on top of Llama-2, which employs RoPE position embeddings and Grouped Query Attention. Tokenizer expansion with Arabic data is introduced, improving fertility and compute efficiency by over 3x.
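As a rough illustration of the two mechanisms named above for the from-scratch models, the sketch below implements a SwiGLU feed-forward block and ALiBi attention biases in plain PyTorch. It is a minimal reference sketch, not code from the Jais repositories; the hidden sizes, head count, and the geometric ALiBi slope schedule follow the original SwiGLU and ALiBi papers rather than any Jais-specific configuration.

```python
# Minimal, illustrative PyTorch sketches of SwiGLU and ALiBi (not Jais code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block with the SwiGLU non-linearity: (Swish(x W1) * x W2) W3."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w3 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear attention biases: slope_h * -(distance to the attended key)."""
    # Geometric slope schedule from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = i - j, i.e. how far back key j lies from query i.
    distance = torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]
    # Future positions (negative distance) are handled by the causal mask; zero them here.
    bias = slopes[:, None, None] * -distance.clamp(min=0).float()
    return bias  # shape: (n_heads, seq_len, seq_len), added to attention logits

x = torch.randn(2, 16, 512)                 # placeholder batch of hidden states
print(SwiGLU(512, 1376)(x).shape)           # torch.Size([2, 16, 512])
print(alibi_bias(8, 16).shape)              # torch.Size([8, 16, 16])
```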
Training Details
Pretraining Data
The Jais family of models is trained on up to 1.6 trillion tokens of diverse English, Arabic, and code data from web, code, books, scientific, and synthetic sources.
| Pre-trained model | English data (tokens) | Arabic data (tokens) | Code data (tokens) | Total data (tokens) |
|---|---|---|---|---|
| [jais-family-30b-16k](https://huggingface.co/inceptionai/jais-family-30b-16k) | 980B | 490B | 196B | 1666B |
| [jais-family-30b-8k](https://huggingface.co/inceptionai/jais-family-30b-8k) | 882B | 441B | 177B | 1500B |
| [jais-family-13b](https://huggingface.co/inceptionai/jais-family-13b) | 283B | 141B | 56B | 480B |
| [jais-family-6p7b](https://huggingface.co/inceptionai/jais-family-6p7b) | 283B | 141B | 56B | 480B |
| [jais-family-2p7b](https://huggingface.co/inceptionai/jais-family-2p7b) | 283B | 141B | 56B | 480B |
| [jais-family-1p3b](https://huggingface.co/inceptionai/jais-family-1p3b) | 283B | 141B | 56B | 480B |
| [jais-family-590m](https://huggingface.co/inceptionai/jais-family-590m) | 283B | 141B | 56B | 480B |
| [jais-adapted-70b](https://huggingface.co/inceptionai/jais-adapted-70b) | 33B | 334B | 4B | 371B |
| [jais-adapted-13b](https://huggingface.co/inceptionai/jais-adapted-13b) | 127B | 140B | 13B | 280B |
| [jais-adapted-7b](https://huggingface.co/inceptionai/jais-adapted-7b) | 18B | 19B | 2B | 39B |
Fine-tuning Data
All chat models in the Jais family are fine-tuned using Arabic and English prompt-response pairs in single-turn and multi-turn settings. Data sources include open-source fine-tuning datasets and internally curated human data, supplemented with synthetic content.
Training Procedure
- Pre-training (`jais-family-*`): Documents are packed into sequences separated by EOS tokens, and the model is trained autoregressively.
- Adapted pre-training (`jais-adapted-*`): A two-stage approach is used to train the new tokenizer and Arabic embeddings.
- Instruction tuning: Examples are packed together, and the loss is masked on the prompt tokens (see the sketch after this list).
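The following is a minimal sketch of the two data-handling ideas above: packing tokenized documents into fixed-length sequences separated by EOS tokens, and masking the loss on prompt tokens during instruction tuning. The token ids, the -100 ignore index (the usual PyTorch cross-entropy convention), and the helper names are illustrative assumptions, not the authors' training code.

```python
# Illustrative sketch of document packing and prompt-loss masking.
from typing import List, Tuple

EOS_ID = 0           # placeholder EOS token id
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def pack_documents(docs: List[List[int]], seq_len: int) -> List[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into seq_len chunks."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc + [EOS_ID])
    # Drop the trailing partial chunk for simplicity.
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]

def mask_prompt(prompt_ids: List[int], response_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Build (input_ids, labels) so only response tokens contribute to the loss."""
    input_ids = prompt_ids + response_ids + [EOS_ID]
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [EOS_ID]
    return input_ids, labels

print(pack_documents([[5, 6, 7], [8, 9]], seq_len=3))  # [[5, 6, 7], [0, 8, 9]]
print(mask_prompt([11, 12], [21, 22]))                 # ([11, 12, 21, 22, 0], [-100, -100, 21, 22, 0])
```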
Training Hyperparameters: Jais-family-2p7b
| Hyperparameter | Value |
|---|---|
| Precision | fp32 |
| Optimizer | AdamW |
| Learning rate | 0 to 0.01563 (<= 127 warmup steps); 0.01563 to 0.000178 (> 127 and <= 162883 steps) |
| Weight decay | 0.1 |
| Batch size | 1440 |
| Context Length | 2048 |
| Steps | 162883 |
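The learning-rate row above describes a warmup from 0 to the peak value over the first 127 steps, followed by a decay to 0.000178 by step 162883. The small sketch below reproduces that schedule assuming linear interpolation in both phases; the exact decay shape is not stated in this card, so the interpolation choice is an assumption.

```python
# Piecewise learning-rate schedule for jais-family-2p7b, as read off the table
# above. Linear interpolation in both phases is an assumption; only the
# endpoint values and step boundaries are given in the card.
WARMUP_STEPS = 127
TOTAL_STEPS = 162_883
PEAK_LR = 0.01563
FINAL_LR = 0.000178

def learning_rate(step: int) -> float:
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS          # linear warmup to the peak
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR + frac * (FINAL_LR - PEAK_LR)      # linear decay to the final value

for s in (0, 127, 81_505, 162_883):
    print(s, round(learning_rate(s), 6))
```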
Compute Infrastructure
The training process was performed on the Condor Galaxy (CG) supercomputer platform, which contains 64 Cerebras CS-2 Wafer-Scale Engines (WSE-2) with 40 GB of SRAM and achieves a total of 960 PetaFLOP/s.
🔧 Technical Details
The from-scratch models pair SwiGLU activations with ALiBi position encoding, while the adapted models inherit RoPE position embeddings and Grouped Query Attention from Llama-2. Together with the Arabic-expanded tokenizer and the bilingual pre-training and fine-tuning mixes described above, these choices are aimed at strong performance in both Arabic and English.
📄 License
The Jais family of models is released under the Apache 2.0 license.