🚀 Jais-30b-v1
Jais-30b-v1 is a pre-trained bilingual large language model with 30 billion parameters, supporting both Arabic and English. It offers high performance in various language tasks and is suitable for research and commercial applications.
🚀 Quick Start
Below is sample code to use the model. Note that the model requires a custom model class, so users must enable trust_remote_code=True while loading the model. Also, note that this code was tested on transformers==4.32.0.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "core42/jais-30b-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True is required because the model uses a custom model class.
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)


def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move the token IDs to the selected device.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # Sample a continuation; max_length and min_length count prompt tokens too.
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=200,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    # Decode the first (and only) sequence in the batch back to text.
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response


# Arabic prompt
text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))

# English prompt
text = "The capital of UAE is"
print(get_response(text))
```
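As a rough guide to hardware needs when loading the model, the sketch below estimates the memory taken by the weights alone at a few numeric precisions. It assumes exactly 30e9 parameters (the rounded figure from the model name, not an exact count) and ignores activations, the KV cache, and framework overhead. `weight_memory_gib` is a hypothetical helper written for this illustration, not part of transformers.

```python
# Rough weight-memory estimate. PARAMS is an assumption: the rounded
# "30 billion" figure, not the exact parameter count of the checkpoint.
PARAMS = 30e9

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16/fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gib(PARAMS, nbytes):.0f} GiB for weights only")
```

Actual memory use will be higher than these figures, so treat them as a lower bound when choosing hardware.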
✨ Features
- Bilingual Support: Optimized for both Arabic and English, enabling seamless communication in these two languages.
- Transformer-based Architecture: Built on a decoder-only (GPT-3) architecture with SwiGLU non-linearity, ensuring high performance.
- ALiBi Position Embeddings: Allows the model to handle long sequences effectively, improving context handling and precision.
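To illustrate the ALiBi idea mentioned above, here is a minimal sketch of the standard formulation: each attention head adds a linear penalty to its attention scores that grows with the distance between query and key, instead of using learned position embeddings. The head count used here and the helper names `alibi_slopes` and `alibi_bias` are assumptions for illustration; the exact configuration in Jais-30b is not stated in this card.

```python
from typing import List


def alibi_slopes(num_heads: int) -> List[float]:
    """Per-head slopes 2^(-8*i/n) for i = 1..n; assumes n is a power of two."""
    return [2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Causal bias matrix for one head.

    Entry [q][k] is -slope * (q - k) for keys at or before the query,
    and -inf for future keys, which are masked out.
    """
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Hypothetical 8-head example: slopes halve from 1/2 down to 1/256.
slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Because the penalty depends only on relative distance, the same rule applies to sequence lengths not seen in training, which is the usual argument for ALiBi's handling of long inputs.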
📚 Documentation
Model Details
Intended Use
We release the Jais 30B model under a full open source license and welcome all feedback and collaboration opportunities.
Potential Downstream Uses
- Research: Suitable for researchers and developers in the field of natural language processing.
- Commercial Use: Can be used as a base model for further fine-tuning in specific scenarios, such as chat-assistants and customer service.
Target Audiences
- Academics: Those researching Arabic natural language processing.
- Businesses: Companies targeting Arabic-speaking audiences.
- Developers: Those integrating Arabic language capabilities into apps.
Out-of-Scope Use
While Jais-30b is a powerful bilingual model, it has limitations and potential for misuse.
- Malicious Use: Prohibited from generating harmful, misleading, or inappropriate content, including hate speech, misinformation, and illegal activities.
- Sensitive Information: Should not be used to handle or generate personal, confidential, or sensitive information.
- Generalization Across All Languages: Not assumed to have equal proficiency in other languages or dialects.
- High-Stakes Decisions: Should not be used for high-stakes decisions without human oversight.
Bias, Risks, and Limitations
The model is trained on publicly available data, curated in part by Inception. Although efforts have been made to reduce bias, the model may still exhibit some bias, as is the case with all LLMs.
It is designed as an AI assistant for Arabic and English speakers and may not produce appropriate responses for other languages.
By using Jais, you acknowledge that it may generate incorrect, misleading, or offensive information. The information is not intended as advice, and we are not responsible for its content or consequences. We welcome feedback to improve the model.
Training Details
Training Data
For pre-training Jais-30b, we used a diverse bilingual corpus from the web and other sources, as well as publicly available English and code datasets. Arabic data was collected from multiple sources, including web pages, Wikipedia articles, news articles, Arabic books, and social network content. We augmented the Arabic data by translating high-quality English resources using an in-house machine translation system. Our data acquisition strategy is similar to that of Jais-13b.
Training Procedure
Training was performed on the Condor Galaxy 1 (CG-1) supercomputer platform.
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Precision | fp32 |
| Optimizer | AdamW |
| Learning rate | 0 to 0.012 (<= 69 steps); 0.012 to 0.005 (> 69 & < 70k steps); 0.005 to 0.0008 (>70k - 79k) |
| Weight decay | 0.1 |
| Batch size | 2640 |
| Steps | 79k |
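The learning-rate entry describes a three-phase schedule: a short warmup followed by two decay phases. A minimal sketch of it follows, assuming linear interpolation inside each phase. The card gives only the endpoints, so the shape of each segment is an assumption, and `learning_rate` is a hypothetical helper.

```python
def _lerp(x: float, x0: float, x1: float, y0: float, y1: float) -> float:
    """Linearly interpolate between (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def learning_rate(step: int) -> float:
    """Piecewise schedule built from the endpoints in the table.

    Assumption: each phase is linear; the source does not state the shape.
    """
    if step <= 69:
        # Warmup: 0 -> 0.012 over the first 69 steps.
        return _lerp(step, 0, 69, 0.0, 0.012)
    if step <= 70_000:
        # First decay: 0.012 -> 0.005 up to step 70k.
        return _lerp(step, 69, 70_000, 0.012, 0.005)
    # Second decay: 0.005 -> 0.0008 up to step 79k (clamped afterwards).
    return _lerp(min(step, 79_000), 70_000, 79_000, 0.005, 0.0008)


for s in (0, 69, 35_000, 70_000, 79_000):
    print(s, learning_rate(s))
```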
Evaluation
We conducted a comprehensive evaluation of Jais and benchmarked it against other leading base language models in both English and Arabic. The evaluation criteria included knowledge, reasoning, and susceptibility to misinformation/bias.
Arabic Evaluation Results
| Models | Avg | EXAMS | MMLU (M) | LitQA | Hellaswag | PIQA | BoolQA | SituatedQA | ARC-C | OpenBookQA | TruthfulQA | CrowS-Pairs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jais (30B) | 47.8 | 40 | 30.8 | 58.3 | 60.1 | 70 | 68.7 | 43.3 | 38.5 | 32.2 | 42.6 | 56.9 |
| Jais (13B) | 46.5 | 40.4 | 30.0 | 58.3 | 57.7 | 67.6 | 62.6 | 42.5 | 35.8 | 32.4 | 41.1 | 58.4 |
| acegpt-13b | 42.5 | 34.7 | 29.9 | 42.3 | 45.6 | 60.3 | 63.2 | 38.1 | 32.8 | 32.2 | 45.1 | 56.4 |
| acegpt-7b | 42.4 | 35.4 | 29 | 46.3 | 43.8 | 60.4 | 63.4 | 37.2 | 31.1 | 32 | 45.3 | 55.4 |
| BLOOM (7.1B) | 40.9 | 34.0 | 28.2 | 37.1 | 40.9 | 58.4 | 59.9 | 39.1 | 27.3 | 28.0 | 44.4 | 53.5 |
| LLaMA (30B) | 38.8 | 27.9 | 28.5 | 32.6 | 35 | 52.7 | 63.7 | 34.9 | 25.7 | 28.6 | 47.2 | 49.8 |
| LLaMA2 (13B) | 38.1 | 29.2 | 28.4 | 32.0 | 34.3 | 52.9 | 63.8 | 36.4 | 24.3 | 30.0 | 45.5 | 49.9 |
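To see where the larger model changes the Arabic results most, the sketch below computes the per-benchmark difference between the Jais (30B) and Jais (13B) rows. The numbers are copied from the table above and nothing else is assumed. A positive delta only means a higher reported number; the card does not state the scoring direction of each benchmark.

```python
# Per-benchmark scores copied from the Arabic results table.
BENCHMARKS = [
    "EXAMS", "MMLU (M)", "LitQA", "Hellaswag", "PIQA", "BoolQA",
    "SituatedQA", "ARC-C", "OpenBookQA", "TruthfulQA", "CrowS-Pairs",
]
JAIS_30B = [40, 30.8, 58.3, 60.1, 70, 68.7, 43.3, 38.5, 32.2, 42.6, 56.9]
JAIS_13B = [40.4, 30.0, 58.3, 57.7, 67.6, 62.6, 42.5, 35.8, 32.4, 41.1, 58.4]

# Difference (30B minus 13B), rounded to one decimal like the table.
deltas = {
    name: round(big - small, 1)
    for name, big, small in zip(BENCHMARKS, JAIS_30B, JAIS_13B)
}

# Print the benchmarks from largest gain to largest drop.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {delta:+.1f}")
```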
English Evaluation Results
| Models | Avg | MMLU | RACE | Hellaswag | PIQA | BoolQA | SituatedQA | ARC-C | OpenBookQA | Winogrande | TruthfulQA | CrowS-Pairs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Jais (30B) | 56.2 | 34.5 | 39.8 | 75.1 | 79.5 | 74.3 | 49.9 | 45.9 | 41.2 | 68.4 | 36.5 | 73.3 |
| Jais (13B) | 53.9 | 31.5 | 38.3 | 71.8 | 77.9 | 67.6 | 48.2 | 41.9 | 40.6 | 68.4 | 35.4 | 71.5 |
| OPT-30b | 59.4 | 38.6 | 45.2 | 71.7 | 78.5 | 87.3 | 63.4 | 44.8 | 40.2 | 72.2 | 38.7 | 72.7 |
| MPT-30b | 57.3 | 38.8 | 39.7 | 80 | 80.8 | 73.9 | 45.6 | 49.2 | 43.2 | 71.1 | 38.3 | 69.3 |
| Llama-30b | 55.4 | 37 | 40.2 | 79.2 | 80.1 | 68.3 | 44 | 45.3 | 42 | 72.7 | 42.3 | 58.2 |
| Falcon (40B) | 54.8 | 31.3 | 37.1 | 76.4 | 80.5 | 73.7 | 43.2 | 43.6 | 44.2 | 67.2 | 34.3 | 72.3 |
Citation
@misc{sengupta2023jais,
title={Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models},
author={Neha Sengupta and Sunil Kumar Sahu and Bokang Jia and Satheesh Katipomu and Haonan Li and Fajri Koto and Osama Mohammed Afzal and Samta Kamboj and Onkar Pandit and Rahul Pal and Lalit Pradhan and Zain Muhammad Mujahid and Massa Baali and Alham Fikri Aji and Zhengzhong Liu and Andy Hock and Andrew Feldman and Jonathan Lee and Andrew Jackson and Preslav Nakov and Timothy Baldwin and Eric Xing},
year={2023},
eprint={2308.16149},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Copyright Inception Institute of Artificial Intelligence Ltd.