Jais-30b-v1: Open-Source Large Language Model for Arabic and English, State-of-the-Art on Arabic Tasks

Jais 30b V1

Developed by inceptionai

JAIS-30B is a 30-billion-parameter bilingual (Arabic and English) large language model based on the GPT-3 architecture. It uses ALiBi position embeddings and achieves state-of-the-art performance on Arabic tasks.

Large Language Model

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Arabic Large Language Model #Bilingual Generation #30 Billion Parameters

Downloads 37

Release Date: 10/27/2023

Model Overview

This is a 30-billion-parameter large language model optimized for Arabic and English. It uses a decoder-only Transformer architecture, supports long-sequence processing, and is suited to tasks such as text generation.

Model Features

Bilingual Capability

Specially optimized for Arabic and English, excelling in Arabic tasks

Long Sequence Processing

Uses ALiBi (Attention with Linear Biases) position embeddings, which support longer context windows

High-Performance Architecture

Built on a GPT-3-style decoder-only architecture with the SwiGLU activation function
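As a rough illustration of the SwiGLU activation mentioned above, here is a minimal pure-Python sketch of a SwiGLU feed-forward block. The function names (`swish`, `matvec`, `swiglu_ffn`) and the tiny weight matrices are hypothetical and chosen for readability; this is not the Jais implementation, which runs on tensors inside the custom model class.

```python
import math


def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (swish(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = [swish(g) for g in matvec(x, w_gate)]   # gated branch
    up = matvec(x, w_up)                           # linear branch
    hidden = [g * u for g, u in zip(gate, up)]     # element-wise product
    return matvec(hidden, w_down)


# Toy example with 2x2 identity weights (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
print(swiglu_ffn([1.0, 2.0], identity, identity, identity))
```

The gating branch lets the network scale each hidden unit by a smooth, input-dependent factor, which is the main difference from a plain two-layer feed-forward block with a single activation.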

Model Capabilities

Arabic Text Generation

English Text Generation

Factual Question Answering

Reasoning Task Processing

Use Cases

Research

Arabic NLP Research

Used for research in the field of Arabic natural language processing

Achieves state-of-the-art performance in Arabic evaluation benchmarks

Commercial Applications

Chat Assistant

Can serve as a base model for developing Arabic chatbots

Customer Service

Used to handle queries and service requests from Arabic-speaking customers

🚀 Jais-30b-v1

Jais-30b-v1 is a pre-trained bilingual large language model with 30 billion parameters, supporting both Arabic and English. It offers high performance in various language tasks and is suitable for research and commercial applications.

🚀 Quick Start

Below is sample code for using the model. The model requires a custom model class, so you must pass trust_remote_code=True when loading it. This code was tested with transformers==4.32.0.

# -*- coding: utf-8 -*-

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path = "core42/jais-30b-v1"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)


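def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move it to the device selected above.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # Sample a continuation. max_length and min_length count the prompt
    # tokens as well as the newly generated ones.
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=200,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    # Decode the full sequence (prompt plus continuation) back to text.
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response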


text= "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))
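To make the sampling settings in the sample code more concrete (temperature=0.3, top_p=0.9, do_sample=True), here is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy set of logits. The helper names `top_p_filter` and `sample` are hypothetical, the logits are made up, and the sketch ignores repetition_penalty; the actual logic inside transformers differs in its details.

```python
import math
import random


def top_p_filter(logits, temperature=0.3, top_p=0.9):
    """Return {token_index: probability} for the nucleus after temperature scaling."""
    # Lower temperature sharpens the distribution before the softmax.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens only.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample(logits, temperature=0.3, top_p=0.9, rng=random):
    """Draw one token index from the filtered distribution."""
    nucleus = top_p_filter(logits, temperature, top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # illustrative values, not model output
print(top_p_filter(toy_logits, temperature=1.0))  # keeps the two most likely tokens
print(top_p_filter(toy_logits, temperature=0.3))  # keeps only the most likely token
```

With these toy logits, the low temperature concentrates almost all probability on one token, so sampling behaves much like greedy decoding; raising the temperature widens the nucleus and yields more varied output.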

✨ Features

Bilingual Support: Optimized for both Arabic and English, enabling seamless communication in these two languages.
Transformer-based Architecture: Built on a decoder-only (GPT-3) architecture with SwiGLU non-linearity, ensuring high performance.
ALiBi Position Embeddings: Encode position as a distance-based bias on attention scores, which helps the model handle long sequences and improves context handling.
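The ALiBi idea can be sketched in a few lines: instead of adding position vectors to token embeddings, each attention head adds a penalty to its attention scores that grows linearly with the distance between query and key. The sketch below follows the slope formula from the ALiBi paper for a power-of-two number of heads. The head count of 8 is an illustrative assumption (the card does not state the head count of Jais-30b), and the function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., assuming num_heads is a power of two."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Causal ALiBi bias for one head.

    bias[i][j] = -slope * (i - j) for keys at or before the query (j <= i);
    future positions (j > i) are masked with -inf.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)                # illustrative head count
bias = alibi_bias(4, slopes[0])         # bias matrix for the first head
for row in bias:
    print(row)
```

The bias matrix is added to the query-key scores before the softmax. Because the penalty depends only on relative distance, the same rule applies at sequence lengths not seen during training.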

📚 Documentation

Model Details

Property	Details
Developed by	Core42 (Inception), Cerebras Systems
Language(s) (NLP)	Arabic and English
License	Apache 2.0
Input	Text only data
Output	Model generates text
Paper	Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Blog	Access here
Demo	Access here

Intended Use

We release the Jais 30B model under a fully open-source license (Apache 2.0) and welcome all feedback and collaboration opportunities.

Potential Downstream Uses

Research: Suitable for researchers and developers in the field of natural language processing.
Commercial Use: Can be used as a base model for further fine-tuning in specific scenarios, such as chat-assistants and customer service.

Target Audiences

Academics: Those researching Arabic natural language processing.
Businesses: Companies targeting Arabic-speaking audiences.
Developers: Those integrating Arabic language capabilities into apps.

Out-of-Scope Use

While Jais-30b is a powerful bilingual model, it has limitations and potential for misuse.

Malicious Use: The model must not be used to generate harmful, misleading, or inappropriate content, including hate speech, misinformation, and content that promotes illegal activities.
Sensitive Information: The model should not be used to handle or generate personal, confidential, or sensitive information.
Generalization Across All Languages: The model is optimized for Arabic and English and should not be assumed to have equal proficiency in other languages or dialects.
High-Stakes Decisions: The model should not be used to make high-stakes decisions without human oversight.

Bias, Risks, and Limitations

The model is trained on publicly available data curated in part by Inception. Although efforts have been made to reduce bias, the model may still exhibit some bias, as is the case with all LLMs.

It is designed as an AI assistant for Arabic and English speakers and may not produce appropriate responses for other languages.

By using Jais, you acknowledge that it may generate incorrect, misleading, or offensive information. The information is not intended as advice, and we are not responsible for its content or consequences. We welcome feedback to improve the model.

Training Details

Training Data

For pre-training Jais-30b, we used a diverse bilingual corpus from the web and other sources, as well as publicly available English and code datasets. Arabic data was collected from multiple sources, including web pages, Wikipedia articles, news articles, Arabic books, and social network content. We augmented the Arabic data by translating high-quality English resources using an in-house machine translation system. Our data acquisition strategy is similar to that of Jais-13b.

Training Procedure

Training was performed on the Condor Galaxy 1 (CG-1) supercomputer platform.

Training Hyperparameters

Hyperparameter	Value
Precision	fp32
Optimizer	AdamW
Learning rate	0 to 0.012 (steps <= 69); 0.012 to 0.005 (69 < steps < 70k); 0.005 to 0.0008 (steps 70k to 79k)
Weight decay	0.1
Batch size	2640
Steps	79k
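The learning-rate row in the table above describes three phases: a short warmup, a long decay, and a final decay. The following sketch turns that row into a function of the training step. It assumes each phase interpolates linearly between its endpoints, which the table does not state; the actual decay shape used in training may differ. The function name `learning_rate` is hypothetical.

```python
# (start_step, end_step, lr_at_start, lr_at_end) for each phase in the table.
SEGMENTS = [
    (0, 69, 0.0, 0.012),             # warmup
    (69, 70_000, 0.012, 0.005),      # main decay
    (70_000, 79_000, 0.005, 0.0008), # final decay
]


def learning_rate(step):
    """Learning rate at a given step, ASSUMING linear interpolation within each phase."""
    for start, end, lr_start, lr_end in SEGMENTS:
        if step <= end:
            fraction = (step - start) / (end - start)
            return lr_start + fraction * (lr_end - lr_start)
    # Past the last listed step, hold the final value.
    return SEGMENTS[-1][3]


for step in (0, 69, 35_000, 70_000, 79_000):
    print(step, learning_rate(step))
```

Under this assumption the rate peaks at 0.012 after 69 steps and ends at 0.0008 at step 79k, matching the endpoints given in the table.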

Evaluation

We conducted a comprehensive evaluation of Jais and benchmarked it against other leading base language models in both English and Arabic. The evaluation criteria included knowledge, reasoning, and susceptibility to misinformation/bias.

Arabic Evaluation Results

Models	Avg	EXAMS	MMLU (M)	LitQA	Hellaswag	PIQA	BoolQA	SituatedQA	ARC-C	OpenBookQA	TruthfulQA	CrowS-Pairs
Jais (30B)	47.8	40	30.8	58.3	60.1	70	68.7	43.3	38.5	32.2	42.6	56.9
Jais (13B)	46.5	40.4	30.0	58.3	57.7	67.6	62.6	42.5	35.8	32.4	41.1	58.4
acegpt-13b	42.5	34.7	29.9	42.3	45.6	60.3	63.2	38.1	32.8	32.2	45.1	56.4
acegpt-7b	42.4	35.4	29	46.3	43.8	60.4	63.4	37.2	31.1	32	45.3	55.4
BLOOM (7.1B)	40.9	34.0	28.2	37.1	40.9	58.4	59.9	39.1	27.3	28.0	44.4	53.5
LLaMA (30B)	38.8	27.9	28.5	32.6	35	52.7	63.7	34.9	25.7	28.6	47.2	49.8
LLaMA2 (13B)	38.1	29.2	28.4	32.0	34.3	52.9	63.8	36.4	24.3	30.0	45.5	49.9

English Evaluation Results

Models	Avg	MMLU	RACE	Hellaswag	PIQA	BoolQA	SituatedQA	ARC-C	OpenBookQA	Winogrande	TruthfulQA	CrowS-Pairs
Jais (30B)	56.2	34.5	39.8	75.1	79.5	74.3	49.9	45.9	41.2	68.4	36.5	73.3
Jais (13B)	53.9	31.5	38.3	71.8	77.9	67.6	48.2	41.9	40.6	68.4	35.4	71.5
OPT-30b	59.4	38.6	45.2	71.7	78.5	87.3	63.4	44.8	40.2	72.2	38.7	72.7
MPT-30b	57.3	38.8	39.7	80	80.8	73.9	45.6	49.2	43.2	71.1	38.3	69.3
Llama-30b	55.4	37	40.2	79.2	80.1	68.3	44	45.3	42	72.7	42.3	58.2
Falcon (40B)	54.8	31.3	37.1	76.4	80.5	73.7	43.2	43.6	44.2	67.2	34.3	72.3

Citation

@misc{sengupta2023jais,
      title={Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models}, 
      author={Neha Sengupta and Sunil Kumar Sahu and Bokang Jia and Satheesh Katipomu and Haonan Li and Fajri Koto and Osama Mohammed Afzal and Samta Kamboj and Onkar Pandit and Rahul Pal and Lalit Pradhan and Zain Muhammad Mujahid and Massa Baali and Alham Fikri Aji and Zhengzhong Liu and Andy Hock and Andrew Feldman and Jonathan Lee and Andrew Jackson and Preslav Nakov and Timothy Baldwin and Eric Xing},
      year={2023},
      eprint={2308.16149},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
