🚀 Jais Family Model Card
The Jais family of models is a comprehensive series of bilingual English-Arabic large language models (LLMs). These models are optimized for Arabic while maintaining strong English capabilities. This release includes two types of foundation models: models pre-trained from scratch (`jais-family-*`) and models pre-trained adaptively from Llama-2 (`jais-adapted-*`). In total, 20 models across 8 sizes, with parameters ranging from 590M to 70B, are introduced, trained on up to 1.6T tokens of Arabic, English, and code data. All pre-trained models in this series are instruction fine-tuned (`*-chat`) for dialog using a curated mix of Arabic and English instruction data.
✨ Features
- Bilingual Excellence: Optimized for Arabic with strong English capabilities.
- Diverse Model Sizes: 20 models across 8 sizes, from 590M to 70B parameters.
- Extensive Training Data: Trained on up to 1.6T tokens of Arabic, English, and code data.
- Instruction Fine-Tuning: All pre-trained models are instruction fine-tuned for dialog.
📦 Installation
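The usage example below only requires PyTorch and Hugging Face `transformers` (for example, `pip install torch transformers`), and the checkpoints are loaded with `trust_remote_code=True`. This card does not pin package versions, so recent releases of both libraries are assumed.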
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/jais-family-2p7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)


def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move it to the same device as the model inputs.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    # Decode the full generated sequence (prompt + completion).
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response


# Arabic prompt: "The capital of the United Arab Emirates is ..."
text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))
```
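For the instruction-tuned `*-chat` checkpoints, prompts should follow the chat format expected by the model. The sketch below is a hedged example that relies on `tokenizer.apply_chat_template`; whether a given Jais chat checkpoint ships a built-in chat template is an assumption here, so fall back to the prompt format documented on the corresponding chat model card if it does not.

```python
# Hedged sketch for the *-chat checkpoints. The model id below is one of the
# chat variants listed in the tables further down; whether its tokenizer ships
# a built-in chat template is an assumption, not something stated in this card.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

chat_model_path = "inceptionai/jais-family-2p7b-chat"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(chat_model_path)
model = AutoModelForCausalLM.from_pretrained(chat_model_path, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "What is the capital of the UAE?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.3)
# Print only the newly generated tokens, skipping the prompt portion.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```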
📚 Documentation
Jais Family Details
| Property | Details |
|----------|---------|
| Developed by | Inception, Cerebras Systems |
| Language(s) (NLP) | Arabic (MSA) and English |
| Input | Text only data |
| Output | Model generates text |
| Model Sizes | 590M, 1.3B, 2.7B, 6.7B, 7B, 13B, 30B, 70B |
| Demo | Access the live demo here |
| License | Apache 2.0 |
Pre-trained Models
| Pre-trained Model | Fine-tuned Model | Size (Parameters) | Context length (Tokens) |
|---|---|---|---|
| [jais-family-30b-16k](https://huggingface.co/inceptionai/jais-family-30b-16k) | [Jais-family-30b-16k-chat](https://huggingface.co/inceptionai/jais-family-30b-16k-chat) | 30B | 16,384 |
| [jais-family-30b-8k](https://huggingface.co/inceptionai/jais-family-30b-8k) | [Jais-family-30b-8k-chat](https://huggingface.co/inceptionai/jais-family-30b-8k-chat) | 30B | 8,192 |
| [jais-family-13b](https://huggingface.co/inceptionai/jais-family-13b) | [Jais-family-13b-chat](https://huggingface.co/inceptionai/jais-family-13b-chat) | 13B | 2,048 |
| [jais-family-6p7b](https://huggingface.co/inceptionai/jais-family-6p7b) | [Jais-family-6p7b-chat](https://huggingface.co/inceptionai/jais-family-6p7b-chat) | 6.7B | 2,048 |
| [jais-family-2p7b](https://huggingface.co/inceptionai/jais-family-2p7b) | [Jais-family-2p7b-chat](https://huggingface.co/inceptionai/jais-family-2p7b-chat) | 2.7B | 2,048 |
| [jais-family-1p3b](https://huggingface.co/inceptionai/jais-family-1p3b) | [Jais-family-1p3b-chat](https://huggingface.co/inceptionai/jais-family-1p3b-chat) | 1.3B | 2,048 |
| [jais-family-590m](https://huggingface.co/inceptionai/jais-family-590m) | [Jais-family-590m-chat](https://huggingface.co/inceptionai/jais-family-590m-chat) | 590M | 2,048 |
Adapted Pre-trained Models
| Adapted pre-trained Model | Fine-tuned Model | Size (Parameters) | Context length (Tokens) |
|---|---|---|---|
| [jais-adapted-70b](https://huggingface.co/inceptionai/jais-adapted-70b) | [Jais-adapted-70b-chat](https://huggingface.co/inceptionai/jais-adapted-70b-chat) | 70B | 4,096 |
| [jais-adapted-13b](https://huggingface.co/inceptionai/jais-adapted-13b) | [Jais-adapted-13b-chat](https://huggingface.co/inceptionai/jais-adapted-13b-chat) | 13B | 4,096 |
| [jais-adapted-7b](https://huggingface.co/inceptionai/jais-adapted-7b) | [Jais-adapted-7b-chat](https://huggingface.co/inceptionai/jais-adapted-7b-chat) | 7B | 4,096 |
Model Architecture
All models in this family are auto-regressive language models that use a transformer-based, decoder-only architecture (GPT-3). Jais models (`jais-family-*`) are trained from scratch, incorporating the SwiGLU non-linear activation function and ALiBi position encoding. Jais adapted models (`jais-adapted-*`) are built on top of Llama-2, which employs RoPE position embeddings and Grouped Query Attention. Tokenizer expansion with Arabic data is introduced, improving fertility and compute efficiency by over 3x.
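As a rough illustration of the two mechanisms named above for the from-scratch models, the sketch below implements a SwiGLU feed-forward block and ALiBi attention biases in plain PyTorch. It is a minimal reference sketch, not code from the Jais repositories; the hidden sizes, head count, and the geometric ALiBi slope schedule follow the original SwiGLU and ALiBi papers rather than any Jais-specific configuration.

```python
# Minimal, illustrative PyTorch sketches of SwiGLU and ALiBi (not Jais code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block with the SwiGLU non-linearity: (Swish(x W1) * x W2) W3."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w3 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear attention biases: slope_h * -(distance to the attended key)."""
    # Geometric slope schedule from the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = i - j, i.e. how far back key j lies from query i.
    distance = torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]
    # Future positions (negative distance) are handled by the causal mask; zero them here.
    bias = slopes[:, None, None] * -distance.clamp(min=0).float()
    return bias  # shape: (n_heads, seq_len, seq_len), added to attention logits

x = torch.randn(2, 16, 512)                 # placeholder batch of hidden states
print(SwiGLU(512, 1376)(x).shape)           # torch.Size([2, 16, 512])
print(alibi_bias(8, 16).shape)              # torch.Size([8, 16, 16])
```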
Training Details
Pretraining Data
The Jais family of models is trained on up to 1.6 trillion tokens of diverse English, Arabic, and code data from web, code, books, scientific, and synthetic sources.
| Pre-trained model | English data (tokens) | Arabic data (tokens) | Code data (tokens) | Total data (tokens) |
|---|---|---|---|---|
| [jais-family-30b-16k](https://huggingface.co/inceptionai/jais-family-30b-16k) | 980B | 490B | 196B | 1666B |
| [jais-family-30b-8k](https://huggingface.co/inceptionai/jais-family-30b-8k) | 882B | 441B | 177B | 1500B |
| [jais-family-13b](https://huggingface.co/inceptionai/jais-family-13b) | 283B | 141B | 56B | 480B |
| [jais-family-6p7b](https://huggingface.co/inceptionai/jais-family-6p7b) | 283B | 141B | 56B | 480B |
| [jais-family-2p7b](https://huggingface.co/inceptionai/jais-family-2p7b) | 283B | 141B | 56B | 480B |
| [jais-family-1p3b](https://huggingface.co/inceptionai/jais-family-1p3b) | 283B | 141B | 56B | 480B |
| [jais-family-590m](https://huggingface.co/inceptionai/jais-family-590m) | 283B | 141B | 56B | 480B |
| [jais-adapted-70b](https://huggingface.co/inceptionai/jais-adapted-70b) | 33B | 334B | 4B | 371B |
| [jais-adapted-13b](https://huggingface.co/inceptionai/jais-adapted-13b) | 127B | 140B | 13B | 280B |
| [jais-adapted-7b](https://huggingface.co/inceptionai/jais-adapted-7b) | 18B | 19B | 2B | 39B |
Fine-tuning Data
All chat models in the Jais family are fine-tuned using Arabic and English prompt-response pairs in single-turn and multi-turn settings. Data sources include open-source fine-tuning datasets and internally curated human data, supplemented with synthetic content.
Training Procedure
- Pre-training (`jais-family-*`): Documents are packed into sequences separated by EOS tokens, and the model is trained autoregressively.
- Adapted pre-training (`jais-adapted-*`): A two-stage approach is used to train the new tokenizer and Arabic embeddings.
- Instruction tuning: Examples are packed together, and the loss is masked on the prompt tokens (see the sketch after this list).
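The following is a minimal sketch of the two data-handling ideas above: packing tokenized documents into fixed-length sequences separated by EOS tokens, and masking the loss on prompt tokens during instruction tuning. The token ids, the -100 ignore index (the usual PyTorch cross-entropy convention), and the helper names are illustrative assumptions, not the authors' training code.

```python
# Illustrative sketch of document packing and prompt-loss masking.
from typing import List, Tuple

EOS_ID = 0           # placeholder EOS token id
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def pack_documents(docs: List[List[int]], seq_len: int) -> List[List[int]]:
    """Concatenate tokenized documents, separated by EOS, into seq_len chunks."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc + [EOS_ID])
    # Drop the trailing partial chunk for simplicity.
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]

def mask_prompt(prompt_ids: List[int], response_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Build (input_ids, labels) so only response tokens contribute to the loss."""
    input_ids = prompt_ids + response_ids + [EOS_ID]
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [EOS_ID]
    return input_ids, labels

print(pack_documents([[5, 6, 7], [8, 9]], seq_len=3))  # [[5, 6, 7], [0, 8, 9]]
print(mask_prompt([11, 12], [21, 22]))                 # ([11, 12, 21, 22, 0], [-100, -100, 21, 22, 0])
```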
Training Hyperparameters: Jais-family-2p7b
| Hyperparameter | Value |
|---|---|
| Precision | fp32 |
| Optimizer | AdamW |
| Learning rate | 0 to 0.01563 (<= 127 warmup steps); 0.01563 to 0.000178 (> 127 and <= 162883 steps) |
| Weight decay | 0.1 |
| Batch size | 1440 |
| Context Length | 2048 |
| Steps | 162883 |
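The learning-rate row above describes a warmup from 0 to the peak value over the first 127 steps, followed by a decay to 0.000178 by step 162883. The small sketch below reproduces that schedule assuming linear interpolation in both phases; the exact decay shape is not stated in this card, so the interpolation choice is an assumption.

```python
# Piecewise learning-rate schedule for jais-family-2p7b, as read off the table
# above. Linear interpolation in both phases is an assumption; only the
# endpoint values and step boundaries are given in the card.
WARMUP_STEPS = 127
TOTAL_STEPS = 162_883
PEAK_LR = 0.01563
FINAL_LR = 0.000178

def learning_rate(step: int) -> float:
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS          # linear warmup to the peak
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR + frac * (FINAL_LR - PEAK_LR)      # linear decay to the final value

for s in (0, 127, 81_505, 162_883):
    print(s, round(learning_rate(s), 6))
```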
Compute Infrastructure
The training process was performed on the Condor Galaxy (CG) supercomputer platform, which contains 64 Cerebras CS-2 Wafer-Scale Engines (WSE-2) with 40 GB of SRAM and achieves a total of 960 PetaFLOP/s.
🔧 Technical Details
The from-scratch models pair SwiGLU activations with ALiBi position encoding, while the adapted models inherit RoPE position embeddings and Grouped Query Attention from Llama-2. Together with the Arabic-expanded tokenizer and the bilingual pre-training and fine-tuning mixes described above, these choices are aimed at strong performance in both Arabic and English.
📄 License
The Jais family of models is released under the Apache 2.0 license.