Jais-family-6p7b: An open-source English-Arabic bilingual large language model optimized specifically for Arabic, with strong English capabilities.

Jais Family 6p7b

Developed by inceptionai

The Jais series is a family of English-Arabic bilingual large language models optimized specifically for Arabic, with strong English capabilities. This model has 6.7 billion parameters.

Large Language Model

Safetensors

Supports Multiple Languages

Open Source License: Apache-2.0

#English-Arabic Bilingual Large Model #Arabic Language Optimization #Long Context Processing

Downloads 79

Release Time : 8/2/2024

Model Overview

A Transformer decoder-based English-Arabic bilingual large language model supporting text generation tasks, with special optimization for Arabic language processing

Model Features

Bilingual Optimization

Specially optimized for Arabic while maintaining strong English capabilities, with an Arabic:English training data ratio of 1:2

Long Context Support

Natively supports a 2,048-token context length; the 30B models in the family extend this to 8K and 16K

Diverse Training Data

The family is trained on up to 1.6 trillion tokens of web pages, books, code, and scientific literature; this 6.7B model was trained on 480 billion tokens

Instruction Fine-tuning

All pre-trained models are fine-tuned with Arabic and English instruction data

Model Capabilities

Arabic Text Generation

English Text Generation

Bilingual Q&A

Code Generation

Long Text Processing

Use Cases

Research Applications

Arabic NLP Research

Used for natural language understanding and generation task research

Cultural Alignment Analysis

Research on cultural alignment mechanisms in bilingual pre-trained models

Commercial Applications

Arabic Chat Assistant

Development of intelligent dialogue systems for Arabic-speaking users

Bilingual Summarization

Generating Arabic-English bilingual document summaries

🚀 Jais Family Model Card

The Jais family of models is a comprehensive series of bilingual English-Arabic large language models (LLMs). These models are optimized for excellent performance in Arabic while also having strong English capabilities. This release aims to accelerate research in Arabic NLP and enable numerous downstream applications for the Arabic-speaking and bilingual community.

✨ Features

Two Variants of Foundation Models:
- Models pre-trained from scratch (jais-family-*).
- Models pre-trained adaptively from Llama-2 (jais-adapted-*).
Multiple Sizes: This release introduces 20 models across 8 sizes, ranging from 590M to 70B parameters, trained on up to 1.6T tokens of Arabic, English, and code data.
Instruction Fine-Tuned: All pre-trained models in this series are instruction fine-tuned (*-chat) for dialog using a curated mix of Arabic and English instruction data.

📦 Installation

No dedicated installation steps are listed. The usage example below requires the torch and transformers Python packages.

💻 Usage Examples

Basic Usage

# -*- coding: utf-8 -*-

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "inceptionai/jais-family-6p7b"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)


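def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the prompt and move the token ids to the selected device.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # Sample a continuation. max_length and min_length both count the
    # prompt tokens plus the newly generated tokens.
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,  # matches the model's 2,048-token context length
        min_length=input_len + 4,  # force at least 4 new tokens
        repetition_penalty=1.2,
        do_sample=True,
    )
    # Decode the full sequence (prompt included) back to text.
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    return response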


text = "عاصمة دولة الإمارات العربية المتحدة ه"
print(get_response(text))

text = "The capital of UAE is"
print(get_response(text))

📚 Documentation

Jais Family Details

Property	Details
Developed by	Inception, Cerebras Systems.
Language(s) (NLP)	Arabic (MSA) and English.
Input	Text only data.
Output	Model generates text.
Model Sizes	590M, 1.3B, 2.7B, 6.7B, 7B, 13B, 30B, 70B.
Demo	Access the live demo here
License	Apache 2.0

Pre-trained Models

Pre-trained Model	Fine-tuned Model	Size (Parameters)	Context length (Tokens)
[jais-family-30b-16k](https://huggingface.co/inceptionai/jais-family-30b-16k)	[Jais-family-30b-16k-chat](https://huggingface.co/inceptionai/jais-family-30b-16k-chat)	30B	16,384
[jais-family-30b-8k](https://huggingface.co/inceptionai/jais-family-30b-8k)	[Jais-family-30b-8k-chat](https://huggingface.co/inceptionai/jais-family-30b-8k-chat)	30B	8,192
[jais-family-13b](https://huggingface.co/inceptionai/jais-family-13b)	[Jais-family-13b-chat](https://huggingface.co/inceptionai/jais-family-13b-chat)	13B	2,048
[jais-family-6p7b](https://huggingface.co/inceptionai/jais-family-6p7b)	[Jais-family-6p7b-chat](https://huggingface.co/inceptionai/jais-family-6p7b-chat)	6.7B	2,048
[jais-family-2p7b](https://huggingface.co/inceptionai/jais-family-2p7b)	[Jais-family-2p7b-chat](https://huggingface.co/inceptionai/jais-family-2p7b-chat)	2.7B	2,048
[jais-family-1p3b](https://huggingface.co/inceptionai/jais-family-1p3b)	[Jais-family-1p3b-chat](https://huggingface.co/inceptionai/jais-family-1p3b-chat)	1.3B	2,048
[jais-family-590m](https://huggingface.co/inceptionai/jais-family-590m)	[Jais-family-590m-chat](https://huggingface.co/inceptionai/jais-family-590m-chat)	590M	2,048

Adapted Pre-trained Models

Adapted pre-trained Model	Fine-tuned Model	Size (Parameters)	Context length (Tokens)
[jais-adapted-70b](https://huggingface.co/inceptionai/jais-adapted-70b)	[Jais-adapted-70b-chat](https://huggingface.co/inceptionai/jais-adapted-70b-chat)	70B	4,096
[jais-adapted-13b](https://huggingface.co/inceptionai/jais-adapted-13b)	[Jais-adapted-13b-chat](https://huggingface.co/inceptionai/jais-adapted-13b-chat)	13B	4,096
[jais-adapted-7b](https://huggingface.co/inceptionai/jais-adapted-7b)	[Jais-adapted-7b-chat](https://huggingface.co/inceptionai/jais-adapted-7b-chat)	7B	4,096

Model Architecture

All models in this family are auto-regressive language models that use a transformer-based, decoder-only architecture (GPT-3).

Jais models (jais-family-*) are trained from scratch, incorporating the SwiGLU non-linear activation function and ALiBi position encoding. These architectural enhancements allow the models to extrapolate at long sequence lengths, leading to improved context handling and precision.
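As a rough illustration of how ALiBi works, the sketch below builds the per-head linear bias that is added to the attention logits in place of position embeddings. The slope formula is the one from the ALiBi paper for power-of-two head counts. The head count and sequence length used here are illustrative assumptions, not the configuration of any Jais model, and the function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes as defined in the ALiBi paper (power-of-two head counts only)."""
    if num_heads <= 0 or (num_heads & (num_heads - 1)) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j], added to the logit of query position i and key position j.

    Keys further in the past get a larger penalty; future keys are masked
    with -inf because the model is causal (decoder-only).
    """
    return [
        [
            [-m * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in alibi_slopes(num_heads)
    ]


# Illustrative values only: 8 heads, 4 positions.
bias = alibi_bias(num_heads=8, seq_len=4)
```

Because the penalty depends only on the distance between query and key, the same bias rule applies unchanged at sequence lengths longer than those seen in training, which is the property the paragraph above refers to as extrapolation.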

Jais adapted models (jais-adapted-*) are built on top of Llama-2, which employs RoPE position embedding and Grouped Query Attention. Tokenizer expansion with Arabic data is introduced, which improves fertility and compute efficiency by over 3x. Specifically, 32,000 new Arabic tokens from the Jais-30b vocabulary are added to the Llama-2 tokenizer. To initialize these new Arabic token embeddings, a linear projection from the embedding space of Jais-30b to Llama's embedding space is first learned using the set of shared English tokens present in both vocabularies. Then, this learned projection is applied to transform the existing Jais-30b Arabic embeddings into the Llama-2 embedding space.
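The embedding initialization described above can be sketched with a toy example. The text does not say how the linear projection is fitted, so ordinary least squares is used here as an assumption. All dimensions and the random "embeddings" are made up for illustration, and `fit_projection` is a hypothetical helper name.

```python
import numpy as np


def fit_projection(src_shared, tgt_shared):
    """Least-squares W such that src_shared @ W approximates tgt_shared."""
    W, *_ = np.linalg.lstsq(src_shared, tgt_shared, rcond=None)
    return W


rng = np.random.default_rng(0)
d_src, d_tgt = 6, 4        # toy embedding widths (assumed, not the real ones)
n_shared, n_new = 50, 3    # shared English tokens / new Arabic tokens (toy counts)

# Toy stand-ins: rows are token embeddings in the source and target spaces.
true_map = rng.normal(size=(d_src, d_tgt))
src_shared = rng.normal(size=(n_shared, d_src))
tgt_shared = src_shared @ true_map          # exactly linear in this toy setup
src_new = rng.normal(size=(n_new, d_src))   # source-space embeddings of new tokens

# Step 1: learn the projection on tokens present in both vocabularies.
W = fit_projection(src_shared, tgt_shared)

# Step 2: project the new tokens' source embeddings into the target space.
new_tgt_embeddings = src_new @ W
```

In this toy setup the relation is exactly linear, so the fit recovers `true_map`. With real embeddings the fit would only be approximate, and the resulting rows would serve as starting values that are trained further.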

Training Details

Pretraining Data

The Jais family of models is trained on up to 1.6 trillion tokens of diverse English, Arabic, and code data from the following sources:

Web: Publicly available web pages, Wikipedia articles, news articles, and social network content in both Arabic and English.
Code: To enhance the reasoning capability of the model, code data in various programming languages is included.
Books: A selection of publicly available Arabic and English books is used to improve long-range context modelling and coherent storytelling.
Scientific: A subset of ArXiv papers is included to improve reasoning and long-context abilities.
Synthetic: The volume of Arabic data is augmented by translating English to Arabic using an in-house machine translation system, restricted to high-quality English resources such as English Wikipedia and English books.

The training data is extensively preprocessed and deduplicated. For Arabic, a custom preprocessing pipeline is used to filter for data with high linguistic quality. More information on this pipeline can be found in the Jais paper.

Pre-trained model	English data (tokens)	Arabic data (tokens)	Code data (tokens)	Total data (tokens)
[jais-family-30b-16k](https://huggingface.co/inceptionai/jais-family-30b-16k)	980B	490B	196B	1666B
[jais-family-30b-8k](https://huggingface.co/inceptionai/jais-family-30b-8k)	882B	441B	177B	1500B
[jais-family-13b](https://huggingface.co/inceptionai/jais-family-13b)	283B	141B	56B	480B
[jais-family-6p7b](https://huggingface.co/inceptionai/jais-family-6p7b)	283B	141B	56B	480B
[jais-family-2p7b](https://huggingface.co/inceptionai/jais-family-2p7b)	283B	141B	56B	480B
[jais-family-1p3b](https://huggingface.co/inceptionai/jais-family-1p3b)	283B	141B	56B	480B
[jais-family-590m](https://huggingface.co/inceptionai/jais-family-590m)	283B	141B	56B	480B
[jais-adapted-70b](https://huggingface.co/inceptionai/jais-adapted-70b)	33B	334B	4B	371B
[jais-adapted-13b](https://huggingface.co/inceptionai/jais-adapted-13b)	127B	140B	13B	280B
[jais-adapted-7b](https://huggingface.co/inceptionai/jais-adapted-7b)	18B	19B	2B	39B
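The figures in the table above can be cross-checked with a few lines of arithmetic. The sketch below copies the table values (in billions of tokens), confirms that each row's parts add up to its stated total, and computes the Arabic-to-English ratio per model. The dictionary layout and the `summarize` helper are illustrative choices, not part of any Jais tooling.

```python
# (english, arabic, code, total) in billions of tokens, copied from the table.
DATA_MIX = {
    "jais-family-30b-16k": (980, 490, 196, 1666),
    "jais-family-30b-8k": (882, 441, 177, 1500),
    "jais-family-13b": (283, 141, 56, 480),
    "jais-family-6p7b": (283, 141, 56, 480),
    "jais-family-2p7b": (283, 141, 56, 480),
    "jais-family-1p3b": (283, 141, 56, 480),
    "jais-family-590m": (283, 141, 56, 480),
    "jais-adapted-70b": (33, 334, 4, 371),
    "jais-adapted-13b": (127, 140, 13, 280),
    "jais-adapted-7b": (18, 19, 2, 39),
}


def summarize(mix):
    """Return (rows whose parts do not sum to the total, Arabic/English ratio per model)."""
    mismatches = [
        name for name, (en, ar, code, total) in mix.items() if en + ar + code != total
    ]
    ratios = {name: ar / en for name, (en, ar, _code, _total) in mix.items()}
    return mismatches, ratios


mismatches, ratios = summarize(DATA_MIX)
for name, ratio in ratios.items():
    print(f"{name}: Arabic/English = {ratio:.2f}")
```

For the from-scratch models the ratio comes out at roughly 0.5, which matches the 1:2 Arabic-to-English mix, while the adapted models lean much more heavily toward Arabic.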

Fine-tuning Data

All chat models in the Jais family are fine-tuned using Arabic and English prompt-response pairs in both single-turn and multi-turn settings. Data sources include open-source fine-tuning datasets filtered for topic and style diversity. Internally curated human data is also incorporated to enhance cultural adaptation. This data is supplemented with content generated using synthetic methods including machine translation, distillation, and model self-chat. Overall, the updated instruction-tuning dataset comprises ~10M and ~4M prompt-response pairs in English and Arabic respectively.

Training Procedure

Pre-training of (jais-family-*) models: Documents are packed into sequences separated by EOS tokens, and the model is trained autoregressively, applying the loss to all tokens. For jais-30b models, the context length is progressively expanded from 2K to 8K to 16K by incorporating curated long-context documents in training.
Adapted pre-training of the (jais-adapted-*) models: First, the new tokenizer and Arabic embeddings are initialized as described in [Model Architecture](#model-architecture). A two-stage approach is implemented to overcome the higher norms of the new Arabic embeddings. In the first stage, the backbone of the model is frozen, and the embeddings are trained using approximately 15 billion tokens from a bilingual corpus of English and Arabic. In the second stage, the backbone is unfrozen, and continuous pretraining is conducted with all parameters.
Instruction tuning: Each training example consists of a single-turn or multi-turn prompt and its response. Examples are packed together, and the loss is masked on the prompt tokens to speed up training.
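The packing and loss-masking steps above can be sketched in plain Python on lists of token ids. This is a minimal illustration under stated assumptions: the EOS id of 0 is hypothetical, the trailing partial sequence is simply dropped, and including the EOS token in the response loss is a choice made here rather than something the text specifies.

```python
EOS = 0  # hypothetical EOS token id, for illustration only


def pack_documents(docs, seq_len, eos=EOS):
    """Pre-training style packing: join documents with EOS, cut into fixed-length sequences.

    Every token in the returned sequences contributes to the loss.
    """
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos)
    # Keep only full sequences; the trailing remainder is dropped in this sketch.
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]


def pack_chat_examples(examples, eos=EOS):
    """Instruction-tuning style packing with a loss mask.

    The mask is 0 on prompt tokens (no loss) and 1 on response tokens
    and the EOS that closes each example.
    """
    tokens, loss_mask = [], []
    for prompt, response in examples:
        tokens += prompt + response + [eos]
        loss_mask += [0] * len(prompt) + [1] * (len(response) + 1)
    return tokens, loss_mask


packed = pack_documents([[5, 6, 7], [8, 9], [10, 11, 12, 13]], seq_len=4)
chat_tokens, chat_mask = pack_chat_examples([([1, 2], [3]), ([4], [5, 6])])
```

In a training loop the mask would typically be multiplied into the per-token loss before averaging, so prompt tokens still provide context but produce no gradient signal of their own.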

Training Hyperparameters:

Jais-family-6p7b

Hyperparameter	Value
Precision	fp32
Optimizer	AdamW
Learning rate	0 to 0.01563 (<=112 warmup steps); 0.01563 to 0.000443 (>112 and <=143721 steps)
Weight decay	0.1
Batch size	1632
Context Length	2048
Steps	143721
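To make the learning-rate row concrete, the sketch below implements a warmup-then-decay schedule with the values from the table. The table gives only the endpoints of each phase, so the linear shape of both phases is an assumption. The token count at the end likewise assumes that the batch size is measured in sequences of 2,048 tokens.

```python
PEAK_LR = 0.01563
FINAL_LR = 0.000443
WARMUP_STEPS = 112
TOTAL_STEPS = 143721
BATCH_SIZE = 1632       # assumed to count sequences
CONTEXT_LENGTH = 2048


def learning_rate(step):
    """Linear warmup from 0 to PEAK_LR, then linear decay to FINAL_LR (assumed shape)."""
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR + frac * (FINAL_LR - PEAK_LR)


# Tokens processed over the full run, under the batch-size assumption above.
tokens_seen = BATCH_SIZE * CONTEXT_LENGTH * TOTAL_STEPS
print(f"lr at step 56: {learning_rate(56):.6f}")
print(f"tokens seen: {tokens_seen / 1e9:.1f}B")
```

Under that assumption the run covers about 480 billion tokens, which agrees with the 480B total listed for jais-family-6p7b in the pretraining data table.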

Compute Infrastructure

The training process was performed on the Condor Galaxy (CG) supercomputer platform. A CG contains 64 Cerebras CS-2 Wafer-Scale Engines (WSE-2) with 40 GB of SRAM, and achieves a total of 960 PetaFLOP/s.

Evaluation

A comprehensive evaluation of Jais models focusing on both English and Arabic was conducted using LM-harness in a zero-shot setting. The evaluation criteria covered various dimensions:

Knowledge: How well the model answers factual questions.
Reasoning: The model's ability to answer questions requiring reasoning.
...

🔧 Technical Details

Technical details are covered in the Model Architecture and Training Details sections.

📄 License

The Jais family of models is released under the Apache 2.0 license.
