🚀 Noon - a 7-billion parameter Arabic Large Language Model
Noon is an Arabic Large Language Model based on BLOOM, trained on a large-scale Arabic dataset and capable of responding to a wide range of instructions and questions.
🚀 Quick Start
We present the 7-billion parameter variant of Noon, an Arabic Large Language Model based on BLOOM, a foundation model released by the BigScience workshop. Noon was trained with the main focus of producing a model that responds to various types of instructions and questions (text generation, code generation, mathematical problems, closed/open-book questions, etc.).
✨ Features
- Multifunctional: Capable of handling various tasks such as text generation, code generation, and solving mathematical problems.
- Large-scale Training: Trained on over 110,000 Arabic data records covering more than 11 million words.
- Advanced Training Techniques: Utilized distributed training across multiple GPUs, LoRA (Low-Rank Adaptation), and ZeRO (Zero Redundancy Optimizer).
📦 Installation
Using our model requires only the Transformers library. If the dependencies are not already installed (a minimal setup; PyTorch is assumed as the model backend):
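```bash
pip install transformers torch
```

The model can then be loaded and run as follows: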
```python
from transformers import BloomTokenizerFast, BloomForCausalLM, pipeline

# Example instruction (Arabic): "Write an article of several lines about
# artificial intelligence and its developments"
text = "اكتب مقالا من عدة أسطر عن الذكاء الصناعي وتطوراته"

# Wrap the instruction in the Instruction/Response prompt template
prompt = f'Instruction:\n{text}\n\nResponse:'

model = BloomForCausalLM.from_pretrained('Naseej/noon-7b')
tokenizer = BloomTokenizerFast.from_pretrained('Naseej/noon-7b')

generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Beam-search decoding; note that with do_sample=False the top_p/top_k
# sampling parameters have no effect
response = generation_pipeline(prompt,
                               pad_token_id=tokenizer.eos_token_id,
                               do_sample=False,
                               num_beams=4,
                               max_length=500,
                               top_p=0.1,
                               top_k=20,
                               repetition_penalty=3.0,
                               no_repeat_ngram_size=3)[0]['generated_text']

print(response)
```
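Note that the `Instruction:`/`Response:` template above mirrors the format used in the model's instruction data; prompts that omit it may yield lower-quality completions.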
🔧 Technical Details
Training Framework
We trained the model using the ColossalAI framework, which fully supports HuggingFace library models and implements various optimization and quantization techniques for billion-scale LLMs.
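The ColossalAI training scripts themselves are not included in this repository. As an illustration of the LoRA setup described above, here is a minimal sketch using the Hugging Face `peft` library as a stand-in; the rank and alpha values are assumptions, not the settings used for Noon, and the target module name is inferred from the BLOOM architecture:

```python
from peft import LoraConfig, get_peft_model
from transformers import BloomForCausalLM

# Load the BLOOM base model (illustrative; the actual run used ColossalAI)
model = BloomForCausalLM.from_pretrained("bigscience/bloom-7b1")

# LoRA freezes the base weights and trains small low-rank adapters instead.
# r / lora_alpha below are placeholder values, not Noon's actual config.
lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling factor for the adapter output
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable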
Training Hardware
Noon-7b was trained on 8 A100 GPUs using distributed multi-GPU training via the ColossalAI framework.
Training Data
The training data is a combination of Arabic datasets covering multiple tasks. It includes:
- [Second version of the Alpaca dataset](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM), generated using GPT-4.
- Self-instruct records, split between samples generated by us using the [self-instruct](https://github.com/yizhongw/self-instruct) framework, and further translated ones.
- The instructional dataset released by Databricks, which comprises high-quality human-generated instructions and responses.
- The TruthfulQA dataset, to further guide the model on how to truthfully respond to factoid-based questions.
- The Grade School Math dataset, to enhance the model's performance on chain-of-thought mathematical problems.
- Arabic arithmetic problems, generated by us using ChatGPT to further improve the model's ability to solve mathematical problems. (A sketch of how such records can be serialized appears below.)

The full dataset adds up to over 110K records.
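The exact serialization of these records is not documented here; a plausible sketch, assuming the same Instruction/Response template used in the Quick Start example (the field names are assumptions):

```python
def format_record(record: dict) -> str:
    """Serialize one training record into the Instruction/Response template.

    Assumes each record carries 'instruction' and 'response' fields; the
    actual field names in Noon's training data are not documented here.
    """
    return f"Instruction:\n{record['instruction']}\n\nResponse:\n{record['response']}"

# Hypothetical example record
example = {
    "instruction": "اكتب مقالا من عدة أسطر عن الذكاء الصناعي وتطوراته",
    "response": "الذكاء الاصطناعي هو ...",
}
print(format_record(example))
```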
📚 Documentation
Evaluation
Across a set of over 4,000 Arabic data samples, Noon-7b was automatically evaluated using OpenAI's GPT-3.5 Turbo model. Provided with clear, carefully crafted evaluation criteria (aligned with the model's training objective as well as the syntactic and grammatical rules of the Arabic language), GPT-3.5 Turbo was prompted to rate each of Noon's responses to an input instruction on a scale of 1 to 5. We concluded the evaluation by averaging the scores, yielding a final score of 4.07/5.
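The evaluation harness itself is not published; below is a minimal sketch of such a GPT-based grading loop using the `openai` Python client. The rubric prompt and the sample pair are placeholders, not the actual criteria or data used:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder rubric; the real criteria covered instruction adherence and
# Arabic syntax/grammar, per the description above.
RUBRIC = (
    "Rate the following response to the instruction on a scale of 1-5, "
    "judging instruction adherence and Arabic grammar. Reply with the number only."
)

def grade(instruction: str, response: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Instruction:\n{instruction}\n\nResponse:\n{response}"},
        ],
    )
    return int(completion.choices[0].message.content.strip())

# 'samples' stands in for the ~4,000 (instruction, response) evaluation pairs
samples = [("اكتب جملة عن الذكاء الصناعي", "الذكاء الاصطناعي يغير العالم.")]
scores = [grade(i, r) for i, r in samples]
print(f"Average score: {sum(scores) / len(scores):.2f}/5")
```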
Disclaimer
The generated responses from this AI model are purely algorithmic and should be interpreted with caution. The model's outputs may occasionally exhibit bias, offensive language, or potentially harmful content. It is important to note that these responses do not reflect the personal preferences or viewpoints of the authors or of Naseej as an organization. While every effort is made to mitigate the harmfulness of the model's outputs, it is impossible to guarantee the complete elimination of biases or offensive content. The model learns from vast amounts of data and may inadvertently replicate or amplify existing societal biases present in the training data. Users are advised to critically evaluate and verify the information provided by the model, and to exercise discretion when utilizing the model's responses, particularly on sensitive or controversial topics. We are committed to ongoing research and development to improve the model's performance, minimize biases, and reduce harmful outputs. Your feedback and insights are valuable in helping us achieve these goals.
📄 License
The model is licensed under bigscience-bloom-rail-1.0.