🚀 Fanar-1-9B-Instruct
Fanar-1-9B-Instruct is a powerful Arabic-English large language model (LLM) developed by the Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU), a member of the Qatar Foundation for Education, Science, and Community Development. It is the instruction-tuned version of Fanar-1-9B, which we continually pretrained from the google/gemma-2-9b model on 1 trillion Arabic and English tokens. Special attention is paid to the richness of the Arabic language, with support for Modern Standard Arabic (MSA) and a diverse range of Arabic dialects such as Gulf, Levantine, and Egyptian. Through meticulous curation of pretraining and instruction-tuning data, Fanar models are aligned with Islamic values and Arab cultures.
Fanar-1-9B-Instruct is a core part of the Fanar GenAI platform, which offers a variety of capabilities, including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS) and automatic speech recognition (ASR), attribution and fact-checking, Islamic RAG, and several other features.
We have published a comprehensive report with all the details about our Fanar GenAI platform. We also provide an API to our models and the GenAI platform (request access here).
🚀 Quick Start
Fanar-1-9B-Instruct is compatible with the Hugging Face transformers
library (≥ v4.40.0). Here's how to load and use the model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "QCRI/Fanar-1-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Message content may be in Arabic or English.
messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},
]

# Render the chat template to a prompt string (add_generation_prompt appends
# the assistant turn marker), then tokenize and move the inputs to the model.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
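Note that decoding the full output sequence returns the prompt together with the reply. To print only the model's answer, you can slice off the prompt tokens first (a small sketch building on the variables above):

```python
# Keep only the tokens generated after the prompt.
reply_ids = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(reply_ids, skip_special_tokens=True))
```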
Inference with vLLM is also supported:
```python
from vllm import LLM, SamplingParams

model_name = "QCRI/Fanar-1-9B-Instruct"
llm = LLM(model=model_name)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Message content may be in Arabic or English.
messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},
]

# llm.chat applies the model's chat template automatically.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```
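vLLM can also serve the model behind an OpenAI-compatible HTTP endpoint. A minimal sketch (the exact flags depend on your vLLM version and hardware; `--max-model-len 4096` matches the model's context length listed below):

```bash
vllm serve QCRI/Fanar-1-9B-Instruct --max-model-len 4096
```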
✨ Features
- Multilingual Support: Supports both Arabic and English, with a focus on the richness of the Arabic language, including Modern Standard Arabic (MSA) and various dialects.
- Cultural Alignment: Aligned with Islamic values and Arab cultures through meticulous data curation.
- Comprehensive Capabilities: Part of the Fanar GenAI platform, whose feature suite includes image generation, video and image understanding, deep thinking, advanced TTS and ASR, attribution and fact-checking, and Islamic RAG.
📦 Installation
The model is loaded through the Hugging Face transformers library; make sure you have transformers version ≥ 4.40.0 installed. You can install it with the following command (quoted so the shell does not interpret `>=` as a redirection):

```bash
pip install "transformers>=4.40.0"
```
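If you want to run the vLLM example above, the vllm package is installed separately (shown as a sketch; choose the version appropriate for your CUDA/hardware setup):

```bash
pip install vllm
```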
🔧 Technical Details
Model Details
| Property | Details |
|---|---|
| Developed by | QCRI at HBKU |
| Sponsored by | Ministry of Communications and Information Technology, State of Qatar |
| Model Type | Autoregressive Transformer |
| Parameter Count | 8.7 Billion |
| Context Length | 4096 Tokens |
| Input | Text only |
| Output | Text only |
| Training Framework | LitGPT |
| Pretraining Token Count | 1 Trillion (Arabic + English) |
| SFT Instructions | 4.5M |
| DPO Preference Pairs | 250K |
| Languages | Arabic, English |
| License | Apache 2.0 |
Model Training
Pretraining
Fanar-1-9B-Instruct was continually pretrained on 1T tokens with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the Dolma dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from The Stack dataset. Pretraining was carried out with the LitGPT framework.
Post-training
Fanar-1-9B-Instruct underwent a two-phase post-training pipeline:
| Phase | Size |
|---|---|
| Supervised Fine-tuning (SFT) | 4.5M Instructions |
| Direct Preference Optimization (DPO) | 250K Preference Pairs |
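For reference, DPO optimizes a contrastive objective directly on preference pairs, without training a separate reward model. The standard formulation (Rafailov et al., 2023) is shown below as background on the technique; for Fanar's exact training recipe, please consult the report:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, $\pi_{\mathrm{ref}}$ is typically the SFT checkpoint, and $\beta$ controls how far the policy may drift from the reference.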
📄 License
This model is licensed under the Apache 2.0 License.
Intended Use
Fanar-1-9B-Instruct is built for:
- Conversational agents (Arabic-only or bilingual)
- Cultural and dialectal question answering in Arabic
- Educational, governmental, and civic NLP applications focused on the Arab world or Arabic-speaking audiences
- Research on Arabic natural language generation and understanding
Fanar-1-9B-Instruct can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment; a minimal illustration follows below. It should not be used to generate or spread harmful, illegal, or misleading content.
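As a purely illustrative sketch of one such safeguard (not part of the Fanar release; the term list and refusal message are hypothetical placeholders), a deployment might screen generated text before returning it to users. Production systems should rely on proper moderation models or services rather than a simple blocklist:

```python
# Hypothetical post-generation safeguard: block replies containing flagged terms.
BLOCKED_TERMS = {"placeholder-banned-term"}  # placeholder; not an official list

def safe_reply(generated_text: str) -> str:
    lowered = generated_text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "عذراً، لا أستطيع المساعدة في هذا الطلب."  # "Sorry, I can't help with this request."
    return generated_text
```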
A version of this model can be accessed through Fanar Chat. We are continuously improving Fanar's models and capabilities, so answers there may differ from what you get from Fanar-1-9B-Instruct.
Ethical Considerations & Limitations
Fanar-1-9B-Instruct is capable of generating fluent and contextually appropriate responses. However, as with any generative model, there are risks: the model may produce biased, offensive, or incorrect outputs, and it is not suitable for high-stakes decision-making (e.g., legal, medical, or financial advice). Although we have extensively tested Fanar-1-9B-Instruct and attempted to mitigate these issues, we cannot address every possible scenario. We therefore advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our Terms of Service and Privacy Policy.
The output generated by this model is not considered a statement by QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.
Evaluation
Evaluation was conducted using a modified version of the LM Evaluation Harness and internal cultural alignment benchmarks.
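The modified harness is not bundled with this model card; as a rough sketch, a comparable English MMLU run with the stock LM Evaluation Harness would look like the following (task names and flags are the stock harness's and may differ from the internal setup):

```bash
pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=QCRI/Fanar-1-9B-Instruct \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size auto
```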
Model | MMLU (5-shot) | MMMLU (Arabic) (0-shot) | ArabicMMLU (3-shot) | HellaSwag (0-shot) | PIQA (0-shot) | ARC Challenge (0-shot) | Belebele (Arabic) (3-shot) | ACVA (5-shot) | GSM8k | OALL (0-shot) | OALL v2 (0-shot) | Almieyar Arabic (3-shot) | Arab Cultural MCQ (3-shot) | AraDiCE PIQA (MSA) (0-shot) | AraDiCE PIQA (Egy) (0-shot) | AraDiCE PIQA (Lev) (0-shot) | AraDiCE ArabicMMLU (Egy) (0-shot) | AraDiCE ArabicMMLU (Lev) (0-shot) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Fanar-1-9B-it | 71.53% | 58.89% | 67.69% | 83.16% | 82.54% | 67.15% | 83.22% | 80.02% | 74.60% | 68.32% | 66.29% | 78.68% | 72.40% | 67.68% | 63.66% | 59.03% | 59.63% | 60.62% |
ALLaM-7B-Instruct-preview | 60.72% | 54.89% | 68.59% | 76.35% | 80.52% | 51.62% | 75.80% | 74.52% | 46.63% | 57.31% | 63.66% | 76.31% | 74.20% | 67.52% | 63.44% | 60.88% | 62.50% | 64.17% |
aya-expanse-8b | 62.85% | 47.14% | 60.10% | 78.54% | 81.18% | 56.40% | 70.78% | 77.11% | 8.26% | 53.18% | 59.74% | 70.20% | 67.30% | 63.00% | 59.41% | 56.53% | 53.52% | 53.71% |
c4ai-command-r7b-arabic-02-2025 | 66.91% | 49.54% | 63.06% | 74.67% | 78.02% | 49.15% | 72.78% | 79.80% | 30.33% | 49.38% | 64.44% | 73.82% | 69.20% | 62.30% | 60.99% | 56.69% | 54.78% | 56.06% |
AceGPT-v2-8B-Chat | 66.45% | 51.16% | 62.61% | 79.21% | 80.58% | 53.50% | 74.56% | 77.66% | 41.77% | 50.16% | 60.40% | 74.31% | 68.90% | 64.58% | 61.32% | 56.91% | 54.53% | 53.91% |
gemma-2-9b-it | 71.65% | 57.93% | 64.16% | 79.06% | 79.38% | 63.99% | 78.31% | 80.67% | 60.95% | 56.11% | 64.21% | 73.69% | 68.60% | 61.26% | 59.96% | 57.24% | 57.95% | 59.25% |
jais-adapted-13b-chat | 56.64% | 44.45% | 58.97% | 80.86% | 80.47% | 54.27% | 67.52% | 75.24% | 44.05% | 46.41% | 56.56% | 65.46% | 65.30% | 61.10% | 58.05% | 55.77% | 52.87% | 53.59% |
jais-family-6p7b-chat | 49.42% | 41.59% | 55.80% | 72.04% | 74.05% | 44.62% | 65.11% | 72.04% | 53.68% | 48.20% | 54.73% | 61.72% | 64.10% | 62.51% | 60.12% | 57.24% | 49.11% | 47.49% |
Llama-3.1-8B-Instruct | 68.04% | 47.58% | 59.05% | 79.22% | 80.74% | 55.29% | 66.72% | 76.67% | 29.26% | 47.81% | 55.97% | 69.70% | 66.10% | 58.11% | 55.39% | 54.24% | 46.86% | 47.52% |
Qwen2.5-7B-Instruct | 74.21% | 55.63% | 63.96% | 80.44% | 79.92% | 55.03% | 74.61% | 78.09% | 71.34% | 54.19% | 62.69% | 75.69% | 68.10% | 60.55% | 58.65% | 56.04% | 48.74% | 53.42% |
Citation
If you use Fanar-1-9B-Instruct or the Fanar GenAI system in your research or applications, please cite:
```bibtex
@misc{fanarllm2025,
  title={Fanar: An Arabic-Centric Multimodal Generative AI Platform},
  author={Fanar Team and Ummar Abbas and Mohammad Shahmeer Ahmad and Firoj Alam and Enes Altinisik and Ehsannedin Asgari and Yazan Boshmaf and Sabri Boughorbel and Sanjay Chawla and Shammur Chowdhury and Fahim Dalvi and Kareem Darwish and Nadir Durrani and Mohamed Elfeky and Ahmed Elmagarmid and Mohamed Eltabakh and Masoomali Fatehkia and Anastasios Fragkopoulos and Maram Hasanain and Majd Hawasly and Mus'ab Husaini and Soon-Gyo Jung and Ji Kim Lucas and Walid Magdy and Safa Messaoud and Abubakr Mohamed and Tasnim Mohiuddin and Basel Mousi and Hamdy Mubarak and Ahmad Musleh and Zan Naeem and Mourad Ouzzani and Dorde Popovic and Amin Sadeghi and Husrev Taha Sencar and Mohammed Shinoy and Omar Sinan and Yifan Zhang and Ahmed Ali and Yassine El Kheir and Xiaosong Ma and Chaoyi Ruan},
  year={2025},
  url={https://arxiv.org/abs/2501.13944},
}
```
Acknowledgements
This project is from the Qatar Computing Research Institute (QCRI) at Hamad Bin Khalifa University (HBKU), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models. Special thanks to the Ministry of Communications and Information Technology, State of Qatar, for their continued support in providing compute infrastructure through the Google Cloud Platform.

