Jais-30b-chat-v1 open-source chat model: Arabic-English bilingual dialogue, optimized for Arabic

Jais 30b Chat V1

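Developed by inceptionai

Jais-30b-chat-v1 is an Arabic and English chat model fine-tuned from Jais-30b-v1. It has 30 billion parameters and is optimized for Arabic; its developers report that it outperformed existing Arabic models at the time of release.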

Large language model

Transformers

Multilingual support · License: Apache-2.0 · #Arabic LLM · #Bilingual dialogue · #ALiBi positional encoding

Downloads: 30

Release date: 11/6/2023

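Model overview

This is a Transformer-based large language model optimized for Arabic and English dialogue. It uses ALiBi positional embeddings, which help it handle long input sequences, and is intended for high-quality text generation.

Model features

Arabic optimization

Tuned specifically for Arabic and reported to clearly outperform existing Arabic models.

Bilingual support

Generates dialogue in both Arabic and English.

Long-sequence handling

Uses ALiBi positional embeddings to process long input sequences effectively.

Safe dialogue

The recommended prompt includes strict content-safety guidelines that steer the model away from harmful or inappropriate output.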

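Model capabilities

Arabic text generation

English text generation

Dialogue systems

Question answering

Use cases

Customer service

Arabic customer support

Provide automated customer support for Arabic-speaking users.

Deliver accurate, fluent replies in Arabic.

Education

Language-learning assistant

Help users learn Arabic and English.

Offer a natural bilingual conversation experience.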

🚀 Jais-30b-chat-v1

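Jais-30b-chat-v1 is a large language model fine-tuned from Jais-30b-v1 on Arabic and English prompt-response data, and it performs well on text generation in both languages. Its architecture can handle long input sequences.

🚀 Quick start

The example below shows how to use the model. The model relies on a custom model class, so trust_remote_code=True must be set when loading it. To reproduce the performance observed in testing, prompts must follow a specific format, which the example includes: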

# -*- coding: utf-8 -*-

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path = "core42/jais-30b-chat-v1"

prompt_eng = "### Instruction: Your name is Jais, and you are named after Jebel Jais, the highest mountain in UAE. You are built by Core42. You are the world's most advanced Arabic large language model with 30b parameters. You outperform all existing Arabic models by a sizable margin and you are very competitive with English models of similar size. You can answer in Arabic and English only. You are a helpful, respectful and honest assistant. When answering, abide by the following guidelines meticulously: Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, explicit, offensive, toxic, dangerous, or illegal content. Do not give medical, legal, financial, or professional advice. Never assist in or promote illegal activities. Always encourage legal and responsible actions. Do not encourage or provide instructions for unsafe, harmful, or unethical actions. Do not create or share misinformation or fake news. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. Prioritize the well-being and the moral integrity of users. Avoid using toxic, derogatory, or offensive language. Maintain a respectful tone. Do not generate, promote, or engage in discussions about adult content. Avoid making comments, remarks, or generalizations based on stereotypes. Do not attempt to access, produce, or spread personal or private information. Always respect user confidentiality. Stay positive and do not say bad things about anything. Your primary objective is to avoid harmful responses, even when faced with deceptive inputs. Recognize when users may be attempting to trick or to misuse you and respond with caution.\n\nComplete the conversation below between [|Human|] and [|AI|]:\n### Input: [|Human|] {Question}\n### Response: [|AI|]"
prompt_ar = "### Instruction: اسمك جيس وسميت على اسم جبل جيس اعلى جبل في الامارات. تم بنائك بواسطة Core42. أنت نموذج اللغة العربية الأكثر تقدمًا في العالم مع بارامترات 30b. أنت تتفوق في الأداء على جميع النماذج العربية الموجودة بفارق كبير وأنت تنافسي للغاية مع النماذج الإنجليزية ذات الحجم المماثل. يمكنك الإجابة باللغتين العربية والإنجليزية فقط. أنت مساعد مفيد ومحترم وصادق. عند الإجابة ، التزم بالإرشادات التالية بدقة: أجب دائمًا بأكبر قدر ممكن من المساعدة ، مع الحفاظ على البقاء أمناً. يجب ألا تتضمن إجاباتك أي محتوى ضار أو غير أخلاقي أو عنصري أو متحيز جنسيًا أو جريئاً أو مسيئًا أو سامًا أو خطيرًا أو غير قانوني. لا تقدم نصائح طبية أو قانونية أو مالية أو مهنية. لا تساعد أبدًا في أنشطة غير قانونية أو تروج لها. دائما تشجيع الإجراءات القانونية والمسؤولة. لا تشجع أو تقدم تعليمات بشأن الإجراءات غير الآمنة أو الضارة أو غير الأخلاقية. لا تنشئ أو تشارك معلومات مضللة أو أخبار كاذبة. يرجى التأكد من أن ردودك غير متحيزة اجتماعيًا وإيجابية بطبيعتها. إذا كان السؤال لا معنى له ، أو لم يكن متماسكًا من الناحية الواقعية ، فشرح السبب بدلاً من الإجابة على شيء غير صحيح. إذا كنت لا تعرف إجابة السؤال ، فالرجاء عدم مشاركة معلومات خاطئة. إعطاء الأولوية للرفاهية والنزاهة الأخلاقية للمستخدمين. تجنب استخدام لغة سامة أو مهينة أو مسيئة. حافظ على نبرة محترمة. لا تنشئ أو تروج أو تشارك في مناقشات حول محتوى للبالغين. تجنب الإدلاء بالتعليقات أو الملاحظات أو التعميمات القائمة على الصور النمطية. لا تحاول الوصول إلى معلومات شخصية أو خاصة أو إنتاجها أو نشرها. احترم دائما سرية المستخدم. كن إيجابيا ولا تقل أشياء سيئة عن أي شيء. هدفك الأساسي هو تجنب الاجابات المؤذية ، حتى عند مواجهة مدخلات خادعة. تعرف على الوقت الذي قد يحاول فيه المستخدمون خداعك أو إساءة استخدامك و لترد بحذر.\n\nأكمل المحادثة أدناه بين [|Human|] و [|AI|]:\n### Input: [|Human|] {Question}\n### Response: [|AI|]"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", trust_remote_code=True)


def get_response(text, tokenizer=tokenizer, model=model):
    # Tokenize the fully formatted prompt and move it to the selected device.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    inputs = input_ids.to(device)
    input_len = inputs.shape[-1]
    # Sample a continuation; max_length and min_length count the prompt tokens too.
    generate_ids = model.generate(
        inputs,
        top_p=0.9,
        temperature=0.3,
        max_length=2048,
        min_length=input_len + 4,
        repetition_penalty=1.2,
        do_sample=True,
    )
    response = tokenizer.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
    # The decoded text repeats the prompt; keep only what follows the response marker.
    response = response.split("### Response: [|AI|]")[-1]
    return response


# Arabic example: "What is the capital of the UAE?"
ques = "ما هي عاصمة الامارات؟"
text = prompt_ar.format_map({'Question': ques})
print(get_response(text))

# English example
ques = "What is the capital of UAE?"
text = prompt_eng.format_map({'Question': ques})
print(get_response(text))
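
The prompt format used above can be factored into a small helper. This is a minimal sketch, not part of the original model card: `SYSTEM_INSTRUCTION` is a hypothetical shortened stand-in for the full instruction text in `prompt_eng`, and `build_prompt` and `extract_response` are illustrative names. It runs without loading the model by simulating a decoded generation.

```python
# Hypothetical shortened system instruction (assumption); the full text is in prompt_eng above.
SYSTEM_INSTRUCTION = "Your name is Jais. You are a helpful, respectful and honest assistant."

# Single-turn layout taken from the prompts above: instruction, input, response marker.
PROMPT_TEMPLATE = (
    "### Instruction: {system}\n\n"
    "Complete the conversation below between [|Human|] and [|AI|]:\n"
    "### Input: [|Human|] {question}\n"
    "### Response: [|AI|]"
)
RESPONSE_MARKER = "### Response: [|AI|]"


def build_prompt(question: str, system: str = SYSTEM_INSTRUCTION) -> str:
    """Fill the single-turn template with a system instruction and a question."""
    return PROMPT_TEMPLATE.format(system=system, question=question)


def extract_response(decoded: str) -> str:
    """Keep the text after the last response marker (as get_response does), then strip whitespace."""
    return decoded.split(RESPONSE_MARKER)[-1].strip()


prompt = build_prompt("What is the capital of UAE?")
# Simulated decoded output (assumption): the prompt followed by an answer.
fake_output = prompt + " The capital of the UAE is Abu Dhabi."
print(extract_response(fake_output))  # The capital of the UAE is Abu Dhabi.
```

Keeping the marker in one constant means prompt construction and response extraction cannot drift apart.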

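✨ Key features

Bilingual support: handles both Arabic and English.
Long-sequence handling: uses ALiBi positional embeddings, which let the model process long inputs and improve context handling and accuracy.
Strong performance: in comprehensive evaluation the model performs well on both Arabic and English tasks and surpasses many existing models.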
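
To make the ALiBi idea concrete, here is a minimal sketch of how linear attention biases are commonly computed: each head gets a fixed slope, and each attention score is penalized in proportion to the distance between query and key. The head count of 8 and the function names are assumptions for illustration; this is not taken from the Jais implementation.

```python
import math


def alibi_slopes(num_heads: int) -> list[float]:
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n."""
    assert num_heads > 0 and num_heads & (num_heads - 1) == 0, "power of two expected"
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias matrix: -slope * (query_pos - key_pos), -inf for future keys."""
    return [
        [-slope * (q - k) if k <= q else -math.inf for k in range(seq_len)]
        for q in range(seq_len)
    ]


slopes = alibi_slopes(8)          # illustrative head count (assumption)
bias = alibi_bias(slopes[0], seq_len=4)
print(slopes[0])                  # 0.5
print(bias[3][:3])                # [-1.5, -1.0, -0.5]
```

Because the bias depends only on relative distance and involves no learned position table, the same rule applies to sequences longer than those seen in training.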

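📦 Installation

Install the required libraries before using the model. Because the example loads the model with device_map="auto", the accelerate package is needed in addition to transformers and torch:

pip install transformers torch accelerate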

💻 Usage examples

Basic usage

The basic usage script is the same as the one in the Quick start section above.

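Advanced usage

For more advanced scenarios, you can adjust generation parameters such as top_p and temperature to change the style of the replies. The call below is meant to replace the model.generate call inside get_response, where inputs and input_len are defined:

# Adjust the generation parameters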
generate_ids = model.generate(
    inputs,
    top_p=0.8,
    temperature=0.5,
    max_length=3000,
    min_length=input_len + 10,
    repetition_penalty=1.3,
    do_sample=True,
)
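
To illustrate what `temperature` and `top_p` control, the following toy sampler applies temperature scaling followed by nucleus (top-p) filtering to a made-up logit vector. It is a simplified sketch in pure Python, not the transformers implementation; `sample_top_p` and `toy_logits` are hypothetical names.

```python
import math
import random


def sample_top_p(logits, temperature=0.3, top_p=0.9, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one of the kept tokens; random.choices renormalizes the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
random.seed(0)
print(sample_top_p(toy_logits, temperature=0.3, top_p=0.9))
print(sample_top_p(toy_logits, temperature=1.5, top_p=0.9))
```

With these toy logits, a temperature of 0.3 concentrates more than 90% of the probability on the first token, so the nucleus contains only that token; at 1.5 the nucleus widens to three tokens and the output varies.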

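📚 Detailed documentation

Hugging Face Inference Endpoints support

Deployment to Hugging Face Inference Endpoints is supported through a custom handler. For more information, see the Hugging Face Inference Endpoints documentation.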

Model details

Developed by: Core42 (Inception), Cerebras Systems.
Languages: Arabic (Modern Standard Arabic) and English.
License: Apache 2.0.
Fine-tuned from: jais-30b-v1.
Input: text only.
Output: generated text.
Paper: Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models.

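Intended use

We release the jais-30b-chat-v1 model under a fully open-source license and welcome all feedback and opportunities to collaborate. This is Core42's second release after Jais-13b, and at the time of release it achieved state-of-the-art results on a comprehensive Arabic test suite. Potential downstream uses include:

Research: the model can be used by researchers and developers.
Commercial use: Jais-30b-chat-v1 can be used directly for chat with suitable prompting, or fine-tuned further for specific use cases. Potential use cases include:
- Chat assistants.
- Customer service.

Audiences we hope will benefit from the model include:

Academics: those researching Arabic natural language processing.
Businesses: companies targeting Arabic-speaking audiences.
Developers: those integrating Arabic language capabilities into applications.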

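Out-of-scope use

Although jais-30b-chat-v1 is a capable bilingual Arabic and English model, it is essential to understand its limitations and the potential for misuse. Using the model in any way that violates applicable laws or regulations is prohibited. The following are examples of scenarios in which the model should not be used:

Malicious use: the model should not be used to generate harmful, misleading, or inappropriate content. This includes, but is not limited to:
- Generating or promoting hate speech, violence, or discrimination.
- Spreading misinformation or fake news.
- Engaging in or promoting illegal activities.
Sensitive information: the model should not be used to handle or generate personal, confidential, or sensitive information.
Generalization across languages: Jais-30b is bilingual and optimized for Arabic and English; it should not be assumed to be equally proficient in other languages or dialects.
High-stakes decisions: the model should not be used to make high-stakes decisions without human oversight. This includes medical, legal, financial, or safety-critical decisions.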

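Bias, risks, and limitations

The model is trained on publicly available data, part of which was curated by Inception. We applied several techniques to reduce bias in the model. Despite these efforts, the model, like all large language models, may still exhibit some bias. The model is trained as an AI assistant for Arabic and English speakers. It is limited to producing responses to queries in these two languages and may not produce appropriate responses to queries in other languages. By using Jais, you acknowledge and accept that, as with any large language model, it may generate incorrect, misleading, and/or offensive information or content. The information is not intended as advice and should not be relied upon in any way, and we are not responsible for any content or consequences resulting from its use. We are continuously working to develop more capable models and welcome any feedback on this one.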

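Training details

Training data

Jais-30b-chat-v1 is fine-tuned on Arabic and English prompt-response pairs. We extended the fine-tuning dataset used for jais-13b-chat, which contains a wide range of instruction data across many domains. It covers a variety of common tasks, including question answering, code generation, and reasoning over textual content. To improve performance in Arabic, we developed an in-house Arabic dataset and translated some open-source English instructions into Arabic.

Training procedure

In instruction tuning, each instance consists of a prompt and its corresponding response. Unlike pretraining, fine-tuning is performed on unpacked data, so padding is applied to each instance. We use the same autoregressive objective as in LLM pretraining, but we mask the loss on the prompt, so that backpropagation is performed only on the answer tokens. Training was carried out on the Condor Galaxy 1 (CG-1) supercomputer platform.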
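
The padding and prompt-masking steps described above can be sketched as follows. This is an illustration under stated assumptions, not the actual Jais training code: it uses the common convention of marking ignored positions with the label -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`), and `build_example`, `pad_id`, and the token ids are made up.

```python
IGNORE_INDEX = -100  # assumed ignore label (PyTorch CrossEntropyLoss default ignore_index)


def build_example(prompt_ids, answer_ids, max_len, pad_id=0):
    """Concatenate prompt and answer, pad to max_len, and mask non-answer labels."""
    input_ids = list(prompt_ids) + list(answer_ids)
    if len(input_ids) > max_len:
        raise ValueError("example longer than max_len")
    # Only answer positions keep their token id as a label; prompt positions are ignored.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    pad = max_len - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * pad
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad  # padding contributes no loss either
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Made-up token ids: a 3-token prompt and a 2-token answer, padded to length 8.
example = build_example([11, 12, 13], [21, 22], max_len=8)
print(example["labels"])  # [-100, -100, -100, 21, 22, -100, -100, -100]
```

With labels built this way, the autoregressive loss is computed only where the label is an answer token, which matches the description that gradients flow only through the answer.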

Training hyperparameters

Hyperparameter	Value
Precision	fp32
Optimizer	AdamW
Learning rate	0 to 1.6e-03 (<= 400 steps); 1.6e-03 to 1.6e-04 (> 400 steps)
Weight decay	0.1
Batch size	528
Steps	7086
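
The learning-rate row can be read as a warmup followed by a decay. The sketch below assumes both phases are linear; the table gives only the endpoints, so the shape of each phase is an assumption, and `learning_rate` is a hypothetical helper.

```python
PEAK_LR = 1.6e-3      # reached at the end of warmup (from the table)
FINAL_LR = 1.6e-4     # reached at the last step (from the table)
WARMUP_STEPS = 400
TOTAL_STEPS = 7086


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then an assumed linear decay to FINAL_LR."""
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold FINAL_LR if called past the last step
    return PEAK_LR + (FINAL_LR - PEAK_LR) * progress


for step in (0, 200, 400, 7086):
    print(step, learning_rate(step))
```

Other decay shapes, such as cosine, would fit the same endpoints; only the interpolation line in the second phase would change.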

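Evaluation

We carried out a comprehensive evaluation of Jais-chat and benchmarked it against other leading base language models, focusing on English and Arabic. The evaluation covers several dimensions, including:

Knowledge: how well the model answers factual questions.
Reasoning: how well the model answers questions that require reasoning.
Misinformation/bias: the model's tendency to generate false or misleading information, and its neutrality.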

Arabic evaluation results

Model	Average	EXAMS	MMLU (M)	LitQA	Hellaswag	PIQA	BoolQA	SituatedQA	ARC-C	OpenBookQA	TruthfulQA	CrowS-Pairs
Jais-chat (30B)	51.7	42.7	34.7	62.3	63.6	69.2	80.9	51.1	42.7	32	49.8	56.5
Jais-chat (13B)	48.4	39.7	34.0	52.6	61.4	67.5	65.7	47.0	40.7	31.6	44.8	56.4
acegpt-13b-chat	44.72	38.6	31.2	42.3	49.2	60.2	69.7	39.5	35.1	35.4	48.2	55.9
BLOOMz (7.1B)	42.9	34.9	31.0	44.0	38.1	59.1	66.6	42.8	30.2	29.2	48.4	55.8
acegpt-7b-chat	42.23	37	29.6	39.4	46.1	58.9	55	38.8	33.1	34.6	50.1	54.4
mT0-XXL (13B)	40.9	31.5	31.2	36.6	33.9	56.1	77.8	44.7	26.1	27.8	44.5	45.3
LLaMA2-Chat (13B)	38.1	26.3	29.1	33.1	32.0	52.1	66.0	36.3	24.1	28.4	48.6	47.2
falcon-40b_instruct	37.33	26.2	28.6	30.3	32.1	51.5	63.4	36.7	26.4	27.2	49.3	47.4
llama-30b_instruct	37.03	29	28.9	29.7	33.9	53.3	55.6	35.9	26.9	29	48.4	44.2

English evaluation results

Model	Average	MMLU	RACE	Hellaswag	PIQA	BoolQA	SituatedQA	ARC-C	OpenBookQA	Winogrande	TruthfulQA	CrowS-Pairs
Jais-30b-chat-v1	59.23	40.4	43.3	78.9	78.9	79.7	55.6	51.1	42.4	70.6	42.3	68.3
Jais-13b-chat	57.45	37.7	40.8	77.6	78.2	75.8	57.8	46.8	41	68.6	39.7	68
llama-30b_instruct	60.49	38.3	47.2	81.2	80.7	87.8	49	49.3	44.6	74.7	56.1	56.5
falcon-40b_instruct	63.35	41.9	44.5	82.3	83.1	86.3	49.8	54.4	49.4	77.8	52.6	74.7

All tasks above report accuracy or F1 score (higher is better).
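
As a quick arithmetic check on the English table, the average column appears to be the unweighted mean of the eleven per-task scores. This is an observation from the numbers, not something the source states; `mean_score` is a hypothetical helper and the scores are copied from the two Jais rows.

```python
# Per-task English scores copied from the table above (MMLU through CrowS-Pairs).
english_scores = {
    "Jais-30b-chat-v1": [40.4, 43.3, 78.9, 78.9, 79.7, 55.6, 51.1, 42.4, 70.6, 42.3, 68.3],
    "Jais-13b-chat": [37.7, 40.8, 77.6, 78.2, 75.8, 57.8, 46.8, 41, 68.6, 39.7, 68],
}


def mean_score(scores):
    """Unweighted mean over tasks, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


for name, scores in english_scores.items():
    print(name, mean_score(scores))
# Jais-30b-chat-v1 59.23
# Jais-13b-chat 57.45
```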

Citation

@misc{sengupta2023jais,
      title={Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models}, 
      author={Neha Sengupta and Sunil Kumar Sahu and Bokang Jia and Satheesh Katipomu and Haonan Li and Fajri Koto and Osama Mohammed Afzal and Samta Kamboj and Onkar Pandit and Rahul Pal and Lalit Pradhan and Zain Muhammad Mujahid and Massa Baali and Alham Fikri Aji and Zhengzhong Liu and Andy Hock and Andrew Feldman and Jonathan Lee and Andrew Jackson and Preslav Nakov and Timothy Baldwin and Eric Xing},
      year={2023},
      eprint={2308.16149},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}