AI21 Jamba 1.5 Large: an open foundation model for efficient long-context processing across business use cases

AI21 Jamba Large 1.5

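Developed by ai21labs

AI21 Jamba 1.5 is a family of foundation models that combine strong long-context handling with fast inference and suit a range of business use cases.

Large language model

Safetensors

Multilingual. License: Other. Tags: #long-context #multilingual #efficient-inference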

Released: 8/19/2024

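Model overview

The AI21 Jamba 1.5 models are hybrid SSM-Transformer, instruction-following foundation models. They support multiple languages and are suited to business use cases such as function calling and structured output.

Model features

Advanced architecture

A hybrid SSM-Transformer architecture that combines the strengths of state space models and Transformers.

Efficient inference

Up to 2.5x faster inference than leading models of comparable size, with long-context support.

Multilingual support

Supports 9 languages, including English, Spanish, and French.

Business optimization

Optimized for business use cases such as function calling, structured output (JSON), and grounded generation.

Flexible deployment

Deployable on a single node with 8 80GB GPUs, using the ExpertsInt8 quantization technique.

Model capabilities

Text generation

Function calling

Structured output

Multilingual processing

Long-context understanding

Use cases

Business applications

Function calling

Calls external functions from within business workflows to automate tasks.

Structured output

Generates structured JSON output that is easy to integrate with business systems.

Multilingual processing

Multilingual text generation

Generates text in multiple languages for international business scenarios.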

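🚀 AI21 Jamba 1.5 models

AI21 Jamba 1.5 is a family of foundation models with strong long-context handling and fast inference, suited to business use cases such as function calling and structured output. The models perform well across multiple benchmarks and support multiple languages.

🚀 Quick start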

Note: this version is scheduled to be deprecated on May 6, 2024. We recommend transitioning to the newer version.

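✨ Key features

Advanced architecture: the AI21 Jamba 1.5 models are hybrid SSM-Transformer, instruction-following foundation models.
Efficient inference: positioned as the most powerful and efficient long-context models on the market, with up to 2.5x faster inference than leading models of comparable size.
Multilingual support: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew.
Business optimization: optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.
Flexible deployment: ExpertsInt8, a quantization technique developed for MoE models deployed in vLLM, allows Jamba 1.5 Large to run on a single node with 8 80GB GPUs.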

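📦 Installation

Running the optimized Mamba implementation

To run the optimized Mamba implementation, first install mamba-ssm and causal-conv1d:

pip install mamba-ssm "causal-conv1d>=1.2.0"

The model must also be on a CUDA device.

Installing vLLM

vLLM is the recommended way to run efficient inference with Jamba 1.5 Large. First, make sure vLLM is installed (version 0.5.5 or later is required):

pip install "vllm>=0.5.5"

Installing other dependencies

Fine-tuning Jamba 1.5 Large with the HF framework + axolotl and FSDP requires the following dependencies: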

git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl
pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'

pip install bitsandbytes~=0.43.3
pip install trl
pip install peft~=0.12.0
pip install accelerate~=0.33.0
pip install mamba-ssm "causal-conv1d>=1.2.0"
pip install git+https://github.com/xgal/transformers@897f80665c37c531b7803f92655db

💻 Usage examples

Basic usage

Inference with vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-1.5-Large"

llm = LLM(model=model,
          tensor_parallel_size=8,
          max_model_len=220*1024,
          quantization="experts_int8",
         )

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

Loading the model with `transformers`

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 8-bit quantization via bitsandbytes; modules matching "mamba" are left unquantized
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])

# a device map to distribute the model evenly across 8 GPUs
device_map = {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 1, 'model.layers.10': 1, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 2, 'model.layers.19': 2, 'model.layers.20': 2, 'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2, 'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 3, 'model.layers.28': 3, 'model.layers.29': 3, 'model.layers.30': 3, 'model.layers.31': 3, 'model.layers.32': 3, 'model.layers.33': 3, 'model.layers.34': 3, 'model.layers.35': 3, 'model.layers.36': 4, 'model.layers.37': 4, 'model.layers.38': 4, 'model.layers.39': 4, 'model.layers.40': 4, 'model.layers.41': 4, 'model.layers.42': 4, 'model.layers.43': 4, 'model.layers.44': 4, 'model.layers.45': 5, 'model.layers.46': 5, 'model.layers.47': 5, 'model.layers.48': 5, 'model.layers.49': 5, 'model.layers.50': 5, 'model.layers.51': 5, 'model.layers.52': 5, 'model.layers.53': 5, 'model.layers.54': 6, 'model.layers.55': 6, 'model.layers.56': 6, 'model.layers.57': 6, 'model.layers.58': 6, 'model.layers.59': 6, 'model.layers.60': 6, 'model.layers.61': 6, 'model.layers.62': 6, 'model.layers.63': 7, 'model.layers.64': 7, 'model.layers.65': 7, 'model.layers.66': 7, 'model.layers.67': 7, 'model.layers.68': 7, 'model.layers.69': 7, 'model.layers.70': 7, 'model.layers.71': 7, 'model.final_layernorm': 7, 'lm_head': 7}
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config,
                                             device_map=device_map)

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)

outputs = model.generate(input_ids, max_new_tokens=216)

# Decode the output
conversation = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split the conversation to get only the assistant's response
assistant_response = conversation.split(messages[-1]['content'])[1].strip()
print(assistant_response)
# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?
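
The hand-written device map above follows a regular pattern: 72 decoder layers split evenly, 9 per GPU, with the embeddings on GPU 0 and the final layer norm and LM head on GPU 7. As a minimal sketch (the helper name `build_device_map` is ours, not part of `transformers`), the same dictionary could be generated rather than typed out:

```python
def build_device_map(num_layers: int = 72, num_gpus: int = 8) -> dict:
    """Spread decoder layers evenly over GPUs, mirroring the literal map above."""
    if num_layers % num_gpus != 0:
        raise ValueError("this sketch assumes the layers divide evenly across GPUs")
    layers_per_gpu = num_layers // num_gpus
    device_map = {"model.embed_tokens": 0}  # embeddings sit with the first layers
    for layer in range(num_layers):
        device_map[f"model.layers.{layer}"] = layer // layers_per_gpu
    # the final norm and the LM head sit with the last layers
    device_map["model.final_layernorm"] = num_gpus - 1
    device_map["lm_head"] = num_gpus - 1
    return device_map


device_map = build_device_map()
```

The result can be passed as `device_map=device_map` to `from_pretrained`, exactly like the literal dictionary.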

Advanced usage

Loading the model on CPU

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large",
                                             use_mamba_kernels=False)

Tool use example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
    {
        "role": "user", 
        "content": "What's the weather like right now in Jerusalem and in London?"
    }
]

tools = [
    {
        'type': 'function', 
        'function': {
            'name': 'get_current_weather', 
            'description': 'Get the current weather', 
            'parameters': {
                'type': 'object', 
                'properties': {
                    'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 
                    'format': {'type': 'string', 'enum': ['celsius', 'fahrenheit'], 'description': 'The temperature unit to use. Infer this from the users location.'}
                }, 
                'required': ['location', 'format']
            }
        }
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)

Feeding tool responses back to the model

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

# `tools` is the list defined in the tool use example above.
# Note that you must send the tool responses in the same order as the model called the tools:
messages = [
    {
        "role": "user",
        "content": "What's the weather like right now in Jerusalem and in London?"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"
            },
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"
            }
        ]
    },
    {
        "role": "tool",
        "content": "The weather in Jerusalem is 18 degrees celsius."
    },
    {
        "role": "tool",
        "content": "The weather in London is 8 degrees celsius."
    }
]

tool_use_prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)
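
The ordering requirement noted above is easiest to satisfy by building the tool messages in a single pass over the model's tool calls. The sketch below assumes the tool calls have already been extracted into the `name`/`arguments` shape shown in the assistant message. `get_current_weather` is a hypothetical stub with canned values, and `TOOL_REGISTRY` and `run_tool_calls` are illustrative names rather than part of any library:

```python
import json


def get_current_weather(location: str, format: str) -> str:
    """Hypothetical stub; a real tool would query a weather service."""
    canned = {"Jerusalem": 18, "London": 8}
    return f"The weather in {location} is {canned[location]} degrees {format}."


# Maps tool names (as declared in `tools`) to local callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls: list) -> list:
    """Execute each call and return "tool" messages in the same order."""
    tool_messages = []
    for call in tool_calls:
        arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
        result = TOOL_REGISTRY[call["name"]](**arguments)
        tool_messages.append({"role": "tool", "content": result})
    return tool_messages


tool_calls = [
    {"name": "get_current_weather",
     "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"},
    {"name": "get_current_weather",
     "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"},
]
tool_messages = run_tool_calls(tool_calls)
```

Appending `tool_messages` after the assistant message reproduces the conversation layout used in the example above.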

Attaching documents to the prompt

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
        {
            "role": "user",
            "content": "Who wrote Harry Potter?"
        }
]

documents = [
        {
            "text": "Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling.",
            "title": "Harry Potter"
        },
        {
            "text": "The Great Gatsby is a novel by American writer F. Scott Fitzgerald.",
            "title": "The Great Gatsby",
            "country": "United States",
            "genre": "Novel"

        }
]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    tokenize=False,
)

# Output: J. K. Rowling

Using JSON mode

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")
messages = [
    {'role':'user', 
     'content':'Describe the first American president. Include year of birth (number) and name (string).'}
    ]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=False,
                                       knobs={"response_format": "json_object", "is_set": True})

#Output: "{ "year of birth": 1732, "name": "George Washington." }"
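
JSON mode requests a JSON object, but the calling code may still want to confirm that the fields it asked for are present and correctly typed before using them. A minimal sketch, assuming the key names from the sample output above (`parse_president` is an illustrative helper, not part of any library):

```python
import json


def parse_president(raw: str) -> dict:
    """Parse a JSON-mode response and check the two fields the prompt asked for."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("year of birth"), int):
        raise ValueError("'year of birth' should be a number")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' should be a string")
    return data


sample = '{ "year of birth": 1732, "name": "George Washington." }'
record = parse_president(sample)
```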

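📚 Detailed documentation

Model details

Attribute	Details
Developer	AI21
Model type	Joint Attention and Mamba (Jamba)
License	Jamba Open Model License
Context length	256K
Knowledge cutoff date	March 5, 2024
Supported languages	English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew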

Common benchmark results

General benchmarks

Benchmark	Jamba 1.5 Mini	Jamba 1.5 Large
Arena Hard	46.1	65.4
Wild Bench	42.4	48.5
MMLU (CoT)	69.7	81.2
MMLU Pro (CoT)	42.5	53.5
GPQA	32.3	36.9
ARC Challenge	85.7	93
BFCL	80.6	85.5
GSM-8K	75.8	87
RealToxicity (lower is better)	8.1	6.7
TruthfulQA	54.1	58.3

RULER benchmark: effective context length

Model	Claimed length	Effective length	4K	8K	16K	32K	64K	128K	256K
Jamba 1.5 Large (94B/398B)	256K	256K	96.7	96.6	96.4	96.0	95.4	95.1	93.9
Jamba 1.5 Mini (12B/52B)	256K	256K	95.7	95.2	94.7	93.8	92.7	89.8	86.1
Gemini 1.5 Pro	1M	>128K	96.7	95.8	96.0	95.9	95.9	94.4	--
GPT-4 1106-preview	128K	64K	96.6	96.3	95.2	93.2	87.0	81.2	--
Llama 3.1 70B	128K	64K	96.5	95.8	95.4	94.8	88.4	66.6	--
Command R-plus (104B)	128K	32K	95.6	95.2	94.2	92.0	84.3	63.1	--
Llama 3.1 8B	128K	32K	95.5	93.8	91.6	87.4	84.7	77.0	--
Mistral Large 2 (123B)	128K	32K	96.2	96.1	95.1	93.0	78.8	23.7	--
Mixtral 8x22B (39B/141B)	64K	32K	95.6	94.9	93.4	90.9	84.7	31.7	--
Mixtral 8x7B (12.9B/46.7B)	32K	32K	94.9	92.1	92.5	85.9	72.4	44.5	--

Multilingual MMLU

Language	Jamba 1.5 Large	Jamba 1.5 Mini
French	75.8	65.9
Spanish	75.5	66.3
Portuguese	75.5	66.7
Italian	75.2	65.1
Dutch	74.6	65.0
German	73.9	63.8
Arabic	67.1	57.3

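Notes

⚠️ Important

transformers versions 4.44.0 and 4.44.1 have a bug that limits the ability to run the Jamba architecture. Make sure you are not using either of these versions.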
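
One way to guard against the affected releases is to check the installed version before loading the model. The following is a sketch using only the standard library; the function names are illustrative, and the set of affected versions is taken directly from the note above:

```python
from importlib import metadata

# Releases the note above identifies as limiting the Jamba architecture.
AFFECTED_VERSIONS = {"4.44.0", "4.44.1"}


def is_affected(version: str) -> bool:
    """Return True if `version` is one of the affected transformers releases."""
    return version.strip() in AFFECTED_VERSIONS


def check_installed_transformers() -> None:
    """Raise if the installed transformers release is one of the affected ones."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return  # transformers is not installed, so there is nothing to check
    if is_affected(installed):
        raise RuntimeError(
            f"transformers {installed} is affected by the Jamba bug; install a different release"
        )
```

Calling `check_installed_transformers()` at the top of a loading script fails fast instead of surfacing the bug later during model execution.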

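⚠️ Important

If you have trouble installing mamba-ssm and causal-conv1d for the optimized Mamba kernels, you can run Jamba 1.5 Large without them, at the cost of extra latency. To do so, pass the keyword argument use_mamba_kernels=False when loading the model with AutoModelForCausalLM.from_pretrained().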
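
The fallback described above can be chosen automatically by testing whether both kernel packages are importable. This is a sketch under the assumption that the pip packages expose the import names `mamba_ssm` and `causal_conv1d`; it checks package presence only and does not verify that a CUDA device is available:

```python
import importlib.util


def mamba_kernels_installed() -> bool:
    """True only if both optional kernel packages can be located."""
    required = ("mamba_ssm", "causal_conv1d")  # assumed import names of the pip packages
    return all(importlib.util.find_spec(name) is not None for name in required)


# Keyword arguments intended for AutoModelForCausalLM.from_pretrained(...)
load_kwargs = {"use_mamba_kernels": mamba_kernels_installed()}
```

The dictionary can then be unpacked into the loading call, for example `from_pretrained("ai21labs/AI21-Jamba-1.5-Large", **load_kwargs)`.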

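🔧 Technical details

Fine-tuning

This section focuses on fine-tuning Jamba 1.5 Large on a single 8xA100/H100 (80GB GPU) node using the HF framework + axolotl and FSDP.

Because the latest version of transformers overuses CPU RAM when running with FSDP, we use a modified version. Specifically, the stock version loads the full model into CPU memory for every rank rather than for rank 0 only, which sharply increases CPU RAM usage: Jamba 1.5 Large needs more than 1.6TB instead of the 200GB actually required. Special thanks to Wing Lian and the axolotl team for their contribution!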
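
The two figures are consistent with simple arithmetic: one full CPU copy per rank on an 8-GPU node versus a single copy on rank 0. A rough check, assuming one process per GPU and the 200GB-per-copy figure quoted above:

```python
ranks_per_node = 8      # assumption: one FSDP process per GPU on an 8-GPU node
ram_per_copy_gb = 200   # CPU RAM for one full copy of the model, per the text

ram_all_ranks_gb = ranks_per_node * ram_per_copy_gb  # every rank loads a copy
ram_rank0_only_gb = ram_per_copy_gb                  # only rank 0 loads a copy

print(f"all ranks: {ram_all_ranks_gb / 1000:.1f} TB, rank 0 only: {ram_rank0_only_gb} GB")
# prints "all ranks: 1.6 TB, rank 0 only: 200 GB"
```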

Make sure you install a recent version of axolotl (dated August 21, 2024 or later) or use the docker image the axolotl team provides.

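Quantization

Jamba 1.5 Large is too large to load on a single node with 8 80GB GPUs in full precision (FP32) or half precision (FP16/BF16), so quantization is required. We developed ExpertsInt8, an efficient quantization technique designed for MoE models deployed in vLLM, including Jamba models. With it, Jamba 1.5 Large can be deployed on a single node with 8 80GB GPUs.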
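
A back-of-the-envelope estimate shows why. The sketch below takes the 398B total parameter count from the RULER table and counts weight storage only. Treating every weight as one byte under INT8 is a simplifying assumption, and KV cache and activation memory are ignored:

```python
total_params = 398e9     # total parameters (94B active / 398B total in the RULER table)
gpu_memory_gb = 8 * 80   # one node with 8 x 80GB GPUs


def weights_gb(bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


for name, nbytes in [("FP32", 4), ("BF16/FP16", 2), ("INT8", 1)]:
    size = weights_gb(nbytes)
    verdict = "fits" if size < gpu_memory_gb else "does not fit"
    print(f"{name}: ~{size:.0f} GB of weights, {verdict} in {gpu_memory_gb} GB")
```

Under these assumptions, half precision needs about 796 GB against 640 GB available, while one byte per weight needs about 398 GB and leaves headroom for the cache.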