AI21-Jamba-Large-1.5 Open Foundation Model: Efficient Long-Content Processing for a Range of Business Scenarios

AI21 Jamba Large 1.5

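Developed by ai21labs

AI21 Jamba 1.5 is a family of advanced foundation models with strong long-context handling and fast inference, suited to a range of business scenarios.

Large Language Model

Safetensors

Multilingual support | License: Other | #long-context-processing #multilingual-support #efficient-inference

Downloads: 2,642

Released: 8/19/2024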

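Model Overview

The AI21 Jamba 1.5 family consists of state-of-the-art hybrid SSM-Transformer instruction-following foundation models. They support multiple languages and are suited to business use cases such as function calling and structured output.

Model Features

Advanced architecture

A hybrid SSM-Transformer architecture that combines the strengths of state space models (SSMs) and Transformers.

Efficient inference

Inference is up to 2.5x faster than leading models of comparable size, with support for long contexts.

Multilingual support

Supports 9 languages, including English, Spanish, and French.

Business optimization

Optimized for business use cases such as function calling, structured output (JSON), and grounded generation.

Flexible deployment

Can be deployed on a single node with eight 80GB GPUs, using the efficient ExpertsInt8 quantization technique.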

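Model Capabilities

Text generation

Function calling

Structured output

Multilingual processing

Long-context understanding

Use Cases

Business applications

Function calling

Calls external functions in business workflows to automate tasks.

Structured output

Generates structured JSON output that is easy to integrate into business systems.

Multilingual processing

Multilingual text generation

Generates text in multiple languages, suited to international business scenarios.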

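🚀 AI21 Jamba 1.5 Models

AI21 Jamba 1.5 is a family of advanced foundation models with strong long-context handling and fast inference. The models are suited to business scenarios such as function calling and structured output, perform well on several benchmarks, and support multiple languages.

🚀 Quick Start

Please note that this version will be deprecated on May 6, 2024. We recommend transitioning to the newer version.

✨ Key Features

Advanced architecture: the AI21 Jamba 1.5 models are state-of-the-art hybrid SSM-Transformer instruction-following foundation models.
Efficient inference: AI21 describes them as the most powerful and efficient long-context models on the market, with inference up to 2.5x faster than leading models of comparable size.
Multilingual support: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew.
Business optimization: optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.
Flexible deployment: AI21 developed ExpertsInt8, a novel and efficient quantization technique for MoE models deployed in vLLM, which allows Jamba 1.5 Large to be deployed on a single node with eight 80GB GPUs.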

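📦 Installation

Running the optimized Mamba implementation

To run the optimized Mamba implementation, first install mamba-ssm and causal-conv1d (the version specifier is quoted so the shell does not treat `>` as a redirect):

pip install mamba-ssm "causal-conv1d>=1.2.0"

The model must also be on a CUDA device.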
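A loader script can check up front whether the optimized path is usable. The sketch below is an illustration only: the helper name `can_use_mamba_kernels` is hypothetical, and it assumes the two packages are importable under the module names `mamba_ssm` and `causal_conv1d`. It reports whether both packages can be found and whether PyTorch sees a CUDA device.

```python
import importlib.util


def can_use_mamba_kernels() -> bool:
    """Return True if the optimized Mamba kernels look usable (hypothetical helper)."""
    # Both kernel packages must be importable (module names are assumed here).
    for module_name in ("mamba_ssm", "causal_conv1d"):
        if importlib.util.find_spec(module_name) is None:
            return False
    # The optimized implementation also needs a CUDA device.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


print("optimized Mamba kernels usable:", can_use_mamba_kernels())
```

The result could be passed as the `use_mamba_kernels` keyword argument when loading the model with `AutoModelForCausalLM.from_pretrained()`.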

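Installing vLLM

vLLM is the recommended way to run efficient inference with Jamba 1.5 Large. First, make sure vLLM is installed (version 0.5.5 or later is required):

pip install "vllm>=0.5.5"

Installing other dependencies

When fine-tuning Jamba 1.5 Large with the HF framework, axolotl, and FSDP, install the following dependencies: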

git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl
pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'

pip install bitsandbytes~=0.43.3
pip install trl
pip install peft~=0.12.0
pip install accelerate~=0.33.0
pip install mamba-ssm "causal-conv1d>=1.2.0"
pip install git+https://github.com/xgal/transformers@897f80665c37c531b7803f92655db

💻 Usage Examples

Basic Usage

Inference with vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-1.5-Large"

llm = LLM(model=model,
          tensor_parallel_size=8,
          max_model_len=220*1024,
          quantization="experts_int8",
         )

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

Loading the model with `transformers`

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize to 8-bit, leaving the Mamba modules unquantized
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])

# a device map to distribute the model evenly across 8 GPUs
device_map = {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 1, 'model.layers.10': 1, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 2, 'model.layers.19': 2, 'model.layers.20': 2, 'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2, 'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 3, 'model.layers.28': 3, 'model.layers.29': 3, 'model.layers.30': 3, 'model.layers.31': 3, 'model.layers.32': 3, 'model.layers.33': 3, 'model.layers.34': 3, 'model.layers.35': 3, 'model.layers.36': 4, 'model.layers.37': 4, 'model.layers.38': 4, 'model.layers.39': 4, 'model.layers.40': 4, 'model.layers.41': 4, 'model.layers.42': 4, 'model.layers.43': 4, 'model.layers.44': 4, 'model.layers.45': 5, 'model.layers.46': 5, 'model.layers.47': 5, 'model.layers.48': 5, 'model.layers.49': 5, 'model.layers.50': 5, 'model.layers.51': 5, 'model.layers.52': 5, 'model.layers.53': 5, 'model.layers.54': 6, 'model.layers.55': 6, 'model.layers.56': 6, 'model.layers.57': 6, 'model.layers.58': 6, 'model.layers.59': 6, 'model.layers.60': 6, 'model.layers.61': 6, 'model.layers.62': 6, 'model.layers.63': 7, 'model.layers.64': 7, 'model.layers.65': 7, 'model.layers.66': 7, 'model.layers.67': 7, 'model.layers.68': 7, 'model.layers.69': 7, 'model.layers.70': 7, 'model.layers.71': 7, 'model.final_layernorm': 7, 'lm_head': 7}
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config,
                                             device_map=device_map)

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)

outputs = model.generate(input_ids, max_new_tokens=216)

# Decode the output
conversation = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split the conversation to get only the assistant's response
assistant_response = conversation.split(messages[-1]['content'])[1].strip()
print(assistant_response)
# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?
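The hand-written device map above follows a regular pattern: 72 decoder layers split into 8 consecutive groups of 9, with the embeddings on GPU 0 and the final layer norm and LM head on GPU 7. A minimal sketch that builds the same dictionary programmatically (the function name `build_device_map` is made up for illustration):

```python
def build_device_map(num_layers: int = 72, num_gpus: int = 8) -> dict:
    """Spread decoder layers over GPUs in consecutive groups (illustrative helper)."""
    layers_per_gpu = num_layers // num_gpus  # 72 // 8 = 9
    device_map = {"model.embed_tokens": 0}
    for layer in range(num_layers):
        # Clamp so leftover layers land on the last GPU if the split is uneven.
        device_map[f"model.layers.{layer}"] = min(layer // layers_per_gpu, num_gpus - 1)
    # The final norm and the output head sit with the last layers.
    device_map["model.final_layernorm"] = num_gpus - 1
    device_map["lm_head"] = num_gpus - 1
    return device_map


device_map = build_device_map()
print(device_map["model.layers.8"], device_map["model.layers.9"])  # 0 1
```

The resulting dictionary can be passed as `device_map=` in place of the literal.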

Advanced Usage

Loading the model on CPU (without the optimized Mamba kernels)

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large",
                                             use_mamba_kernels=False)

Tool use example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
    {
        "role": "user", 
        "content": "What's the weather like right now in Jerusalem and in London?"
    }
]

tools = [
    {
        'type': 'function', 
        'function': {
            'name': 'get_current_weather', 
            'description': 'Get the current weather', 
            'parameters': {
                'type': 'object', 
                'properties': {
                    'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 
                    'format': {'type': 'string', 'enum': ['celsius', 'fahrenheit'], 'description': 'The temperature unit to use. Infer this from the users location.'}
                }, 
                'required': ['location', 'format']
            }
        }
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)

Feeding tool responses back to the model

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

# Note that you must send the tool responses in the same order as the model called the tools:
messages = [
    {
        "role": "user",
        "content": "What's the weather like right now in Jerusalem and in London?"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"
            },
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"
            }
        ]
    },
    {
        "role": "tool",
        "content": "The weather in Jerusalem is 18 degrees celsius."
    },
    {
        "role": "tool",
        "content": "The weather in London is 8 degrees celsius."
    }
]

# `tools` is the list defined in the tool use example above
tool_use_prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)
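To produce the `tool` messages above, an application has to run each call the assistant requested and append the results in the same order. The sketch below shows one way to do that for the message shape used in this example. `get_current_weather` is a stub with canned data and `run_tool_calls` is a hypothetical helper; neither is part of `transformers`.

```python
import json


def get_current_weather(location: str, format: str) -> str:
    """Stub tool with canned data; a real one would query a weather service."""
    canned = {"Jerusalem": 18, "London": 8}
    return f"The weather in {location} is {canned[location]} degrees {format}."


# Registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Run each tool call in order and return the matching "tool" messages."""
    tool_messages = []
    for call in assistant_message["tool_calls"]:
        arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
        result = TOOL_REGISTRY[call["name"]](**arguments)
        tool_messages.append({"role": "tool", "content": result})
    return tool_messages


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"name": "get_current_weather",
         "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"},
        {"name": "get_current_weather",
         "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"},
    ],
}

for message in run_tool_calls(assistant_message):
    print(message["content"])
```

Because the list is built by iterating over `tool_calls` in order, the responses line up with the calls as the note in the example requires.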

Attaching documents to the prompt

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
        {
            "role": "user",
            "content": "Who wrote Harry Potter?"
        }
]

documents = [
        {
            "text": "Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling.",
            "title": "Harry Potter"
        },
        {
            "text": "The Great Gatsby is a novel by American writer F. Scott Fitzgerald.",
            "title": "The Great Gatsby",
            "country": "United States",
            "genre": "Novel"

        }
]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    tokenize=False,
)

# Output: J. K. Rowling

Using JSON mode

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")
messages = [
    {'role':'user', 
     'content':'Describe the first American president. Include year of birth (number) and name (string).'}
    ]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=False,
                                       knobs={"response_format": "json_object", "is_set": True})

#Output: "{ "year of birth": 1732, "name": "George Washington." }"
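Since JSON mode is meant to yield machine-readable output, the response can be parsed and checked before it is used. A minimal sketch, assuming the model's reply is the JSON object shown in the comment above; `raw_response` is a stand-in for real model output and `parse_president` is a hypothetical helper.

```python
import json

# Stand-in for the text returned by the model in JSON mode.
raw_response = '{ "year of birth": 1732, "name": "George Washington." }'


def parse_president(raw: str) -> dict:
    """Parse the reply and check the field types requested in the prompt."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data.get("year of birth"), int):
        raise ValueError("'year of birth' should be a number")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' should be a string")
    return data


president = parse_president(raw_response)
print(president["name"], president["year of birth"])
```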

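📚 Detailed Documentation

Model Details

Property	Details
Developer	AI21
Model type	Joint Attention and Mamba (Jamba)
License	Jamba Open Model License
Context length	256K
Knowledge cutoff date	March 5, 2024
Supported languages	English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew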

Results on Common Benchmarks

General benchmarks

Benchmark	Jamba 1.5 Mini	Jamba 1.5 Large
Arena Hard	46.1	65.4
Wild Bench	42.4	48.5
MMLU (CoT)	69.7	81.2
MMLU Pro (CoT)	42.5	53.5
GPQA	32.3	36.9
ARC Challenge	85.7	93
BFCL	80.6	85.5
GSM-8K	75.8	87
RealToxicity (lower is better)	8.1	6.7
TruthfulQA	54.1	58.3

RULER benchmark: effective context length

Model	Claimed Length	Effective Length	4K	8K	16K	32K	64K	128K	256K
Jamba 1.5 Large (94B/398B)	256K	256K	96.7	96.6	96.4	96.0	95.4	95.1	93.9
Jamba 1.5 Mini (12B/52B)	256K	256K	95.7	95.2	94.7	93.8	92.7	89.8	86.1
Gemini 1.5 Pro	1M	>128K	96.7	95.8	96.0	95.9	95.9	94.4	--
GPT-4 1106-preview	128K	64K	96.6	96.3	95.2	93.2	87.0	81.2	--
Llama 3.1 70B	128K	64K	96.5	95.8	95.4	94.8	88.4	66.6	--
Command R-plus (104B)	128K	32K	95.6	95.2	94.2	92.0	84.3	63.1	--
Llama 3.1 8B	128K	32K	95.5	93.8	91.6	87.4	84.7	77.0	--
Mistral Large 2 (123B)	128K	32K	96.2	96.1	95.1	93.0	78.8	23.7	--
Mixtral 8x22B (39B/141B)	64K	32K	95.6	94.9	93.4	90.9	84.7	31.7	--
Mixtral 8x7B (12.9B/46.7B)	32K	32K	94.9	92.1	92.5	85.9	72.4	44.5	--
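The "effective length" column can be reproduced from the per-length scores. The sketch below assumes the RULER convention of counting a context length as effective while the score stays at or above a fixed threshold. The value 85.6 is an assumption drawn from the RULER benchmark's own description and is not stated in this document. Scores are copied from three rows of the table, and `effective_length` is a hypothetical helper.

```python
# Assumed RULER pass threshold (not stated in this document).
THRESHOLD = 85.6

LENGTHS = ["4K", "8K", "16K", "32K", "64K", "128K", "256K"]

# Scores copied from the table above; None marks lengths that were not evaluated.
SCORES = {
    "Jamba 1.5 Large": [96.7, 96.6, 96.4, 96.0, 95.4, 95.1, 93.9],
    "Llama 3.1 70B": [96.5, 95.8, 95.4, 94.8, 88.4, 66.6, None],
    "Mixtral 8x7B": [94.9, 92.1, 92.5, 85.9, 72.4, 44.5, None],
}


def effective_length(scores, threshold=THRESHOLD):
    """Return the longest length up to which every score meets the threshold."""
    best = None
    for length, score in zip(LENGTHS, scores):
        if score is None or score < threshold:
            break
        best = length
    return best


for model, scores in SCORES.items():
    print(model, effective_length(scores))
```

Under that assumed threshold, the three rows come out at 256K, 64K, and 32K, which agrees with the values listed in the table.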

Multilingual MMLU

Language	Jamba 1.5 Large	Jamba 1.5 Mini
French	75.8	65.9
Spanish	75.5	66.3
Portuguese	75.5	66.7
Italian	75.2	65.1
Dutch	74.6	65.0
German	73.9	63.8
Arabic	67.1	57.3

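Notes

⚠️ Important

Versions 4.44.0 and 4.44.1 of transformers have a bug that restricts the ability to run the Jamba architecture. Make sure you are not using either of these versions.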
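A quick environment check can catch the affected releases before a model load fails. The sketch below uses only the standard library; the helper name `is_affected` is hypothetical, and the set contains only the two versions named in the note above.

```python
from importlib.metadata import PackageNotFoundError, version

# Releases named in the note above as unable to run the Jamba architecture.
AFFECTED_VERSIONS = {"4.44.0", "4.44.1"}


def is_affected(version_string: str) -> bool:
    """Return True if this transformers version is one of the affected releases."""
    return version_string.strip() in AFFECTED_VERSIONS


try:
    installed = version("transformers")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed")
elif is_affected(installed):
    print(f"transformers {installed} is affected; install a different version")
else:
    print(f"transformers {installed} is not one of the affected releases")
```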

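⚠️ Important

If you have trouble installing mamba-ssm and causal-conv1d for the optimized Mamba kernels, you can run Jamba 1.5 Large without them, at the cost of extra latency. To do so, add the keyword argument use_mamba_kernels=False when loading the model via AutoModelForCausalLM.from_pretrained().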

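🔧 Technical Details

Fine-tuning

This section focuses on how to fine-tune Jamba 1.5 Large on a single 8xA100/H100 (80GB GPU) node using the HF framework with axolotl and FSDP.

Because the latest release of transformers uses excessive CPU RAM when running with FSDP, we use a modified version of it. Specifically, upstream transformers loads the full model into CPU memory for every rank rather than only for rank 0 (one full copy for each of the eight ranks), which greatly increases CPU RAM usage: Jamba 1.5 Large needs more than 1.6TB of memory instead of the required 200GB. Special thanks to Wing Lian and the axolotl team for their contribution!

Make sure to install the latest version of axolotl (August 21, 2024 or later), or use the Docker image they provide.

Quantization

Jamba 1.5 Large is too large to load in full precision (FP32) or half precision (FP16/BF16) on a single node with eight 80GB GPUs, so quantization is required. We developed ExpertsInt8, a novel and efficient quantization technique designed for MoE models deployed in vLLM, including the Jamba models. With it, Jamba 1.5 Large can be deployed on a single node with eight 80GB GPUs.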
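A rough back-of-the-envelope calculation illustrates why quantization is needed. It uses the 398B total parameter count listed in the RULER table and counts weights only, ignoring activations, the KV cache, and framework overhead. Treating every weight as one byte under INT8 is a simplifying assumption, so the real footprint under ExpertsInt8 will differ.

```python
TOTAL_PARAMS = 398e9          # total parameters of Jamba 1.5 Large
NODE_MEMORY_GB = 8 * 80       # one node with eight 80GB GPUs

BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    needed = weight_memory_gb(TOTAL_PARAMS, nbytes)
    verdict = "fits" if needed <= NODE_MEMORY_GB else "does not fit"
    print(f"{precision}: {needed:.0f} GB of weights, {verdict} in {NODE_MEMORY_GB} GB")
```

Under these assumptions, only the one-byte-per-weight case comes in under the node's 640 GB of GPU memory.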