AI21-Jamba-Mini-1.5 Open Model: A Practical Tool for Efficient Long-Text Processing and Fast Inference


AI21 Jamba Mini 1.5

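Developed by ai21labs

AI21 Jamba 1.5 Mini is an advanced hybrid SSM-Transformer instruction-following foundation model with efficient long-context processing and fast inference.

Large Language Model

Transformers

Supports multiple languages | License: Other | #256K long context #Hybrid SSM-Transformer architecture #Multilingual text generation

Downloads: 6,102

Released: 8/19/2024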

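Model Overview

Jamba 1.5 Mini is among the most capable and efficient long-context models on the market, with inference up to 2.5x faster than comparable leading models. It combines strong long-context handling, speed, and quality, and it is the first non-Transformer model to be successfully scaled to the quality and strength of market-leading models.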

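Model Features

Efficient long-context processing

Supports context lengths of up to 256K tokens, enough for very long text inputs.

Fast inference

Inference is up to 2.5x faster than comparable leading models.

Hybrid SSM-Transformer architecture

Combines the strengths of state space models (SSMs) and Transformers for efficient, capable modeling.

Multilingual support

Supports English, French, German, Dutch, Spanish, Portuguese, Italian, Arabic, and Hebrew.

Optimized for business use cases

Optimized for business use cases such as function calling, structured output (JSON), and grounded generation.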

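Model Capabilities

Text generation

Long-context processing

Multilingual text generation

Function calling

Structured output (JSON)

Grounded generation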

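Use Cases

Business applications

Function calling

Calls external functions in response to user requests to automate tasks.

Efficient, accurate function calls.

Structured output

Generates structured output in JSON for downstream programs to consume.

Well-formed output that is easy to parse.

Multilingual applications

Multilingual text generation

Handles text generation tasks in multiple languages.

High-quality multilingual output.

Long-text processing

Long-document summarization

Processes documents of up to 256K tokens and generates summaries.

Efficient, accurate summaries.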
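
Before sending a long document for summarization, it can help to check that the prompt plus the reply budget fits the context window. The following is a minimal sketch, assuming that "256K" means 256 × 1024 = 262,144 tokens (the code examples on this page size `max_model_len` in multiples of 1024) and that the token counts come from the model's own tokenizer. `fits_context` is a hypothetical helper, not part of any library.

```python
# Hypothetical helper: checks a token budget against the context window.
# Assumption: "256K" is read as 256 * 1024 tokens.
CONTEXT_LIMIT = 256 * 1024


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 limit: int = CONTEXT_LIMIT) -> bool:
    """Return True if the prompt plus the generation budget fits the window."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= limit


# A 250,000-token document with room for a 4,096-token summary.
print(fits_context(250_000, 4_096))  # True: 254,096 <= 262,144
# A 260,000-token document with the same summary budget.
print(fits_context(260_000, 4_096))  # False: 264,096 > 262,144
```

When the check fails, the usual options are to shorten the document, split it into parts, or lower the generation budget.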

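🚀 AI21 Jamba 1.5 Models

AI21 Jamba 1.5 is a family of advanced foundation models with efficient long-context processing and strong performance. The models perform well across multiple languages and tasks and suit business scenarios such as function calling and structured output.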

🚀 Quick Start

Note: this version will be deprecated on May 6, 2025. We recommend moving to the newer version of the model.

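✨ Key Features

Advanced architecture: the AI21 Jamba 1.5 models are state-of-the-art hybrid SSM-Transformer instruction-following foundation models.
Efficient inference: among the most powerful and efficient long-context models on the market, with inference up to 2.5x faster than comparable leading models.
Multilingual support: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew.
Business optimization: optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.
Flexible licensing: released under the Jamba Open Model License, which permits full research and commercial use under the license terms.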

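📦 Installation Guide

Running the optimized Mamba implementation

To run the optimized Mamba implementation, first install mamba-ssm and causal-conv1d:

pip install mamba-ssm "causal-conv1d>=1.2.0"

The model must also be placed on a CUDA device.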
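
Because the optimized kernels need both packages and a CUDA device, a loader can test for them up front and fall back to the unoptimized path otherwise. The following is a rough sketch: `mamba_kernels_available` is a hypothetical helper, and `mamba_ssm` and `causal_conv1d` are assumed to be the import names of the two packages. The `use_mamba_kernels` argument is the one used in the CPU loading example further down this page.

```python
import importlib.util


def mamba_kernels_available() -> bool:
    """True only if both kernel packages, torch, and a CUDA device are present."""
    for package in ("mamba_ssm", "causal_conv1d", "torch"):
        if importlib.util.find_spec(package) is None:
            return False
    import torch  # imported lazily so the check also runs without torch installed
    return torch.cuda.is_available()


# Extra keyword arguments to pass to AutoModelForCausalLM.from_pretrained(...)
load_kwargs = {"use_mamba_kernels": mamba_kernels_available()}
print(load_kwargs)
```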

Installing vLLM

Efficient inference with vLLM requires vLLM version 0.5.4 or later:

pip install "vllm>=0.5.4"

Using ExpertsInt8 quantization

ExpertsInt8 quantization requires vLLM version 0.5.5 or later:

pip install "vllm>=0.5.5"

Installing `transformers`

When using the transformers library, avoid versions 4.44.0 and 4.44.1, which contain a bug that limits the ability to run the Jamba architecture.
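
One way to guard against the two affected releases is a startup check on the installed version. The following is a minimal sketch using only the standard library. `is_supported` and `check_installed` are hypothetical helper names, and the set of broken versions contains only the two releases named above.

```python
from importlib import metadata

# The two releases named above as unable to run the Jamba architecture.
BROKEN_VERSIONS = {"4.44.0", "4.44.1"}


def is_supported(version: str) -> bool:
    """Return False for the transformers releases that break Jamba."""
    return version.strip() not in BROKEN_VERSIONS


def check_installed() -> None:
    """Raise if transformers is missing or is one of the broken releases."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        raise RuntimeError("transformers is not installed") from None
    if not is_supported(installed):
        raise RuntimeError(
            f"transformers {installed} cannot run Jamba; install another version"
        )


print(is_supported("4.44.2"))  # True
print(is_supported("4.44.1"))  # False
```

Calling `check_installed()` before loading the model turns a confusing runtime failure into an explicit error message.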

Installing `trl` for fine-tuning

Fine-tuning with SFTTrainer requires trl:

pip install trl

Installing `bitsandbytes` for 4-bit quantization

Fine-tuning with QLoRA requires bitsandbytes:

pip install bitsandbytes

💻 Usage Examples

Running the model with vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-1.5-Mini"
number_gpus = 2

llm = LLM(model=model,
          max_model_len=200*1024,
          tensor_parallel_size=number_gpus)

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100) 
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
#Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?

Running the model with ExpertsInt8 quantization

import os
os.environ['VLLM_FUSED_MOE_CHUNK_SIZE']='32768'    # Workaround for a bug in vLLM's fused_moe kernel

from vllm import LLM
llm = LLM(model="ai21labs/AI21-Jamba-1.5-Mini",
          max_model_len=100*1024,
          quantization="experts_int8")

Running the model with `transformers`

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)

outputs = model.generate(input_ids, max_new_tokens=216)

# Decode only the newly generated tokens (everything after the prompt),
# which avoids splitting the decoded text on the user's message
assistant_response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()
print(assistant_response)
# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?

Loading the model in 8-bit precision

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config)

Loading the model on CPU

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             use_mamba_kernels=False)

Tool use example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

messages = [
    {
        "role": "user", 
        "content": "What's the weather like right now in Jerusalem and in London?"
    }
]

tools = [
    {
        'type': 'function', 
        'function': {
            'name': 'get_current_weather', 
            'description': 'Get the current weather', 
            'parameters': {
                'type': 'object', 
                'properties': {
                    'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 
                    'format': {'type': 'string', 'enum': ['celsius', 'fahrenheit'], 'description': 'The temperature unit to use. Infer this from the users location.'}
                }, 
                'required': ['location', 'format']
            }
        }
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)

Feeding tool responses back to the model

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

# Note that you must send the tool responses in the same order as the model called the tools:
messages = [
    {
        "role": "user",
        "content": "What's the weather like right now in Jerusalem and in London?"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"
            },
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"
            }
        ]
    },
    {
        "role": "tool",
        "content": "The weather in Jerusalem is 18 degrees celsius."
    },
    {
        "role": "tool",
        "content": "The weather in London is 8 degrees celsius."
    }
]

# `tools` is the list defined in the tool use example above
tool_use_prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)
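
The step between the two examples above, running the requested tools and building the "tool" messages, can be sketched as follows. This assumes the tool calls have already been extracted from the model output into the name-plus-JSON-string form shown in the assistant message above. `get_current_weather` is a stub with hard-coded sample data, and `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names.

```python
import json


def get_current_weather(location: str, format: str) -> str:
    """Stub standing in for a real weather lookup (hard-coded sample data)."""
    sample = {"Jerusalem": 18, "London": 8}
    return f"The weather in {location} is {sample[location]} degrees {format}."


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls: list) -> list:
    """Execute the calls in order and return one "tool" message per call."""
    tool_messages = []
    for call in tool_calls:
        function = TOOL_REGISTRY[call["name"]]
        arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
        tool_messages.append({"role": "tool", "content": function(**arguments)})
    return tool_messages


tool_calls = [
    {"name": "get_current_weather",
     "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"},
    {"name": "get_current_weather",
     "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"},
]
tool_messages = run_tool_calls(tool_calls)
print(tool_messages[0]["content"])
# The weather in Jerusalem is 18 degrees celsius.
```

Iterating over the calls in the order the model produced them keeps the responses aligned with the calls, which is the ordering requirement noted in the example above.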

Attaching documents to a Jamba 1.5 prompt

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

messages = [
        {
            "role": "user",
            "content": "Who wrote Harry Potter?"
        }
]

documents = [
        {
            "text": "Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling.",
            "title": "Harry Potter"
        },
        {
            "text": "The Great Gatsby is a novel by American writer F. Scott Fitzgerald.",
            "title": "The Great Gatsby",
            "country": "United States",
            "genre": "Novel"

        }
]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    tokenize=False,
)

# Model output after generating from this prompt: J. K. Rowling
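
The document dicts above share one shape: a "text" field, a "title", and optional extra metadata keys such as "country" and "genre". A small helper can build them consistently. This is a sketch only: `make_document` is a hypothetical name, and treating "text" as the only required field is an assumption of this example.

```python
def make_document(text: str, title: str = "", **metadata: str) -> dict:
    """Build one document dict; empty text is rejected."""
    if not text.strip():
        raise ValueError("document text must not be empty")
    document = {"text": text}
    if title:
        document["title"] = title
    document.update(metadata)  # e.g. country="United States", genre="Novel"
    return document


documents = [
    make_document(
        "Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling.",
        title="Harry Potter",
    ),
    make_document(
        "The Great Gatsby is a novel by American writer F. Scott Fitzgerald.",
        title="The Great Gatsby",
        country="United States",
        genre="Novel",
    ),
]
print(sorted(documents[1].keys()))  # ['country', 'genre', 'text', 'title']
```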

Using JSON mode

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")
messages = [
    {'role':'user', 
     'content':'Describe the first American president. Include year of birth (number) and name (string).'}
    ]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=False,
                                       knobs={"response_format": "json_object", "is_set": True})

#Output: "{ "year of birth": 1732, "name": "George Washington." }"
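
A JSON reply is only useful if the calling code checks it before relying on it. The following is a minimal sketch of parsing and validating a reply like the one above with the standard `json` module. `parse_president` is a hypothetical helper, and the field names are taken from the sample output rather than from any fixed schema.

```python
import json


def parse_president(raw: str) -> dict:
    """Parse the reply and check the two fields requested in the prompt."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("year of birth"), int):
        raise ValueError('"year of birth" must be a number')
    if not isinstance(data.get("name"), str):
        raise ValueError('"name" must be a string')
    return data


reply = '{ "year of birth": 1732, "name": "George Washington." }'
president = parse_president(reply)
print(president["year of birth"])  # 1732
```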

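📚 Detailed Documentation

Model Details

Property	Details
Developer	AI21
Model type	Joint Attention and Mamba (Jamba)
License	Jamba Open Model License
Context length	256K
Knowledge cutoff date	March 5, 2024
Supported languages	English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew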

Common Benchmark Results

Benchmark	Jamba 1.5 Mini	Jamba 1.5 Large
Arena Hard	46.1	65.4
Wild Bench	42.4	48.5
MMLU (CoT)	69.7	81.2
MMLU Pro (CoT)	42.5	53.5
GPQA	32.3	36.9
ARC Challenge	85.7	93
BFCL	80.6	85.5
GSM-8K	75.8	87
RealToxicity (lower is better)	8.1	6.7
TruthfulQA	54.1	58.3

RULER Benchmark: Effective Context Length

Model	Claimed length	Effective length	4K	8K	16K	32K	64K	128K	256K
Jamba 1.5 Large (94B/398B)	256K	256K	96.7	96.6	96.4	96.0	95.4	95.1	93.9
Jamba 1.5 Mini (12B/52B)	256K	256K	95.7	95.2	94.7	93.8	92.7	89.8	86.1
Gemini 1.5 Pro	1M	>128K	96.7	95.8	96.0	95.9	95.9	94.4	--
GPT-4 1106-preview	128K	64K	96.6	96.3	95.2	93.2	87.0	81.2	--
Llama 3.1 70B	128K	64K	96.5	95.8	95.4	94.8	88.4	66.6	--
Command R-plus (104B)	128K	32K	95.6	95.2	94.2	92.0	84.3	63.1	--
Llama 3.1 8B	128K	32K	95.5	93.8	91.6	87.4	84.7	77.0	--
Mistral Large 2 (123B)	128K	32K	96.2	96.1	95.1	93.0	78.8	23.7	--
Mixtral 8x22B (39B/141B)	64K	32K	95.6	94.9	93.4	90.9	84.7	31.7	--
Mixtral 8x7B (12.9B/46.7B)	32K	32K	94.9	92.1	92.5	85.9	72.4	44.5	--
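
The "Effective length" column can be reproduced from the per-length scores in each row. The following sketch assumes a pass threshold of 85.6, the value associated with the RULER benchmark. The threshold is not stated on this page, so treat it as an assumption; `effective_length` is a hypothetical helper.

```python
# Assumption: a pass threshold of 85.6 (not stated on this page).
THRESHOLD = 85.6
LENGTHS = ["4K", "8K", "16K", "32K", "64K", "128K", "256K"]


def effective_length(scores: list) -> str:
    """Longest tested length reached before the first failing or missing score."""
    longest = "<4K"
    for length, score in zip(LENGTHS, scores):
        if score is None or score < THRESHOLD:
            break
        longest = length
    return longest


# Rows copied from the table above; None marks a score that was not reported.
jamba_mini = [95.7, 95.2, 94.7, 93.8, 92.7, 89.8, 86.1]
mistral_large_2 = [96.2, 96.1, 95.1, 93.0, 78.8, 23.7, None]
print(effective_length(jamba_mini))       # 256K
print(effective_length(mistral_large_2))  # 32K
```

A missing score stops the scan, so for a row such as Gemini 1.5 Pro the function returns 128K, which is a lower bound consistent with the ">128K" entry in the table.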