AI21-Jamba-1.5-Mini Open-Source Model - A Practical Tool for Efficient Long-Text Processing and Fast Inference

AI21 Jamba 1.5 Mini

Developed by ai21labs

AI21 Jamba 1.5 Mini is an advanced hybrid SSM-Transformer instruction-following foundation model with efficient long-context processing capabilities and fast inference speed.

Large Language Model

Transformers

Supports Multiple Languages

Open Source License: Other

#256K long context #Hybrid SSM-Transformer architecture #Multilingual text generation

Downloads 6,102

Release Time : 8/19/2024

Model Overview

Jamba 1.5 Mini is one of the most powerful and efficient long-context models on the market, with inference up to 2.5x faster than leading comparable models. It combines strong long-context handling, speed, and quality, and is the first non-Transformer model to scale successfully to the quality and strength of market-leading models.

Model Features

Efficient long-context processing

Supports context lengths up to 256K, capable of handling ultra-long text inputs.

Fast inference speed

Inference speeds up to 2.5x faster than leading comparable models.

Hybrid SSM-Transformer architecture

Combines the strengths of SSM and Transformer to deliver efficient and powerful model performance.

Multilingual support

Supports English, French, German, Dutch, Spanish, Portuguese, Italian, Arabic, and Hebrew.

Optimized for business use cases

Optimized for business use cases such as function calling, structured output (JSON), and fact-based generation.

Model Capabilities

Text generation

Long-context processing

Multilingual text generation

Function calling

Structured output (JSON)

Fact-based generation

Use Cases

Business applications

Function calling

Supports calling external functions based on user requests to automate tasks.

Efficient and accurate function calling capability.

Structured output

Generates structured output in JSON format for easy program processing.

Standardized and easily parsable output format.

Multilingual applications

Multilingual text generation

Supports text generation tasks in multiple languages.

High-quality multilingual text output.

Long-text processing

Long document summarization

Processes long documents up to 256K tokens and generates summaries.

Efficient and accurate summarization capability.

🚀 AI21 Jamba 1.5 Model

AI21 Jamba 1.5 is a state-of-the-art, hybrid SSM-Transformer instruction-following foundation model, offering superior long-context handling, speed, and quality.

🚀 Quick Start

Please note that this version will be deprecated on May 6, 2025. We encourage you to transition to the new version, which can be found here.

✨ Features

State-of-the-art Performance: The AI21 Jamba 1.5 family of models is state-of-the-art, delivering up to 2.5X faster inference than leading models of comparable sizes.
Superior Long Context Handling: The models demonstrate superior long context handling, speed, and quality.
Optimized for Business Use Cases: Jamba 1.5 Mini (12B active/52B total) and Jamba 1.5 Large (94B active/398B total) are optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.
Permissive License: The models are released under the Jamba Open Model License, allowing full research use and commercial use under the license terms.

📦 Installation

Prerequisites

In order to run optimized Mamba implementations, you first need to install mamba-ssm and causal-conv1d:

pip install mamba-ssm "causal-conv1d>=1.2.0"

You also have to have the model on a CUDA device.

Install vLLM

The recommended way to perform efficient inference with Jamba 1.5 Mini is using vLLM. First, make sure to install vLLM (version 0.5.4 or higher is required):

pip install "vllm>=0.5.4"

💻 Usage Examples

Run the model with vLLM

In the example below, number_gpus should match the number of GPUs you want to deploy Jamba 1.5 Mini on. A minimum of 2 80GB GPUs is required.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-1.5-Mini"
number_gpus = 2

llm = LLM(model=model,
          max_model_len=200*1024,
          tensor_parallel_size=number_gpus)

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?

With the default BF16 precision on 2 80GB A100 GPUs and default vLLM configuration, you'll be able to perform inference on prompts up to 200K tokens long. On more than 2 80GB GPUs, you can easily fit the full 256K context.

Note: vLLM's main branch has some memory utilization improvements specific to the Jamba architecture that allow using the full 256K context length on 2 80GB GPUs. You can build vLLM from source if you wish to make use of them.
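The GPU counts quoted here can be sanity-checked with a rough weight-size estimate. The sketch below is only an approximation: it uses the 52B total parameter count listed for Jamba 1.5 Mini, assumes 2 bytes per parameter for BF16 and 1 byte for 8-bit weights, and ignores the KV cache, Mamba state, activations, and framework overhead, so real memory usage will be higher. The helper names are illustrative and not part of any library.

```python
import math

# Total parameter count of Jamba 1.5 Mini as listed on this page (12B active / 52B total).
TOTAL_PARAMS = 52e9
GPU_MEMORY_GB = 80  # capacity of one 80GB GPU

# Assumed storage cost per parameter; treating every weight as 8-bit is a simplification.
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def weight_gb(total_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(total_params: float, dtype: str, gpu_gb: float = GPU_MEMORY_GB) -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weight_gb(total_params, dtype) / gpu_gb)


for dtype in ("bf16", "int8"):
    print(dtype, weight_gb(TOTAL_PARAMS, dtype), "GB ->", min_gpus_for_weights(TOTAL_PARAMS, dtype), "GPU(s)")
```

Under these assumptions the BF16 weights alone come to about 104 GB, which is why a single 80GB GPU is not enough, while 8-bit weights come to about 52 GB and leave room for context on one GPU.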

ExpertsInt8 quantization

We've developed an innovative and efficient quantization technique, ExpertsInt8, designed for MoE models deployed in vLLM, including Jamba models. Using it, you'll be able to deploy Jamba 1.5 Mini on a single 80GB GPU.

In order to use ExpertsInt8, you need to use vLLM version 0.5.5 or higher: pip install "vllm>=0.5.5"

With default vLLM configuration, you can fit prompts up to 100K on a single 80GB A100 GPU:

import os
os.environ['VLLM_FUSED_MOE_CHUNK_SIZE']='32768'    # This is a workaround for a bug in vLLM's fused_moe kernel

from vllm import LLM
llm = LLM(model="ai21labs/AI21-Jamba-1.5-Mini",
          max_model_len=100*1024,
          quantization="experts_int8")
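When picking max_model_len, it may help to keep in mind that vLLM counts the prompt and the generated tokens against the same limit. The following is a minimal sketch of a pre-flight check under that assumption; fits_context is a hypothetical helper, not a vLLM function, and the token counts are made-up examples.

```python
# max_model_len used in the single-GPU ExpertsInt8 example above.
SINGLE_GPU_MAX_MODEL_LEN = 100 * 1024  # 102,400 tokens


def fits_context(prompt_tokens: int, max_new_tokens: int, max_model_len: int) -> bool:
    """Return True if the prompt plus the requested generation fits in the context window."""
    return prompt_tokens + max_new_tokens <= max_model_len


# Illustrative numbers only: a 100,000-token prompt with 100 new tokens fits,
# but a prompt that already fills the window leaves no room to generate.
print(fits_context(100_000, 100, SINGLE_GPU_MAX_MODEL_LEN))
print(fits_context(102_400, 1, SINGLE_GPU_MAX_MODEL_LEN))
```

In practice prompt_tokens would come from tokenizing the prompt with the model's own tokenizer before submitting the request.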

Run the model with transformers

The following example loads Jamba 1.5 Mini to the GPU in BF16 precision, uses optimized FlashAttention2 and Mamba kernels, and parallelizes the model across multiple GPUs using accelerate. Note that in half precision (FP16/BF16), Jamba 1.5 Mini is too large to fit on a single 80GB GPU, so you'll need at least 2 such GPUs.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto")

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)

outputs = model.generate(input_ids, max_new_tokens=216)

# Decode the output
conversation = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split the conversation to get only the assistant's response
assistant_response = conversation.split(messages[-1]['content'])[1].strip()
print(assistant_response)
# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?

Note: Versions 4.44.0 and 4.44.1 of transformers have a bug that restricts the ability to run the Jamba architecture. Make sure you're not using these versions.

Note: If you're having trouble installing mamba-ssm and causal-conv1d for the optimized Mamba kernels, you can run Jamba 1.5 Mini without them, at the cost of extra latency. In order to do that, add the kwarg use_mamba_kernels=False when loading the model via AutoModelForCausalLM.from_pretrained().
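The version caveat above can be checked before loading the model. This is a minimal sketch using only the standard library; the two affected versions come from the note above, while the helper names are illustrative.

```python
from importlib import metadata

# transformers releases named in the note above as unable to run the Jamba architecture.
AFFECTED_VERSIONS = {"4.44.0", "4.44.1"}


def is_affected(version: str) -> bool:
    """Return True if the given transformers version is one of the known-bad releases."""
    return version.strip() in AFFECTED_VERSIONS


def installed_transformers_version():
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is None:
    print("transformers is not installed")
elif is_affected(installed):
    print(f"transformers {installed} is affected; install a different version")
else:
    print(f"transformers {installed} is not one of the affected versions")
```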

Load the model in 8-bit

Using 8-bit precision, it is possible to fit up to 140K sequence length on a single 80GB GPU. You can easily quantize the model to 8-bit using bitsandbytes. To avoid degrading model quality, we recommend excluding the Mamba blocks from the quantization:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config)

Load the model on CPU

If you don't have access to a GPU, you can also load and run Jamba 1.5 Mini on a CPU. Note this will result in poor inference performance.

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini",
                                             use_mamba_kernels=False)

📚 Documentation

Model Details

Property	Details
Developed by	AI21
Model Type	Joint Attention and Mamba (Jamba)
License	Jamba Open Model License
Context length	256K
Knowledge cutoff date	March 5, 2024
Supported languages	English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic and Hebrew

Results on common benchmarks

Benchmark	Jamba 1.5 Mini	Jamba 1.5 Large
Arena Hard	46.1	65.4
Wild Bench	42.4	48.5
MMLU (CoT)	69.7	81.2
MMLU Pro (CoT)	42.5	53.5
GPQA	32.3	36.9
ARC Challenge	85.7	93
BFCL	80.6	85.5
GSM-8K	75.8	87
RealToxicity (lower is better)	8.1	6.7
TruthfulQA	54.1	58.3

RULER Benchmark - Effective context length

Models	Claimed Length	Effective Length	4K	8K	16K	32K	64K	128K	256K
Jamba 1.5 Large (94B/398B)	256K	256K	96.7	96.6	96.4	96.0	95.4	95.1	93.9
Jamba 1.5 Mini (12B/52B)	256K	256K	95.7	95.2	94.7	93.8	92.7	89.8	86.1
Gemini 1.5 Pro	1M	>128K	96.7	95.8	96.0	95.9	95.9	94.4	--
GPT-4 1106-preview	128K	64K	96.6	96.3	95.2	93.2	87.0	81.2	--
Llama 3.1 70B	128K	64K	96.5	95.8	95.4	94.8	88.4	66.6	--
Command R-plus (104B)	128K	32K	95.6	95.2	94.2	92.0	84.3	63.1	--
Llama 3.1 8B	128K	32K	95.5	93.8	91.6	87.4	84.7	77.0	--
Mistral Large 2 (123B)	128K	32K	96.2	96.1	95.1	93.0	78.8	23.7	--
Mixtral 8x22B (39B/141B)	64K	32K	95.6	94.9	93.4	90.9	84.7	31.7	--
Mixtral 8x7B (12.9B/46.7B)	32K	32K	94.9	92.1	92.5	85.9	72.4	44.5	--

Multilingual MMLU

Language	Jamba 1.5 Large	Jamba 1.5 Mini
French	75.8	65.9
Spanish	75.5	66.3
Portuguese	75.5	66.7
Italian	75.2	65.1
Dutch	74.6	65.0
German	73.9	63.8
Arabic	67.1	57.3

Model features

Tool use with Jamba

Jamba 1.5 supports tool use capabilities in accordance with Hugging Face's tool use API. The tools defined by the user are inserted into a dedicated section in the chat template which the model was trained to recognize.

Given a conversation that contains tools, the model can output content, tool invocations, or both. Tool invocations are formatted within the assistant message as a list of JSON-formatted dictionaries, wrapped in a dedicated special token, as can be seen in the example below.

Tool usage example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Mini")

messages = [
    {
        "role": "user", 
        "content": "What's the weather like right now in Jerusalem and in London?"
    }
]

tools = [
    {
        'type': 'function', 
        'function': {
            'name': 'get_current_weather', 
            'description': 'Get the current weather', 
            'parameters': {
                'type': 'object', 
                'properties': {
                    'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 
                    'format': {'type': 'string', 'enum': ['celsius', 'fahrenheit'], 'description': 'The temperature unit to use. Infer this from the users location.'}
                }, 
                'required': ['location', 'format']
            }
        }
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)
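Once the model answers, the tool invocations have to be pulled out of the assistant message. The sketch below shows one way to do that, under loudly stated assumptions: the wrapper tokens <tool_calls> and </tool_calls>, the "name" and "arguments" keys, and the sample text are illustrative placeholders rather than confirmed model output, so check the model's chat template for the actual special tokens. extract_tool_calls is a hypothetical helper.

```python
import json
import re

# Assumption: placeholder wrapper tokens; replace with the special tokens
# the model's chat template actually uses.
OPEN_TOKEN = "<tool_calls>"
CLOSE_TOKEN = "</tool_calls>"


def extract_tool_calls(assistant_text: str,
                       open_token: str = OPEN_TOKEN,
                       close_token: str = CLOSE_TOKEN) -> list:
    """Return the list of tool-call dictionaries found in an assistant message.

    Returns an empty list when the message contains no wrapped tool calls.
    """
    pattern = re.escape(open_token) + r"(.*?)" + re.escape(close_token)
    match = re.search(pattern, assistant_text, flags=re.DOTALL)
    if match is None:
        return []
    calls = json.loads(match.group(1))
    if not isinstance(calls, list):
        raise ValueError("expected a JSON list of tool calls")
    return calls


# Hand-written sample for illustration; not real model output.
sample = (
    '<tool_calls>['
    '{"name": "get_current_weather", "arguments": {"location": "Jerusalem", "format": "celsius"}}, '
    '{"name": "get_current_weather", "arguments": {"location": "London", "format": "celsius"}}'
    ']</tool_calls>'
)

calls = extract_tool_calls(sample)
for call in calls:
    print(call["name"], call["arguments"])
```

Each parsed dictionary can then be dispatched to the matching function from the tools list defined above.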

📄 License

The models are released under the Jamba Open Model License, a permissive license allowing full research use and commercial use under the license terms. If you need to license the model for your needs, talk to us.

For more details on this model, see the white paper and the release blog post.
