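AI21-Jamba-Large-1.5: an open-source foundation AI model that processes long content efficiently and fits a wide range of business scenarios

AI21 Jamba Large 1.5

Developed by ai21labs

AI21 Jamba 1.5 is a family of advanced foundation models with strong long-context handling and fast inference, suited to a wide range of business scenarios.

Large language model

Safetensors

Multilingual support · Open-source license: Other · #long-context #multilingual #efficient-inference

Released: 8/19/2024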

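Model Overview

The AI21 Jamba 1.5 models are state-of-the-art hybrid SSM-Transformer instruction-following foundation models. They support multiple languages and are suited to business use cases such as function calling and structured output.

Model Features

Advanced architecture

The hybrid SSM-Transformer architecture combines the strengths of state space models and Transformers.

Efficient inference

Inference is up to 2.5x faster than leading comparable models, with support for long contexts.

Multilingual support

Supports nine languages, including English, Spanish, and French.

Business optimization

Optimized for business use cases such as function calling, structured output (JSON), and grounded generation.

Flexible deployment

Can be deployed on a single node with eight 80GB GPUs, and provides the efficient ExpertsInt8 quantization technique.

Model Capabilities

Text generation

Function calling

Structured output

Multilingual processing

Long-context understanding

Use Cases

Business applications

Function calling

Calls external functions in business workflows to automate tasks.

Structured output

Generates structured output in JSON format for easier integration with business systems.

Multilingual processing

Multilingual text generation

Generates text in multiple languages, suited to international business scenarios.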

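🚀 AI21 Jamba 1.5 Models

AI21 Jamba 1.5 is a family of advanced foundation models. They combine strong long-context handling with fast inference and can be applied to a variety of business scenarios, including function calling and structured output. The models in this family score well on several benchmarks and support multiple languages.

🚀 Quick Start

This version will be deprecated on May 6, 2024. We recommend migrating to the newer version.

✨ Key Features

Advanced architecture: the AI21 Jamba 1.5 models are state-of-the-art hybrid SSM-Transformer instruction-following foundation models.
Efficient inference: the most powerful and efficient long-context models on the market, with inference up to 2.5x faster than leading comparable models.
Multilingual support: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew.
Business optimization: optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.
Flexible deployment: ExpertsInt8, a new and efficient quantization technique designed for deploying MoE models in vLLM, allows Jamba 1.5 Large to be deployed on a single node with eight 80GB GPUs.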

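📦 Installation

Running the optimized Mamba implementation

To run the optimized Mamba implementation, first install mamba-ssm and causal-conv1d. The version specifier is quoted so that the shell does not treat `>=` as an output redirection.

pip install mamba-ssm "causal-conv1d>=1.2.0"

The model must also be placed on a CUDA device.

Installing vLLM

vLLM is the recommended way to run efficient inference with Jamba 1.5 Large. Install vLLM first (version 0.5.5 or later is required).

pip install "vllm>=0.5.5"

Installing other dependencies

To fine-tune Jamba 1.5 Large with the Hugging Face stack plus axolotl and FSDP, install the following dependencies.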

git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl
pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'

pip install bitsandbytes~=0.43.3
pip install trl
pip install peft~=0.12.0
pip install accelerate~=0.33.0
pip install mamba-ssm "causal-conv1d>=1.2.0"
pip install git+https://github.com/xgal/transformers@897f80665c37c531b7803f92655db

💻 Usage Examples

Basic Usage

Inference with vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = "ai21labs/AI21-Jamba-1.5-Large"

llm = LLM(model=model,
          tensor_parallel_size=8,
          max_model_len=220*1024,
          quantization="experts_int8",
         )

tokenizer = AutoTokenizer.from_pretrained(model)

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

sampling_params = SamplingParams(temperature=0.4, top_p=0.95, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

Loading the model with `transformers`

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load in 8-bit with bitsandbytes, leaving the Mamba modules unquantized
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])

# a device map to distribute the model evenly across 8 GPUs
device_map = {'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 1, 'model.layers.10': 1, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 2, 'model.layers.19': 2, 'model.layers.20': 2, 'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2, 'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 3, 'model.layers.28': 3, 'model.layers.29': 3, 'model.layers.30': 3, 'model.layers.31': 3, 'model.layers.32': 3, 'model.layers.33': 3, 'model.layers.34': 3, 'model.layers.35': 3, 'model.layers.36': 4, 'model.layers.37': 4, 'model.layers.38': 4, 'model.layers.39': 4, 'model.layers.40': 4, 'model.layers.41': 4, 'model.layers.42': 4, 'model.layers.43': 4, 'model.layers.44': 4, 'model.layers.45': 5, 'model.layers.46': 5, 'model.layers.47': 5, 'model.layers.48': 5, 'model.layers.49': 5, 'model.layers.50': 5, 'model.layers.51': 5, 'model.layers.52': 5, 'model.layers.53': 5, 'model.layers.54': 6, 'model.layers.55': 6, 'model.layers.56': 6, 'model.layers.57': 6, 'model.layers.58': 6, 'model.layers.59': 6, 'model.layers.60': 6, 'model.layers.61': 6, 'model.layers.62': 6, 'model.layers.63': 7, 'model.layers.64': 7, 'model.layers.65': 7, 'model.layers.66': 7, 'model.layers.67': 7, 'model.layers.68': 7, 'model.layers.69': 7, 'model.layers.70': 7, 'model.layers.71': 7, 'model.final_layernorm': 7, 'lm_head': 7}
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config,
                                             device_map=device_map)

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
   {"role": "system", "content": "You are an ancient oracle who speaks in cryptic but wise phrases, always hinting at deeper meanings."},
   {"role": "user", "content": "Hello!"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt').to(model.device)

outputs = model.generate(input_ids, max_new_tokens=216)

# Decode the output
conversation = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Split the conversation to get only the assistant's response
assistant_response = conversation.split(messages[-1]['content'])[1].strip()
print(assistant_response)
# Output: Seek and you shall find. The path is winding, but the journey is enlightening. What wisdom do you seek from the ancient echoes?
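
The hand-written device map in the example above follows a regular pattern: 72 layers split into consecutive blocks of nine per GPU, with the embeddings on GPU 0 and the final layer norm and LM head on GPU 7. As a minimal sketch, the same dictionary can be generated programmatically. The helper name `build_device_map` is hypothetical and not part of `transformers`.

```python
def build_device_map(num_layers: int = 72, num_gpus: int = 8) -> dict[str, int]:
    """Assign consecutive blocks of layers to GPUs, mirroring the literal map above."""
    layers_per_gpu = max(1, num_layers // num_gpus)  # 72 // 8 = 9 layers per GPU
    device_map = {"model.embed_tokens": 0}
    for layer in range(num_layers):
        # Clamp so any remainder layers land on the last GPU.
        device_map[f"model.layers.{layer}"] = min(layer // layers_per_gpu, num_gpus - 1)
    device_map["model.final_layernorm"] = num_gpus - 1
    device_map["lm_head"] = num_gpus - 1
    return device_map


device_map = build_device_map()
print(device_map["model.layers.8"], device_map["model.layers.9"], device_map["model.layers.71"])
# prints: 0 1 7
```

The result can be passed as `device_map=device_map` to `from_pretrained` in place of the literal dictionary.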

Advanced Usage

Loading the model on CPU

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ai21labs/AI21-Jamba-1.5-Large",
                                             use_mamba_kernels=False)

Tool use example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
    {
        "role": "user", 
        "content": "What's the weather like right now in Jerusalem and in London?"
    }
]

tools = [
    {
        'type': 'function', 
        'function': {
            'name': 'get_current_weather', 
            'description': 'Get the current weather', 
            'parameters': {
                'type': 'object', 
                'properties': {
                    'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 
                    'format': {'type': 'string', 'enum': ['celsius', 'fahrenheit'], 'description': 'The temperature unit to use. Infer this from the users location.'}
                }, 
                'required': ['location', 'format']
            }
        }
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)

Feeding tool responses back to the model

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

# `tools` is the list defined in the tool use example above.
# Note that you must send the tool responses in the same order as the model called the tools:
messages = [
    {
        "role": "user",
        "content": "What's the weather like right now in Jerusalem and in London?"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"
            },
            {
                "name": "get_current_weather",
                "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"
            }
        ]
    },
    {
        "role": "tool",
        "content": "The weather in Jerusalem is 18 degrees celsius."
    },
    {
        "role": "tool",
        "content": "The weather in London is 8 degrees celsius."
    }
]

tool_use_prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
)
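
The tool messages in the example above are written by hand. In an application they would come from actually running each call the model requested. The following is a minimal sketch of that step, assuming the assistant message has the shape shown above (a `tool_calls` list whose `arguments` field is a JSON string). The `get_current_weather` stub, its canned temperatures, and the `TOOL_REGISTRY` and `run_tool_calls` names are hypothetical.

```python
import json


def get_current_weather(location: str, format: str) -> str:
    """Hypothetical stand-in for a real weather lookup (canned values only)."""
    fake_temperatures = {"Jerusalem": 18, "London": 8}
    return f"The weather in {location} is {fake_temperatures[location]} degrees {format}."


# Callable tools, keyed by the name the model uses in its tool call.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Run each tool call in order and return one "tool" message per call."""
    tool_messages = []
    for call in assistant_message["tool_calls"]:
        arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
        result = TOOL_REGISTRY[call["name"]](**arguments)
        tool_messages.append({"role": "tool", "content": result})
    return tool_messages


assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {"name": "get_current_weather",
         "arguments": "{\"location\": \"Jerusalem\", \"format\": \"celsius\"}"},
        {"name": "get_current_weather",
         "arguments": "{\"location\": \"London\", \"format\": \"celsius\"}"},
    ],
}

tool_messages = run_tool_calls(assistant_message)
for message in tool_messages:
    print(message["content"])
```

Iterating over `tool_calls` in list order keeps the responses in the same order as the calls, which is the requirement stated in the comment above.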

Adding documents to the prompt

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")

messages = [
        {
            "role": "user",
            "content": "Who wrote Harry Potter?"
        }
]

documents = [
        {
            "text": "Harry Potter is a series of seven fantasy novels written by British author J. K. Rowling.",
            "title": "Harry Potter"
        },
        {
            "text": "The Great Gatsby is a novel by American writer F. Scott Fitzgerald.",
            "title": "The Great Gatsby",
            "country": "United States",
            "genre": "Novel"

        }
]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    tokenize=False,
)

# Output: J. K. Rowling

Using JSON mode

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai21labs/AI21-Jamba-1.5-Large")
messages = [
    {'role':'user', 
     'content':'Describe the first American president. Include year of birth (number) and name (string).'}
    ]
prompt = tokenizer.apply_chat_template(messages,
                                       tokenize=False,
                                       add_generation_prompt=False,
                                       knobs={"response_format": "json_object", "is_set": True})

#Output: "{ "year of birth": 1732, "name": "George Washington." }"
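
Before handing a JSON-mode reply to a downstream system, it is usually worth parsing and checking it. The sketch below does this for the sample reply above. The key names `year of birth` and `name` are taken from that sample output and are an assumption; the model may choose different keys for a different prompt. The function name `parse_president_json` is hypothetical.

```python
import json


def parse_president_json(raw: str) -> dict:
    """Parse the model's reply and check the two fields the prompt asked for."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("year of birth"), int):
        raise ValueError("'year of birth' must be a number")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    return data


reply = '{ "year of birth": 1732, "name": "George Washington." }'
parsed = parse_president_json(reply)
print(parsed["year of birth"], parsed["name"])
# prints: 1732 George Washington.
```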

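📚 Documentation

Model Details

Attribute	Details
Developed by	AI21
Model type	Joint Attention and Mamba (Jamba)
License	Jamba Open Model License
Context length	256K
Knowledge cutoff date	March 5, 2024
Supported languages	English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, Hebrew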

Results on Common Benchmarks

General benchmarks

Benchmark	Jamba 1.5 Mini	Jamba 1.5 Large
Arena Hard	46.1	65.4
Wild Bench	42.4	48.5
MMLU (CoT)	69.7	81.2
MMLU Pro (CoT)	42.5	53.5
GPQA	32.3	36.9
ARC Challenge	85.7	93
BFCL	80.6	85.5
GSM-8K	75.8	87
RealToxicity (lower is better)	8.1	6.7
TruthfulQA	54.1	58.3

RULER benchmark: effective context length

Model	Claimed length	Effective length	4K	8K	16K	32K	64K	128K	256K
Jamba 1.5 Large (94B/398B)	256K	256K	96.7	96.6	96.4	96.0	95.4	95.1	93.9
Jamba 1.5 Mini (12B/52B)	256K	256K	95.7	95.2	94.7	93.8	92.7	89.8	86.1
Gemini 1.5 Pro	1M	>128K	96.7	95.8	96.0	95.9	95.9	94.4	--
GPT-4 1106-preview	128K	64K	96.6	96.3	95.2	93.2	87.0	81.2	--
Llama 3.1 70B	128K	64K	96.5	95.8	95.4	94.8	88.4	66.6	--
Command R-plus (104B)	128K	32K	95.6	95.2	94.2	92.0	84.3	63.1	--
Llama 3.1 8B	128K	32K	95.5	93.8	91.6	87.4	84.7	77.0	--
Mistral Large 2 (123B)	128K	32K	96.2	96.1	95.1	93.0	78.8	23.7	--
Mixtral 8x22B (39B/141B)	64K	32K	95.6	94.9	93.4	90.9	84.7	31.7	--
Mixtral 8x7B (12.9B/46.7B)	32K	32K	94.9	92.1	92.5	85.9	72.4	44.5	--
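
This document does not say how the effective length column is derived from the per-length scores. One reading that reproduces the column for every model evaluated up to its cutoff is "the longest context at which the score stays at or above a fixed threshold". The sketch below uses a threshold of 85.6, which is an assumption chosen because it matches the rows above; it is not stated anywhere in this document. The function name `effective_length` is hypothetical.

```python
LENGTHS = ["4K", "8K", "16K", "32K", "64K", "128K", "256K"]


def effective_length(scores, threshold=85.6):
    """Return the longest length label for which every score so far meets the threshold.

    `scores` follows the column order of LENGTHS; None marks a length that was
    not evaluated ("--" in the table). The threshold is an assumed value.
    """
    best = None
    for label, score in zip(LENGTHS, scores):
        if score is None or score < threshold:
            break
        best = label
    return best


jamba_mini = [95.7, 95.2, 94.7, 93.8, 92.7, 89.8, 86.1]
mistral_large_2 = [96.2, 96.1, 95.1, 93.0, 78.8, 23.7, None]
gpt4_1106 = [96.6, 96.3, 95.2, 93.2, 87.0, 81.2, None]

print(effective_length(jamba_mini))       # prints: 256K
print(effective_length(mistral_large_2))  # prints: 32K
print(effective_length(gpt4_1106))        # prints: 64K
```

For Gemini 1.5 Pro this function would return 128K, whereas the table reports ">128K" because that model was not evaluated at 256K.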

Multilingual MMLU

Language	Jamba 1.5 Large	Jamba 1.5 Mini
French	75.8	65.9
Spanish	75.5	66.3
Portuguese	75.5	66.7
Italian	75.2	65.1
Dutch	74.6	65.0
German	73.9	63.8
Arabic	67.1	57.3

Notes

⚠️ Important Note

Versions 4.44.0 and 4.44.1 of transformers have a bug that limits the ability to run the Jamba architecture. Do not use these versions.
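
A small guard at start-up can catch the affected versions before the model is loaded. The following is a minimal sketch using only the standard library; the function names and the message wording are hypothetical, and the set of affected versions is exactly the two named above.

```python
from importlib import metadata

# The two releases named in the note above.
AFFECTED_TRANSFORMERS_VERSIONS = {"4.44.0", "4.44.1"}


def is_affected(version: str) -> bool:
    """True if the given transformers version string is one of the affected releases."""
    return version.strip() in AFFECTED_TRANSFORMERS_VERSIONS


def check_installed_transformers() -> str:
    """Describe whether the installed transformers release is one of the affected ones."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if is_affected(installed):
        return f"transformers {installed} is affected; install a different version"
    return f"transformers {installed} is not one of the affected versions"


print(check_installed_transformers())
```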

⚠️ Important Note

If you have trouble installing mamba-ssm and causal-conv1d for the optimized Mamba kernels, you can run Jamba 1.5 Large without them, at the cost of extra latency. In that case, pass the keyword argument use_mamba_kernels=False when loading the model with AutoModelForCausalLM.from_pretrained().
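
The choice of `use_mamba_kernels` can be made automatically by checking whether the two optional packages are present. The sketch below assumes their import names are `mamba_ssm` and `causal_conv1d`; it only checks that the packages can be found, not that their CUDA kernels load correctly. The helper name `mamba_kernels_available` is hypothetical.

```python
from importlib.util import find_spec


def mamba_kernels_available() -> bool:
    """True only if both optional kernel packages can be found on the import path."""
    return all(find_spec(name) is not None for name in ("mamba_ssm", "causal_conv1d"))


# Keyword arguments to forward to AutoModelForCausalLM.from_pretrained(...).
load_kwargs = {"use_mamba_kernels": mamba_kernels_available()}
print(load_kwargs)
```

The dictionary can then be unpacked into the loading call, for example `from_pretrained("ai21labs/AI21-Jamba-1.5-Large", **load_kwargs)`.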

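🔧 Technical Details

Fine-tuning the model

This section describes how to fine-tune Jamba 1.5 Large on a single 8xA100/H100 (80GB GPU) node using the Hugging Face stack with axolotl and FSDP.

Running the latest version of transformers with FSDP uses excessive CPU RAM, so a patched version is used instead. Specifically, the model is loaded in full onto the CPU on every rank rather than only on rank 0, which greatly increases total CPU RAM usage. For Jamba 1.5 Large, this means more than 1.6TB of memory is needed instead of the 200GB actually required. Thanks to Wing Lian and the axolotl team for their contribution!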
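
The 1.6TB figure follows directly from the per-rank loading behavior. As a rough sketch, assuming one rank per GPU on an 8-GPU node and the 200GB per-copy figure quoted above (the function name is hypothetical):

```python
def fsdp_cpu_ram_gb(model_ram_gb: float, ranks: int, load_on_every_rank: bool) -> float:
    """Total CPU RAM needed while loading, depending on how many ranks hold a full copy."""
    copies = ranks if load_on_every_rank else 1
    return model_ram_gb * copies


print(fsdp_cpu_ram_gb(200, 8, load_on_every_rank=True))   # prints: 1600
print(fsdp_cpu_ram_gb(200, 8, load_on_every_rank=False))  # prints: 200
```

Eight full copies at 200GB each come to 1,600GB, which is the 1.6TB quoted above; loading on rank 0 only keeps the requirement at a single copy.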

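Install the latest version of axolotl (from August 21, 2024 onward) or use the provided Docker image.

Quantization

Jamba 1.5 Large is too large to load in full precision (FP32) or half precision (FP16/BF16) on a single node with eight 80GB GPUs, so quantization is required. We developed ExpertsInt8, a new and efficient quantization technique designed for deploying MoE models (including Jamba models) in vLLM. With it, Jamba 1.5 Large can be deployed on a single node with eight 80GB GPUs.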
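
A back-of-the-envelope calculation shows why half precision does not fit. The sketch below uses the 398B total parameter count listed for Jamba 1.5 Large in the RULER table and counts weights only. It makes several simplifying assumptions: 1 GB is taken as 10^9 bytes, the KV cache and activations are ignored, and the INT8 row treats every weight as 8-bit, whereas a scheme aimed at expert weights would leave some weights at higher precision.

```python
TOTAL_PARAMS = 398e9     # total parameters, from the RULER table entry (94B/398B)
NODE_MEMORY_GB = 8 * 80  # eight 80GB GPUs on one node

BYTES_PER_PARAM = {"FP32": 4, "FP16/BF16": 2, "INT8": 1}


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    needed = weight_memory_gb(TOTAL_PARAMS, nbytes)
    verdict = "fits" if needed <= NODE_MEMORY_GB else "does not fit"
    print(f"{name}: {needed:.0f} GB of weights, {verdict} in {NODE_MEMORY_GB} GB")
# prints:
# FP32: 1592 GB of weights, does not fit in 640 GB
# FP16/BF16: 796 GB of weights, does not fit in 640 GB
# INT8: 398 GB of weights, fits in 640 GB
```

The remaining headroom in the 8-bit case is what is left for activations and long-context state at inference time.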

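📄 License

This model is released under the Jamba Open Model License, a permissive license that allows full research use and commercial use within the terms of the license. If you need to license the model for your own needs, please contact AI21. For more details about the model, see the white paper and the release blog post.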