Jamba-v0.1 Large Language Model Released as Open Source: Hybrid Design, Very Long Context, and Higher Throughput Than Models of Similar Size

Jamba V0.1

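Developed by ai21labs

Jamba is a state-of-the-art hybrid SSM-Transformer large language model. It combines the advantages of the Mamba architecture and the Transformer, supports a context length of 256K tokens, and delivers higher throughput than Transformer-based models of similar size while matching or exceeding them on most common benchmarks.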

Large Language Model

Transformers

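Open-source license: Apache-2.0 #Hybrid SSM-Transformer architecture #256K long-context processing #Production-grade Mamba implementation

Downloads: 6,247

Release date: 3/28/2024

Model Overview

Jamba is the first production-grade Mamba implementation. It is a pretrained mixture-of-experts (MoE) text generation model with 12 billion active parameters and 52 billion total parameters. It is suited to text generation, fine-tuning, and research and development.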

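Model Features

Hybrid architecture

Combines Mamba's SSM architecture with the conventional Transformer, raising throughput while maintaining quality.

Very long context support

Supports a context length of 256K tokens. Up to 140K tokens fit on a single 80GB GPU when the model is loaded in 8-bit precision.

Efficient mixture of experts

Uses an MoE design: of 52 billion total parameters, only 12 billion are active for each token, which balances quality and efficiency.

Production-grade implementation

The first Mamba-architecture implementation at production scale, opening new possibilities for application development.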

Model Capabilities

Long-form text generation

Knowledge question answering

Text continuation

Base for instruction fine-tuning

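Use Cases

Research and development

Architecture research

Explore the performance limits of the hybrid SSM-Transformer architecture

Matches or exceeds models of similar size on several benchmarks

Enterprise applications

Long-document processing

Process very long documents using the 256K context length

Can maintain semantic coherence over long ranges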

🚀 Jamba Model

This is the base version of the Jamba model. We have since released an improved, instruction-tuned version, Jamba-1.5-Mini. If you need even higher performance, see the scaled-up Jamba-1.5-Large.

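🚀 Quick Start

Jamba is a state-of-the-art hybrid SSM-Transformer LLM. It delivers higher throughput than traditional Transformer-based models, and it outperforms or matches the leading models of its size class on most common benchmarks.

This model card covers the base version of Jamba. It is a pretrained mixture-of-experts (MoE) generative text model with 12 billion active parameters and 52 billion total parameters across all experts. It supports a 256K context length and can fit up to 140K tokens on a single 80GB GPU.

For full details of this model, see the white paper and the release blog post.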

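✨ Key Features

A state-of-the-art hybrid SSM-Transformer LLM with higher throughput than traditional Transformer-based models.
Outperforms or matches the leading models of its size class on most common benchmarks.
The first production-scale Mamba implementation, which opens up interesting research and application opportunities.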

📦 Installation

Prerequisites

To use Jamba, transformers version 4.40.0 or higher is recommended (version 4.39.0 or higher is required).

pip install "transformers>=4.40.0"

To run the optimized Mamba implementation, first install mamba-ssm and causal-conv1d.

pip install mamba-ssm "causal-conv1d>=1.2.0"

The model must also be on a CUDA device.

You can run the model without the optimized Mamba kernels, but this is not recommended because it results in significantly higher latency. To do so, specify use_mamba_kernels=False when loading the model.
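The fallback described above could be wired up roughly as follows. This is a minimal sketch, not part of the model card: the helper names kernels_available, jamba_load_kwargs, and load_jamba are illustrative, and mamba_ssm and causal_conv1d are assumed to be the import names of the two pip packages.

```python
import importlib.util


def kernels_available(modules=("mamba_ssm", "causal_conv1d")):
    """Return True only if every optimized-kernel package can be found."""
    return all(importlib.util.find_spec(name) is not None for name in modules)


def jamba_load_kwargs(have_kernels):
    """Extra from_pretrained keyword arguments for the given kernel availability."""
    # Without the kernels the model still runs, but with much higher latency.
    return {} if have_kernels else {"use_mamba_kernels": False}


def load_jamba(model_id="ai21labs/Jamba-v0.1"):
    """Load Jamba, falling back to the slow path when the kernels are missing."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id, **jamba_load_kwargs(kernels_available())
    )
```

Calling load_jamba() downloads the full checkpoint, so it is kept separate from the two lightweight helpers.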

💻 Usage Examples

Basic usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1")
tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")

input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"]

outputs = model.generate(input_ids, max_new_tokens=216)

print(tokenizer.batch_decode(outputs))
# ["<|startoftext|>In the recent Super Bowl LVIII, the Kansas City Chiefs emerged victorious, defeating the San Francisco 49ers in a thrilling overtime showdown. The game was a nail-biter, with both teams showcasing their skills and determination.\n\nThe Chiefs, led by their star quarterback Patrick Mahomes, displayed their offensive prowess, while the 49ers, led by their strong defense, put up a tough fight. The game went into overtime, with the Chiefs ultimately securing the win with a touchdown.\n\nThe victory marked the Chiefs' second Super Bowl win in four years, solidifying their status as one of the top teams in the NFL. The game was a testament to the skill and talent of both teams, and a thrilling end to the NFL season.\n\nThe Super Bowl is not just about the game itself, but also about the halftime show and the commercials. This year's halftime show featured a star-studded lineup, including Usher, Alicia Keys, and Lil Jon. The show was a spectacle of music and dance, with the performers delivering an energetic and entertaining performance.\n"]

If you are using transformers<4.40.0, trust_remote_code=True is required to run the new Jamba architecture.
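The version rules above (4.39.0 required, 4.40.0 recommended, trust_remote_code=True below 4.40.0) can be expressed as a small check. This is a sketch with hypothetical helper names; the version parsing is deliberately simple and only reads the major and minor components.

```python
def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '4.40.0.dev0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def trust_remote_code_needed(transformers_version):
    """Decide whether trust_remote_code=True is needed to load Jamba."""
    parsed = parse_major_minor(transformers_version)
    if parsed < (4, 39):
        raise RuntimeError("Jamba requires transformers 4.39.0 or higher")
    # Per the note above, versions below 4.40.0 need the remote code.
    return parsed < (4, 40)
```

In practice you would pass transformers.__version__ to trust_remote_code_needed and forward the result as the trust_remote_code argument of from_pretrained.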

Advanced usage

Loading the model in half precision

The published checkpoint is saved in BF16. To load it into RAM in BF16/FP16, you need to specify torch_dtype.

from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             torch_dtype=torch.bfloat16)    # you can also use torch_dtype=torch.float16

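When using half precision, you can enable the FlashAttention2 implementation of the attention blocks. To use it, the model must also be on a CUDA device. Because the model is too large to fit on a single 80GB GPU at this precision, you also need to parallelize it with accelerate.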

from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto")

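Loading the model in 8-bit

With 8-bit precision, sequence lengths of up to 140K tokens fit on a single 80GB GPU. You can easily quantize the model to 8-bit using bitsandbytes. To avoid degrading model quality, we recommend excluding the Mamba blocks from quantization.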

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config)
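The memory statements in the two loading sections can be sanity-checked with rough arithmetic. The sketch below counts weight storage only, using the parameter counts from this model card. It ignores activations, the attention cache, and the Mamba state, and it treats every weight as quantized even though the Mamba blocks are excluded above, so the real 8-bit footprint is somewhat higher than this estimate.

```python
TOTAL_PARAMS = 52e9   # total parameters across all experts (from the model card)
ACTIVE_PARAMS = 12e9  # parameters active per token (from the model card)
GPU_MEMORY_GB = 80    # capacity of a single 80GB GPU


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bytes_per_param / 1e9


bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)  # 2 bytes per BF16/FP16 value
int8_gb = weight_memory_gb(TOTAL_PARAMS, 1)  # 1 byte per 8-bit value

print(f"BF16 weights:  {bf16_gb:.0f} GB, fits on one GPU: {bf16_gb < GPU_MEMORY_GB}")
print(f"8-bit weights: {int8_gb:.0f} GB, fits on one GPU: {int8_gb < GPU_MEMORY_GB}")
print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

Under these assumptions the BF16 weights alone come to about 104 GB, which is why half precision needs more than one 80GB GPU, while the 8-bit weights come to about 52 GB and leave room for a long sequence.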

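Fine-tuning example

Jamba is a base model that can be fine-tuned for custom solutions (including chat/instruct versions). You can fine-tune it with any technique you like. Below is an example of fine-tuning with the PEFT library (it requires about 120GB of GPU RAM, for example 2xA100 80GB).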

import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1", device_map='auto', torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,
    target_modules=[
        "embed_tokens", 
        "x_proj", "in_proj", "out_proj", # mamba
        "gate_proj", "up_proj", "down_proj", # mlp
        "q_proj", "k_proj", "v_proj" # attention
    ],
    task_type="CAUSAL_LM",
    bias="none"
)

dataset = load_dataset("Abirate/english_quotes", split="train")
training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    logging_dir='./logs',
    logging_steps=10,
    learning_rate=1e-5,
    dataset_text_field="quote",
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    train_dataset=dataset,
)
trainer.train()
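To get a feel for how few weights a rank-8 LoRA adapter trains compared with the layer it wraps, the count can be worked out directly: LoRA adds two low-rank matrices per target layer, one of shape rank x in_features and one of shape out_features x rank. The layer dimensions below are assumptions chosen for illustration, not values taken from this model card.

```python
def lora_trainable_params(in_features, out_features, rank):
    """Parameters LoRA adds to one linear layer: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


RANK = 8              # matches r=8 in the LoraConfig above
HIDDEN = 4096         # assumed hidden size, for illustration only
INTERMEDIATE = 14336  # assumed MLP width, for illustration only

full = HIDDEN * INTERMEDIATE
adapter = lora_trainable_params(HIDDEN, INTERMEDIATE, RANK)
print(f"full layer: {full:,} weights, LoRA adapter: {adapter:,} weights")
print(f"adapter is {adapter / full:.2%} of the layer")
```

Raising r increases adapter capacity and memory use linearly, since the count scales with rank.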