Jamba-v0.1-9B: open-source large language model with ultra-long-context inference, usable on a single 80GB GPU

Jamba V0.1 9B

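Developed by TechxGenus

Jamba is a state-of-the-art large language model built on a hybrid SSM-Transformer architecture. It combines the strengths of the attention mechanism and the Mamba architecture, supports a 256K context length, and is suited to inference on a single 80GB GPU.

Large language model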

Transformers

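Open-source license: Apache-2.0 #Hybrid SSM-Transformer architecture #256K long-context support #Efficient single-GPU inference

Downloads: 22

Released: 4/8/2024

Model overview

Jamba is a pretrained mixture-of-experts (MoE) text generation model with 12 billion active parameters and 52 billion total parameters across all experts. On most common benchmarks it matches or outperforms the best-performing models of similar size.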

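Model features

Hybrid architecture

Combines the Transformer attention mechanism with the strengths of the Mamba architecture to improve model throughput.

Long-context support

Supports a context length of up to 256K, making it suitable for long documents and complex tasks.

Efficient inference

With the optimized implementation, a single 80GB GPU can process up to 140K tokens, which makes the model practical to deploy.

Mixture of experts (MoE)

Uses a mixture-of-experts architecture with 12 billion active parameters and 52 billion total parameters, balancing performance and efficiency.

Model capabilities

Text generation

Long-context processing

Efficient inference

Use cases

Text generation

Content creation

Generates high-quality text content such as articles and stories.

Code generation

Helps developers generate code snippets and complete programming tasks.

Research and development

Model fine-tuning

Can serve as a base model that is adapted to specific tasks with the PEFT library.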

🚀 Jamba-v0.1-9B

A dense version of Jamba-v0.1 that keeps only the weights of the first expert, so it no longer uses MoE. See this script for details. It can run inference on a single 3090/4090, and usage is exactly the same as for Jamba-v0.1.
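The extraction script itself is not reproduced here. As a rough sketch of the idea only, suppose the MoE layers store their weights under keys such as `...moe.experts.<i>.<name>` alongside a `...moe.router.<name>` entry. These key names and the `extract_first_expert` helper are hypothetical, chosen for illustration, and the real checkpoint may be laid out differently. Keeping the first expert then amounts to renaming its keys and dropping the router and the remaining experts:

```python
import re

# Hypothetical key layout, assumed for illustration only; the real
# checkpoint may name its MoE parameters differently.
EXPERT_KEY = re.compile(r"^(?P<prefix>.*\.moe)\.experts\.(?P<idx>\d+)\.(?P<rest>.+)$")
ROUTER_KEY = re.compile(r"^.*\.moe\.router\..+$")


def extract_first_expert(state_dict):
    """Return a dense state dict that keeps only expert 0 of every MoE layer."""
    dense = {}
    for key, value in state_dict.items():
        if ROUTER_KEY.match(key):
            continue  # a dense MLP has no router
        match = EXPERT_KEY.match(key)
        if match is None:
            dense[key] = value  # all non-MoE weights are copied unchanged
        elif match.group("idx") == "0":
            # Drop the "experts.0" segment so the weights look like a plain MLP.
            dense[f"{match.group('prefix')}.{match.group('rest')}"] = value
        # Weights of every other expert are discarded.
    return dense


# Toy state dict with string placeholders instead of tensors.
toy = {
    "layers.0.mamba.in_proj.weight": "m",
    "layers.1.moe.router.weight": "r",
    "layers.1.moe.experts.0.up_proj.weight": "e0",
    "layers.1.moe.experts.1.up_proj.weight": "e1",
}
dense_toy = extract_first_expert(toy)
print(sorted(dense_toy))
```

A real conversion would also need to update the model configuration so that the loader builds dense MLP layers; that step is omitted here.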

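🚀 Quick start

Jamba is a state-of-the-art hybrid SSM-Transformer LLM. It delivers higher throughput than traditional Transformer-based models, and it matches or outperforms the leading models in its size class on most common benchmarks.

This model card is for the base version of Jamba. It is a pretrained mixture-of-experts (MoE) generative text model with 12 billion active parameters and 52 billion total parameters across all experts. It supports a 256K context length and can fit up to 140K tokens on a single 80GB GPU.

For full details of this model, see the release blog post.

✨ Key features

A state-of-the-art hybrid SSM-Transformer LLM with higher throughput than traditional Transformer-based models.
Matches or outperforms the leading models in its size class on most common benchmarks.
Supports a 256K context length and can fit up to 140K tokens on a single 80GB GPU.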
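To see why a hybrid design helps at long context, it is useful to estimate the attention key/value cache. The sketch below assumes 32 layers of which one in every eight is an attention layer, 8 key/value heads, and a head dimension of 128. These figures are assumptions for illustration and are not stated in this card. Mamba layers keep a fixed-size state rather than a per-token cache, so only the attention layers are counted.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the key/value cache for one sequence, in bytes."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * attn_layers * kv_heads * head_dim * tokens * bytes_per_value


CONTEXT = 256 * 1024  # 256K tokens
GIB = 1024 ** 3

# Assumed configuration (not stated in this card): 4 attention layers out of 32.
hybrid = kv_cache_bytes(CONTEXT, attn_layers=4, kv_heads=8, head_dim=128)
# Hypothetical comparison: the same shapes with attention in all 32 layers.
full_attention = kv_cache_bytes(CONTEXT, attn_layers=32, kv_heads=8, head_dim=128)

print(f"hybrid: {hybrid / GIB:.0f} GiB, all-attention: {full_attention / GIB:.0f} GiB")
```

Under these assumptions the cache shrinks in proportion to the share of attention layers, which is where the long-context headroom comes from.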

📦 Installation

Prerequisites

Jamba requires transformers version 4.39.0 or later.

pip install "transformers>=4.39.0"
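A quick way to confirm the requirement before loading the model is to compare the installed version against the minimum. This is a minimal sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers that read just the leading numeric components, so pre-release suffixes such as `rc1` are ignored.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.39.0' or '4.40.0.dev0' into a tuple of three leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # treat '4.39' as '4.39.0'
    return tuple(parts[:3])


def meets_minimum(installed, minimum="4.39.0"):
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


try:
    installed = metadata.version("transformers")
    print(installed, "ok" if meets_minimum(installed) else "too old for Jamba")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```

For strict handling of pre-release and development versions, the `packaging` library's version parser is likely a better fit than this simplified comparison.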

To run the optimized Mamba implementation, first install mamba-ssm and causal-conv1d.

pip install mamba-ssm "causal-conv1d>=1.2.0"

The model must also be on a CUDA device.

You can run the model without the optimized Mamba kernels, but this is not recommended because latency increases significantly. To do so, specify use_mamba_kernels=False when loading the model.
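One way to apply this fallback automatically is to check whether the kernel packages can be imported and add `use_mamba_kernels=False` only when they cannot. The helpers below are hypothetical and shown as a sketch; `mamba_ssm` and `causal_conv1d` are the import names of the two packages installed above.

```python
import importlib.util


def mamba_kernels_available():
    """True if both optimized-kernel packages can be imported."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("mamba_ssm", "causal_conv1d")
    )


def jamba_load_kwargs():
    """Extra keyword arguments for from_pretrained (hypothetical helper)."""
    kwargs = {}
    if not mamba_kernels_available():
        # Fall back to the slower implementation described above.
        kwargs["use_mamba_kernels"] = False
    return kwargs


print(jamba_load_kwargs())
```

The result could then be passed along when loading, for example `AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1", trust_remote_code=True, **jamba_load_kwargs())`.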

💻 Usage examples

Basic usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")

input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"]

outputs = model.generate(input_ids, max_new_tokens=216)

print(tokenizer.batch_decode(outputs))
# ["<|startoftext|>In the recent Super Bowl LVIII, the Kansas City Chiefs emerged victorious, defeating the San Francisco 49ers in a thrilling overtime showdown. The game was a nail-biter, with both teams showcasing their skills and determination.\n\nThe Chiefs, led by their star quarterback Patrick Mahomes, displayed their offensive prowess, while the 49ers, led by their strong defense, put up a tough fight. The game went into overtime, with the Chiefs ultimately securing the win with a touchdown.\n\nThe victory marked the Chiefs' second Super Bowl win in four years, solidifying their status as one of the top teams in the NFL. The game was a testament to the skill and talent of both teams, and a thrilling end to the NFL season.\n\nThe Super Bowl is not just about the game itself, but also about the halftime show and the commercials. This year's halftime show featured a star-studded lineup, including Usher, Alicia Keys, and Lil Jon. The show was a spectacle of music and dance, with the performers delivering an energetic and entertaining performance.\n"]

Advanced usage

Loading the model in half precision

The published checkpoint is saved in BF16. To load it into RAM in BF16/FP16, you need to specify torch_dtype.

from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             trust_remote_code=True,
                                             torch_dtype=torch.bfloat16)    # you can also use torch_dtype=torch.float16

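When using half precision, you can enable the FlashAttention2 implementation of the attention blocks. To use it, the model must also be on a CUDA device. Because the model is too large to fit on a single 80GB GPU at this precision, you also need to parallelize it with accelerate.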

from transformers import AutoModelForCausalLM
import torch
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             device_map="auto")
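The statement that the half-precision model does not fit on one 80GB GPU follows from simple arithmetic on the parameter counts given in this card. The sketch below is a rough estimate of weight memory only; it ignores activations, caches, and framework overhead, and the 8-bit figure ignores any blocks left unquantized.

```python
TOTAL_PARAMS = 52e9   # total parameters across all experts (from this card)
ACTIVE_PARAMS = 12e9  # parameters used per token (from this card)
GPU_GB = 80


def weight_gb(params, bytes_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


bf16_gb = weight_gb(TOTAL_PARAMS, 2)  # 16-bit: 2 bytes per parameter
int8_gb = weight_gb(TOTAL_PARAMS, 1)  # 8-bit: 1 byte per parameter

print(f"BF16 weights: {bf16_gb:.0f} GB, fits on one GPU: {bf16_gb < GPU_GB}")
print(f"8-bit weights: {int8_gb:.0f} GB, fits on one GPU: {int8_gb < GPU_GB}")
print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

All experts have to be resident in memory even though only the active ones are used for a given token, which is why the total count, not the active count, determines the footprint.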

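Loading the model in 8-bit

With 8-bit precision, sequence lengths of up to 140K fit on a single 80GB GPU. You can easily quantize the model to 8-bit with bitsandbytes. To avoid degrading model quality, we recommend excluding the Mamba blocks from quantization.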

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
quantization_config = BitsAndBytesConfig(load_in_8bit=True,
                                         llm_int8_skip_modules=["mamba"])
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1",
                                             trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2",
                                             quantization_config=quantization_config)
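The intent of `llm_int8_skip_modules=["mamba"]` is that linear layers inside the Mamba blocks stay in their original precision while the rest are quantized. The sketch below illustrates that intent with a simplified matching rule: a module is skipped when one component of its dotted name equals a listed key. Both the rule and the module names are assumptions for illustration; the actual matching is done inside transformers and may differ in detail.

```python
def is_skipped(module_name, skip_modules):
    """Simplified rule: skip when any dotted-path component is a listed key."""
    return any(part in skip_modules for part in module_name.split("."))


# Hypothetical module names, loosely modeled on a layered decoder.
names = [
    "model.layers.0.mamba.in_proj",
    "model.layers.0.mamba.out_proj",
    "model.layers.4.self_attn.q_proj",
    "model.layers.1.moe.experts.0.up_proj",
]

for name in names:
    status = "kept in 16-bit" if is_skipped(name, ["mamba"]) else "quantized to 8-bit"
    print(f"{name}: {status}")
```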

Fine-tuning example

Jamba is a base model that can be fine-tuned for custom solutions (including chat/instruct versions). You can fine-tune it with any technique you choose. Here is an example of fine-tuning with the PEFT library.

from datasets import load_dataset
from trl import SFTTrainer
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
model = AutoModelForCausalLM.from_pretrained("ai21labs/Jamba-v0.1", trust_remote_code=True, device_map='auto')

dataset = load_dataset("Abirate/english_quotes", split="train")
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    logging_dir='./logs',
    logging_steps=10,
    learning_rate=2e-3
)
lora_config = LoraConfig(
    r=8,
    target_modules=["embed_tokens", "x_proj", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none"
)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    train_dataset=dataset,
    dataset_text_field="quote",
)

trainer.train()
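For a sense of how small a LoRA update is, the number of trainable parameters can be worked out directly: a rank-r adapter on a weight of shape (d_out, d_in) adds r * (d_in + d_out) parameters. The sketch below uses r=8 from the config above; the layer shapes are hypothetical and are not taken from the model.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters of one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


R = 8  # rank from the LoraConfig above

# Hypothetical projection shapes (d_in, d_out), for illustration only.
shapes = {
    "in_proj": (4096, 16384),
    "out_proj": (8192, 4096),
}

for name, (d_in, d_out) in shapes.items():
    full = d_in * d_out
    added = lora_params(d_in, d_out, R)
    print(f"{name}: {added:,} LoRA params vs {full:,} frozen ({added / full:.2%})")
```

Because the added count grows with the sum of the two dimensions rather than their product, the adapter stays a small fraction of each frozen weight matrix at low rank.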