jetmoe-8b-chat Open-Source Large Language Model - Low-Cost Training, LLaMA2-Level Performance


Jetmoe 8b Chat

Developed by jetmoe

JetMoE-8B is an efficient open-source large language model trained at a low cost of $100,000. It outperforms LLaMA2-7B while activating only 2.2 billion parameters during inference.

Large Language Model

Transformers

Open-source license: Apache-2.0 #Low-cost efficient training #Sparse-activation inference #Open source and academia-friendly

Downloads: 26

Release date: 3/31/2024

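Model Overview

An open-source large language model based on a Mixture-of-Experts (MoE) architecture. It focuses on efficient inference and low-cost training and is suited to tasks such as dialogue generation and code completion.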

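Model Features

Low-cost, efficient training

Outperforms LLaMA2-7B at a cost of only $100,000 (two weeks of training on 96 H100 GPUs).

Efficient inference

Activates only 2.2 billion parameters during inference, significantly reducing computational cost.

Fully open source

Trained on public datasets with open-source code; supports fine-tuning on consumer-grade GPUs.

Two-phase training scheme

Adopts the MiniCPM training method: phase 1 base pretraining, followed by phase 2 training that adds high-quality data.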
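The cost figure can be sanity-checked with simple arithmetic. The sketch below assumes "two weeks" means 14 days of continuous use of all 96 GPUs and treats $100,000 as an upper bound; the token count (1.25T) comes from the model details. The resulting per-GPU-hour rate and throughput are implied estimates, not numbers reported by the authors.

```python
NUM_GPUS = 96              # H100 GPUs (from the model card)
TRAINING_DAYS = 14         # "two weeks", assumed to be 14 days of continuous use
BUDGET_USD = 100_000       # stated upper bound on training cost
TRAINING_TOKENS = 1.25e12  # phase 1 (1T) + phase 2 (250B) tokens

# Total GPU time consumed by the run
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

# What the budget implies per H100-hour, and how fast each GPU had to process tokens
implied_usd_per_gpu_hour = BUDGET_USD / gpu_hours
tokens_per_gpu_second = TRAINING_TOKENS / (gpu_hours * 3600)

print(f"GPU-hours: {gpu_hours:,}")
print(f"Implied cost per H100-hour: ${implied_usd_per_gpu_hour:.2f}")
print(f"Implied throughput: {tokens_per_gpu_second:,.0f} tokens/s per GPU")
```

If the real run used fewer days or included idle time, the implied rate rises and the implied throughput changes accordingly.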

Model Capabilities

Text generation

Dialogue systems

Code completion

Math problem solving

Multi-turn dialogue

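Use Cases

Dialogue systems

Intelligent chatbot

Build a friendly, knowledgeable conversational assistant

MT-Bench score of 6.681, surpassing Llama-2-13b-chat

Code generation

Programming assistance

Automatically generate and complete code

34.2% Pass@1 on the MBPP benchmark, above LLaMA2-7B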
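The card does not say how many samples per problem were drawn for the Pass@1 numbers. The sketch below uses the widely used unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); with one greedy sample per problem, pass@1 reduces to the fraction of problems solved. The toy results are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them pass the tests."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Sequence[Tuple[int, int]], k: int = 1) -> float:
    # The benchmark score is the mean of the per-problem estimates
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results for three problems: (samples drawn, samples that passed).
# One sample per problem corresponds to greedy decoding.
toy_results = [(1, 1), (1, 0), (1, 1)]
print(benchmark_pass_at_k(toy_results))  # fraction of problems solved
```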

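🚀 JetMoE: Reaching LLaMA2 Performance with $0.1M

JetMoE-8B was trained at a cost of less than $0.1 million, yet it outperforms Meta AI's LLaMA2-7B. It is also fully open source and academia-friendly, and it uses few active parameters during inference, which significantly reduces computational cost.

🚀 Quick Start

The following is a simple example to get started with JetMoE-8B-chat.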

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Initialize the model and tokenizer
model_name = "jetmoe/jetmoe-8b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, attn_implementation="eager", trust_remote_code=True)
# Check if a GPU is available and move the model to GPU if it is
if torch.cuda.is_available():
    model = model.cuda()
    print("Using GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))
else:
    print("GPU is not available, using CPU instead.")
# Encode input context
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenized_chat)
# Move the input IDs to the same device as the model (GPU or CPU)
input_ids = tokenized_chat.to(model.device)
# Generate text
output = model.generate(input_ids, max_length=500, num_return_sequences=1, no_repeat_ngram_size=2)
# If the output is on the GPU, move it back to CPU for decoding
if torch.cuda.is_available():
    output = output.cpu()
# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

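✨ Key Features

Low cost, high performance: JetMoE-8B was trained for less than $0.1 million and outperforms Meta AI's LLaMA2-7B, which was trained with far greater resources.
Fully open source: Trained only on public datasets, and the code is open source. No proprietary resources are required.
Low computational cost: Only 2.2B active parameters are used during inference, significantly reducing computational cost.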
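A rough way to quantify the compute saving is the common rule of thumb that a decoder forward pass costs about 2 FLOPs per active parameter per token. The sketch below applies it to the active-parameter counts in the benchmark table; it ignores attention-score computation and router overhead, and it says nothing about memory, since all 8B parameters must still be loaded.

```python
# Rule of thumb: forward pass ~ 2 FLOPs per active parameter per token.
# Attention-score computation and router overhead are ignored.
ACTIVE_PARAMS = {
    "JetMoE-8B": 2.2e9,  # active parameters per token (from the model card)
    "LLaMA2-7B": 7e9,    # dense model: every parameter is active
}


def forward_flops_per_token(active_params: float) -> float:
    return 2.0 * active_params


flops = {name: forward_flops_per_token(p) for name, p in ACTIVE_PARAMS.items()}
ratio = flops["LLaMA2-7B"] / flops["JetMoE-8B"]

for name, f in flops.items():
    print(f"{name}: ~{f / 1e9:.1f} GFLOPs per generated token")
print(f"LLaMA2-7B / JetMoE-8B per-token compute: ~{ratio:.1f}x")
```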

📦 Installation

No specific installation steps are provided.


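📚 Documentation

Model Details

JetMoE-8B consists of 24 blocks. Each block has two MoE layers: Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE). Each MoA and MoE layer has 8 experts, and 2 experts are activated for each input token. The model has 8 billion parameters in total, of which 2.2B are active. JetMoE-8B was trained on 1.25T tokens from public datasets, with a learning rate of 5.0 x 10^-4 and a global batch size of 4M tokens.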
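The "8 experts, 2 active per token" routing can be illustrated with a toy top-2 gate in plain Python. This is a sketch of generic top-k MoE routing, not JetMoE's implementation: the router logits are hand-picked (in a real model a learned router computes them from the token's hidden state), the experts are toy scaling functions, and renormalizing the gate weights over the two chosen experts is an assumption. The same selection idea applies in principle to MoA, where the experts are attention heads.

```python
import math
from typing import Callable, List, Sequence, Tuple

NUM_EXPERTS = 8  # experts per MoA/MoE layer (from the model details)
TOP_K = 2        # experts activated per token (from the model details)

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: Sequence[float], k: int = TOP_K) -> List[Tuple[int, float]]:
    """Pick the k highest-probability experts; renormalizing their weights is an assumption."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x: Vector, router_logits: Sequence[float], experts: Sequence[Expert]) -> Vector:
    # Only the k routed experts run; the rest are skipped entirely,
    # which is where the "active parameters" saving comes from.
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert i scales its input by (i + 1) and records that it was called.
calls: List[int] = []


def make_expert(i: int) -> Expert:
    def expert(v: Vector) -> Vector:
        calls.append(i)
        return [(i + 1) * vi for vi in v]
    return expert


experts = [make_expert(i) for i in range(NUM_EXPERTS)]
logits = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 2.0, 0.0]  # hypothetical router scores for one token
y = moe_forward([1.0, 1.0], logits, experts)
print("experts called:", calls, "output:", y)
```

Only experts 2 and 6 are evaluated here, so six of the eight experts cost nothing for this token.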

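Training Details

Our training follows MiniCPM's two-phase training method. Phase 1 uses a constant learning rate with linear warmup and trains on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, and GitHub data. Phase 2 uses exponential learning-rate decay and trains on 250 billion tokens drawn from the phase 1 datasets plus additional high-quality open-source datasets.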
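The two-phase schedule can be sketched as a function of the optimizer step. Step counts follow from the stated token budgets and the 4M-token global batch; treating 5.0 x 10^-4 as the peak learning rate is an assumption, and the warmup length (`WARMUP_STEPS`) and the final decay ratio (`FINAL_LR_RATIO`) are hypothetical values chosen only for illustration.

```python
PEAK_LR = 5.0e-4                    # assumed to be the learning rate quoted in the model details
BATCH_TOKENS = 4_000_000            # global batch size of 4M tokens
PHASE1_TOKENS = 1_000_000_000_000   # 1T tokens
PHASE2_TOKENS = 250_000_000_000     # 250B tokens

PHASE1_STEPS = PHASE1_TOKENS // BATCH_TOKENS
PHASE2_STEPS = PHASE2_TOKENS // BATCH_TOKENS
TOTAL_STEPS = PHASE1_STEPS + PHASE2_STEPS

WARMUP_STEPS = 2_500   # hypothetical warmup length
FINAL_LR_RATIO = 0.1   # hypothetical: LR at the end of phase 2 as a fraction of the peak


def learning_rate(step: int) -> float:
    # Phase 1a: linear warmup up to the peak learning rate
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Phase 1b: constant learning rate for the rest of phase 1
    if step < PHASE1_STEPS:
        return PEAK_LR
    # Phase 2: exponential decay from PEAK_LR to PEAK_LR * FINAL_LR_RATIO
    progress = min(step - PHASE1_STEPS, PHASE2_STEPS) / PHASE2_STEPS
    return PEAK_LR * FINAL_LR_RATIO ** progress


print(f"phase 1 steps: {PHASE1_STEPS:,}, phase 2 steps: {PHASE2_STEPS:,}")
for s in (0, WARMUP_STEPS, PHASE1_STEPS, PHASE1_STEPS + PHASE2_STEPS // 2, TOTAL_STEPS):
    print(f"step {s:>7,}: lr = {learning_rate(s):.2e}")
```

Keeping the rate flat through phase 1 means a phase 1 checkpoint can be resumed into phase 2 with new data mixed in, which is the practical appeal of this style of schedule.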

Technical Report

For more details, please refer to the JetMoE Technical Report.

JetMoE Model Index

Model	Index
JetMoE-8B-Base	Link
JetMoE-8B-SFT	Link
JetMoE-8B-Chat	Link

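🔧 Technical Details

The technical details of JetMoE-8B are described in the Model Details and Training Details sections above.

📄 License

This project is released under the Apache-2.0 license.

Other Information

Benchmarks

We use the same evaluation methodology as the Open LLM Leaderboard. For the MBPP code benchmark, we use the same evaluation methodology as the LLaMA2 and Deepseek-MoE papers. The results are as follows.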

Model	Active Params	Training Tokens	Open LLM Leaderboard Avg	ARC	Hellaswag	MMLU	TruthfulQA	WinoGrande	GSM8k	MBPP	HumanEval
Shot				25	10	5	0	5	5	3	0
Metric				acc_norm	acc_norm	acc	mc2	acc	acc	Pass@1	Pass@1
LLaMA2-7B	7B	2T	51.0	53.1	78.6	46.9	38.8	74	14.5	20.8	12.8
LLaMA-13B	13B	1T	51.4	56.2	80.9	47.7	39.5	76.2	7.6	22.0	15.8
DeepseekMoE-16B	2.8B	2T	51.1	53.2	79.8	46.3	36.1	73.7	17.3	34.0	25.0
Gemma-2B	2B	2T	46.4	48.4	71.8	41.8	33.1	66.3	16.9	28.0	24.4
JetMoE-8B	2.2B	1.25T	53.0	48.7	80.5	49.2	41.7	70.2	27.8	34.2	14.6
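The "Open LLM Leaderboard Avg" column can be cross-checked against the per-task scores. The sketch assumes the average is the unweighted mean of the six leaderboard tasks (ARC through GSM8k), with MBPP and HumanEval excluded; under that assumption every recomputed value matches the reported one to within rounding to one decimal place.

```python
# Six tasks assumed to make up the "Open LLM Leaderboard Avg" column
TASKS = ["ARC", "Hellaswag", "MMLU", "TruthfulQA", "WinoGrande", "GSM8k"]

# (reported average, per-task scores in TASKS order), copied from the table above
RESULTS = {
    "LLaMA2-7B": (51.0, [53.1, 78.6, 46.9, 38.8, 74.0, 14.5]),
    "LLaMA-13B": (51.4, [56.2, 80.9, 47.7, 39.5, 76.2, 7.6]),
    "DeepseekMoE-16B": (51.1, [53.2, 79.8, 46.3, 36.1, 73.7, 17.3]),
    "Gemma-2B": (46.4, [48.4, 71.8, 41.8, 33.1, 66.3, 16.9]),
    "JetMoE-8B": (53.0, [48.7, 80.5, 49.2, 41.7, 70.2, 27.8]),
}


def leaderboard_average(scores):
    # Unweighted mean over the leaderboard tasks
    return sum(scores) / len(scores)


for model, (reported, scores) in RESULTS.items():
    recomputed = leaderboard_average(scores)
    print(f"{model:16s} reported {reported:.1f}  recomputed {recomputed:.2f}")
```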