JetMoE-8B-Chat Open-Source Large Language Model - Low-Cost Training, Outperforming LLaMA2-7B!

Jetmoe 8b Chat

Developed by jetmoe

JetMoE-8B is an efficient open-source large language model that surpasses LLaMA2-7B at a training cost of under $100,000 and activates only 2.2 billion parameters during inference.

Large Language Model

Transformers

Open Source License: Apache-2.0 #Low-cost efficient training #Sparse activation inference #Open-source and academic-friendly


Release Time: 3/31/2024

Model Overview

An open-source large language model based on Mixture of Experts (MoE) architecture, focusing on efficient inference and low-cost training, suitable for tasks like dialogue generation and code completion

Model Features

Low-cost efficient training

Surpasses LLaMA2-7B at a training cost of under $100,000 (a 96×H100 cluster running for 2 weeks)

Efficient inference

Only activates 2.2 billion parameters during inference, significantly reducing computational costs

Fully open-source

Trained only on public datasets with open-source code; can be fine-tuned on consumer-grade GPUs

Two-phase training approach

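Follows MiniCPM's two-phase training method: Phase 1 pretrains on large-scale open-source data at a constant learning rate; Phase 2 adds extra high-quality data and uses exponential learning rate decay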

Model Capabilities

Text generation

Dialogue systems

Code completion

Mathematical problem solving

Multi-turn dialogue

Use Cases

Dialogue systems

Intelligent chatbot

Build friendly and knowledgeable conversational assistants

MT-Bench score of 6.681, surpassing Llama-2-13b-chat

Code generation

Programming assistance

Automatically generate and complete code

MBPP benchmark Pass@1 reached 34.2%, outperforming LLaMA2-7B

🚀 JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars

JetMoE-8B is trained at low cost, outperforms LLaMA2-7B, is fully open-sourced, and has a low inference computational cost.

🚀 Quick Start

Here's a quick example to get you started with JetMoE-8B-chat:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Initialize the model and tokenizer
model_name = "jetmoe/jetmoe-8b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, attn_implementation="eager", trust_remote_code=True)

# Use a GPU if one is available; otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))
else:
    device = torch.device("cpu")
    print("GPU is not available, using CPU instead.")
model = model.to(device)

# Build the conversation and encode it with the model's chat template
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenized_chat)

# Move the input IDs to the same device as the model (GPU or CPU)
input_ids = tokenized_chat.to(device)

# Generate text
output = model.generate(input_ids, max_length=500, num_return_sequences=1, no_repeat_ngram_size=2)

# Move the output back to the CPU and decode the generated text
generated_text = tokenizer.decode(output[0].cpu(), skip_special_tokens=True)
print(generated_text)

✨ Features

Low-cost training: JetMoE-8B was trained for less than $0.1 million¹, yet it outperforms LLaMA2-7B from Meta AI, which has multi-billion-dollar training resources. LLM training can be much cheaper than people previously thought.
Open-sourced and academia-friendly:
- It uses only public datasets for training, and the code is open-sourced. No proprietary resources are needed.
- It can be fine-tuned with a very limited compute budget (e.g., a consumer-grade GPU) that most labs can afford.
Low inference cost: JetMoE-8B has only 2.2B active parameters during inference, which drastically lowers the computational cost. Compared to a model with similar inference computation, like Gemma-2B, JetMoE-8B achieves consistently better performance.

¹ We used a 96×H100 GPU cluster for 2 weeks, which cost ~$0.08 million.
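
As a rough sanity check on the footnote, the figures can be turned into GPU-hours and an implied hourly rate. This is only a back-of-the-envelope sketch: it assumes "2 weeks" means 14 full days of continuous use on all 96 GPUs and takes the approximate $0.08 million figure at face value.

```python
# Back-of-the-envelope check of the training cost quoted in the footnote.
NUM_GPUS = 96            # H100 GPUs, from the footnote
TRAINING_DAYS = 14       # "2 weeks", assumed to mean 14 full days
TOTAL_COST_USD = 80_000  # "~$0.08 million", from the footnote

# Total GPU-hours if every GPU runs around the clock for the whole period.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

# Implied price per GPU-hour under the assumptions above.
implied_rate = TOTAL_COST_USD / gpu_hours

print(f"GPU-hours: {gpu_hours:,}")
print(f"Implied rate: ${implied_rate:.2f} per H100-hour")
```

Under these assumptions the run amounts to roughly 32,000 GPU-hours at about $2.50 per H100-hour.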

📚 Documentation

Benchmarks

We use the same evaluation methodology as the Open LLM Leaderboard. For the MBPP code benchmark, we use the same evaluation methodology as the LLaMA2 and DeepseekMoE papers. The results are shown below:

Model	Active Params	Training Tokens	Open LLM Leaderboard Avg	ARC	Hellaswag	MMLU	TruthfulQA	WinoGrande	GSM8k	MBPP	HumanEval
Shot				25	10	5	0	5	5	3	0
Metric				acc_norm	acc_norm	acc	mc2	acc	acc	Pass@1	Pass@1
LLaMA2-7B	7B	2T	51.0	53.1	78.6	46.9	38.8	74	14.5	20.8	12.8
LLaMA-13B	13B	1T	51.4	56.2	80.9	47.7	39.5	76.2	7.6	22.0	15.8
DeepseekMoE-16B	2.8B	2T	51.1	53.2	79.8	46.3	36.1	73.7	17.3	34.0	25.0
Gemma-2B	2B	2T	46.4	48.4	71.8	41.8	33.1	66.3	16.9	28.0	24.4
JetMoE-8B	2.2B	1.25T	53.0	48.7	80.5	49.2	41.7	70.2	27.8	34.2	14.6
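
The "Open LLM Leaderboard Avg" column appears to be the plain mean of the six leaderboard tasks (ARC, Hellaswag, MMLU, TruthfulQA, WinoGrande, GSM8k); MBPP and HumanEval are reported separately. That reading is an assumption, and the short check below recomputes the mean from the table rows and compares it with the reported value, allowing for rounding to one decimal place.

```python
from statistics import mean

# Per-task scores copied from the table, in the order
# ARC, Hellaswag, MMLU, TruthfulQA, WinoGrande, GSM8k.
scores = {
    "LLaMA2-7B":       [53.1, 78.6, 46.9, 38.8, 74.0, 14.5],
    "LLaMA-13B":       [56.2, 80.9, 47.7, 39.5, 76.2, 7.6],
    "DeepseekMoE-16B": [53.2, 79.8, 46.3, 36.1, 73.7, 17.3],
    "Gemma-2B":        [48.4, 71.8, 41.8, 33.1, 66.3, 16.9],
    "JetMoE-8B":       [48.7, 80.5, 49.2, 41.7, 70.2, 27.8],
}

# Averages as reported in the "Open LLM Leaderboard Avg" column.
reported_avg = {
    "LLaMA2-7B": 51.0,
    "LLaMA-13B": 51.4,
    "DeepseekMoE-16B": 51.1,
    "Gemma-2B": 46.4,
    "JetMoE-8B": 53.0,
}

averages = {name: mean(vals) for name, vals in scores.items()}

for name, avg in averages.items():
    # A reported value rounded to one decimal can differ by up to 0.05.
    matches = abs(avg - reported_avg[name]) <= 0.051
    print(f"{name:16s} computed {avg:.2f}  reported {reported_avg[name]:.1f}  consistent: {matches}")
```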

Model	MT-Bench Score
GPT-4	9.014
GPT-3.5-turbo	7.995
Claude-v1	7.923
JetMoE-8B-chat	6.681
Llama-2-13b-chat	6.650
Vicuna-13b-v1.3	6.413
Wizardlm-13b	6.353
Llama-2-7b-chat	6.269

To our surprise, despite its lower training cost and computation, JetMoE-8B achieves a higher Open LLM Leaderboard average than LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B. Compared to a model with similar training and inference computation, like Gemma-2B, JetMoE-8B achieves better performance.

Model Details

JetMoE-8B has 24 blocks. Each block has two MoE layers: Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE). Each MoA and MoE layer has 8 experts, and 2 experts are activated for each input token. It has 8 billion parameters in total and 2.2B active parameters. JetMoE-8B is trained on 1.25T tokens from publicly available datasets, with a learning rate of 5.0 × 10^-4 and a global batch size of 4M tokens.
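
To make "2 of 8 experts are activated for each input token" concrete, here is a minimal, framework-free sketch of top-2 routing. It is illustrative only and is not the JetMoE implementation: the function names (`top_k_route`, `moe_forward`), the toy scalar experts, and the choice to normalize weights with a softmax over the selected logits are all assumptions made for this example.

```python
import math
import random

NUM_EXPERTS = 8  # experts per MoA/MoE layer, from the text
TOP_K = 2        # experts activated per token, from the text

def top_k_route(router_logits, k=TOP_K):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This normalization is an assumption, not a detail confirmed by the source.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    max_logit = max(router_logits[i] for i in top)  # for numerical stability
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits))

# Toy experts (hypothetical): expert i scales its scalar input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

# Random router scores stand in for the output of a learned router.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(logits)
print("selected experts and weights:", routes)
print("layer output for x = 1.0:", moe_forward(1.0, experts, logits))
```

Because only the two selected experts run for a given token, the per-token compute depends on the active parameters rather than on the full parameter count, which is the effect the 2.2B versus 8B figures describe.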

Training Details

Our training recipe follows MiniCPM's two-phase training method. Phase 1 uses a constant learning rate with linear warmup and is trained on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, GitHub data, etc. Phase 2 uses exponential learning rate decay and is trained on 250 billion tokens from the Phase 1 datasets and extra high-quality open-source datasets.
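
The two phases can be sketched as a single learning rate function of the optimizer step. The token counts, peak learning rate, and 4M-token batch size come from the text (with "4M" assumed to mean 4 × 10^6); the warmup length and the final learning rate ratio are not given in the source, so `WARMUP_STEPS` and `FINAL_LR_RATIO` below are hypothetical placeholders.

```python
TOKENS_PER_STEP = 4_000_000        # global batch of "4M tokens", assumed 4 * 10^6
PHASE1_TOKENS = 1_000_000_000_000  # 1 trillion tokens
PHASE2_TOKENS = 250_000_000_000    # 250 billion tokens
PEAK_LR = 5.0e-4                   # learning rate stated in Model Details

PHASE1_STEPS = PHASE1_TOKENS // TOKENS_PER_STEP  # 250,000 steps
PHASE2_STEPS = PHASE2_TOKENS // TOKENS_PER_STEP  # 62,500 steps

# Hypothetical values: the source gives neither the warmup length
# nor how far the learning rate decays in Phase 2.
WARMUP_STEPS = 2_000
FINAL_LR_RATIO = 0.1  # learning rate at the end of Phase 2, relative to the peak

def learning_rate(step):
    """Linear warmup, then constant (Phase 1), then exponential decay (Phase 2)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < PHASE1_STEPS:
        return PEAK_LR
    # Fraction of Phase 2 completed, clamped to 1 after training ends.
    progress = min(step - PHASE1_STEPS, PHASE2_STEPS) / PHASE2_STEPS
    return PEAK_LR * FINAL_LR_RATIO ** progress

for step in (0, 1_000, 100_000, PHASE1_STEPS, PHASE1_STEPS + PHASE2_STEPS):
    print(f"step {step:>7,}: lr = {learning_rate(step):.3e}")
```

Under the batch-size assumption, the 1.25T-token run corresponds to 312,500 optimizer steps in total, with the decay confined to the last fifth of training.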

JetMoE Model Index

Model	Index
JetMoE-8B-Base	Link
JetMoE-8B-SFT	Link
JetMoE-8B-Chat	Link

Technical Report

For more details, please refer to the JetMoE Technical Report.

🔧 Technical Details

The project is contributed by Yikang Shen, Zhen Guo, Tianle Cai and Zengyi Qin. For technical inquiries, please contact Yikang Shen. For media and collaboration inquiries, please contact Zengyi Qin.

Collaboration

If you have great ideas but need more resources (GPU, data, funding, etc.), you are welcome to contact MyShell.ai via Zengyi Qin. MyShell.ai is open to collaborations and is actively supporting high-quality open-source projects.

📄 License

The model is released under the Apache-2.0 license.

Acknowledgement

We express our gratitude to Shengding Hu for his valuable advice on the Phase 2 data mixture. We also express our gratitude to Exabits for their assistance in setting up the GPU clusters, and to Lepton AI for their support in setting up the chat demo.