# Arctic: A Dense-MoE Hybrid Transformer Model
Arctic is a dense-MoE hybrid transformer architecture developed by the Snowflake AI Research Team. Pre-trained model checkpoints are available for both base and instruct-tuned versions under the Apache-2.0 license, enabling free use in research, prototypes, and products.
## Quick Start

Arctic is currently supported with `transformers` by leveraging the custom code feature. To use it, simply add `trust_remote_code=True` to your `AutoTokenizer` and `AutoModelForCausalLM` calls. We recommend using a `transformers` version at or above 4.39:

```bash
pip install "transformers>=4.39.0"
```
Arctic also leverages several features from DeepSpeed. You'll need DeepSpeed 0.14.2 or higher to get all of the required features:

```bash
pip install "deepspeed>=0.14.2"
```
## Features

- Hybrid Architecture: Combines a 10B dense transformer model with a residual 128×3.66B MoE MLP, resulting in 480B total and 17B active parameters chosen using top-2 gating (see the sketch after this list).
- Free to Use: Released under an Apache-2.0 license, allowing free use in various projects.
- Multiple Versions: Available in both base and instruct-tuned versions.
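For intuition, here is a minimal, self-contained sketch of top-2 gating over a pool of expert MLPs. This is illustrative only: the module names, sizes, and routing details are invented for clarity and do not reflect Arctic's actual implementation.

```python
# Toy residual MoE MLP with top-2 gating -- NOT Arctic's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTop2MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # The router scores each token against every expert
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); each token is routed to its top-2 experts,
        # so only 2/n_experts of the expert parameters are active per token.
        logits = self.router(x)
        weights, idx = torch.topk(logits, k=2, dim=-1)   # (tokens, 2)
        weights = F.softmax(weights, dim=-1)             # renormalize over the top 2
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return x + out  # residual connection around the MoE MLP

moe = ToyTop2MoE(d_model=64, d_ff=256, n_experts=8)
y = moe(torch.randn(10, 64))  # 10 tokens, each handled by 2 of 8 experts
```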
## Installation

To use Arctic, install the required libraries as mentioned above:

```bash
pip install "transformers>=4.39.0"
pip install "deepspeed>=0.14.2"
```
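To confirm that the installed versions meet these minimums, here is a quick sanity check (a minimal sketch; it assumes `packaging`, which ships as a dependency of `transformers`, is available):

```python
# Verify the installed versions meet the minimums stated above
import deepspeed
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.39.0")
assert version.parse(deepspeed.__version__) >= version.parse("0.14.2")
print("transformers", transformers.__version__, "| deepspeed", deepspeed.__version__)
```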
## Usage Examples

### Basic Usage

```python
import os
# Enable the faster hf_transfer download backend before anything imports huggingface_hub
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepspeed.linear.config import QuantizationConfig

tokenizer = AutoTokenizer.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    trust_remote_code=True,
)

# 8-bit quantization of the model weights via DeepSpeed
quant_config = QuantizationConfig(q_bits=8)

model = AutoModelForCausalLM.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    device_map="auto",
    ds_quantization_config=quant_config,
    max_memory={i: "150GiB" for i in range(8)},  # cap per-GPU memory on an 8-GPU node
    torch_dtype=torch.bfloat16,
)

content = "5x + 35 = 7x - 60 + 10. Solve for x"
messages = [{"role": "user", "content": content}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
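For the prompt above, the decoded output should include the model's worked solution; the equation simplifies to 2x = 85, so the expected answer is x = 42.5.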
### Advanced Usage

Due to the model size, we recommend using a single 8xH100 instance from a cloud provider such as AWS p5.48xlarge or Azure ND96isr_H100_v5. You can also use FP6 quantization by specifying `q_bits=6` in the `QuantizationConfig`.
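The only change from the Basic Usage example above is the `q_bits` value passed to `QuantizationConfig`:

```python
from deepspeed.linear.config import QuantizationConfig

# FP6 instead of the 8-bit setting used in Basic Usage
quant_config = QuantizationConfig(q_bits=6)
# ...then pass ds_quantization_config=quant_config to from_pretrained() as before
```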
## Documentation

For more information on Arctic, including details about its architecture, training process, data, and more, see our series of cookbooks.

The Arctic GitHub page has additional code snippets and examples for running inference:

- Example with pure HF: https://github.com/Snowflake-Labs/snowflake-arctic/blob/main/inference
- Tutorial using vLLM: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main/inference/vllm
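As a rough orientation before reading the vLLM tutorial, serving Arctic there might look like the sketch below. This is an untested, assumption-laden sketch: engine arguments such as `tensor_parallel_size=8` are guesses for an 8-GPU node, and the linked tutorial is authoritative.

```python
from vllm import LLM, SamplingParams

# Assumptions: an 8-GPU node; trust_remote_code for Arctic's custom model code
llm = LLM(
    model="Snowflake/snowflake-arctic-instruct",
    trust_remote_code=True,
    tensor_parallel_size=8,
)
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["5x + 35 = 7x - 60 + 10. Solve for x"], params)
print(outputs[0].outputs[0].text)
```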
## Technical Details

Arctic combines a 10B dense transformer model with a residual 128×3.66B MoE MLP, resulting in 480B total and 17B active parameters chosen using top-2 gating.
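Those headline numbers can be sanity-checked with back-of-the-envelope arithmetic (rounded; this assumes the active count is the dense trunk plus the two routed experts):

```python
dense = 10e9             # dense transformer parameters
expert = 3.66e9          # parameters per expert
n_experts = 128

total = dense + n_experts * expert  # ~478.5B, quoted as ~480B
active = dense + 2 * expert         # top-2 gating -> ~17.3B, quoted as ~17B
print(f"total ~= {total / 1e9:.1f}B, active ~= {active / 1e9:.1f}B")
```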
## License

This model is released under the Apache-2.0 license.
## Model Information

| Property | Value |
| --- | --- |
| Developer | Snowflake AI Research Team |
| Architecture | Dense-MoE hybrid transformer (10B dense + residual 128×3.66B MoE MLP) |
| Total parameters | 480B |
| Active parameters | 17B (top-2 gating) |
| Versions | Base and instruct-tuned |
| License | Apache-2.0 |