Zamba2-2.7B Open-Source AI Model: High Performance and Low Latency

Zamba2 2.7B

Developed by Zyphra

Zamba2-2.7B is a hybrid model built from state-space (Mamba2) blocks and shared Transformer attention blocks, offering high performance and low latency.

Large Language Model

Transformers

Open Source License: Apache-2.0

#Hybrid state space architecture #Low-latency inference #Device-side optimization

Release Time : 7/9/2024

Model Overview

Zamba2-2.7B is a hybrid-architecture model that combines state-space and Transformer blocks. Its Mamba2 blocks and shared attention blocks give it high performance and low-latency inference.

Model Features

Hybrid architecture

Combines state-space and Transformer blocks, using Mamba2 blocks and shared attention blocks to improve performance.

Parameter optimization

Sharing attention weights and applying LoRA projectors lets each invocation of the shared block specialize across depth while keeping the parameter count under control.

High performance

Achieves leading performance among models with fewer than 3B parameters and is competitive with some larger models.

Low latency and small memory footprint

The hybrid SSM architecture gives very low inference latency, fast generation, and a small memory footprint.

Model Capabilities

Text generation

Code generation

General language understanding

Use Cases

General language model applications

Question-answering system

Answers complex questions, such as analysis of historical events, with the aim of producing detailed and accurate answers.

Code generation

Generates code snippets from natural language descriptions, with the aim of producing code that matches the description.

🚀 Model Card for Zamba2-2.7B

Zamba2-2.7B is a hybrid model that combines state-space and transformer blocks, offering high performance with low latency and memory usage.

🚀 Quick Start

Prerequisites

To use Zamba2-2.7B, you need to install transformers from source:

git clone https://github.com/huggingface/transformers.git
cd transformers && pip install .

To install the dependencies required to run Mamba2 kernels, install mamba-ssm from source (due to compatibility issues with PyTorch) and causal-conv1d:

git clone https://github.com/state-spaces/mamba.git
cd mamba && git checkout v2.1.0 && pip install .
pip install causal-conv1d

You can run the model without using the optimized Mamba2 kernels, but it is not recommended as it will lead to significantly higher latency and memory usage.
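Before loading the model, it can be useful to confirm that the optional kernel packages are visible to Python. The following is a minimal sketch, not part of the official instructions. It assumes the packages installed above expose the import names mamba_ssm and causal_conv1d, and the helper name missing_kernel_modules is hypothetical. It only checks that the modules can be found, not that the kernels work on your GPU.

```python
import importlib.util

# Import names assumed to correspond to the mamba-ssm and causal-conv1d packages.
OPTIONAL_KERNEL_MODULES = ("mamba_ssm", "causal_conv1d")


def missing_kernel_modules(modules=OPTIONAL_KERNEL_MODULES):
    """Return the names of modules that cannot be found on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_kernel_modules()
    if missing:
        # Expect higher latency and memory usage without the optimized kernels.
        print("Optimized kernels unavailable, missing:", ", ".join(missing))
    else:
        print("mamba_ssm and causal_conv1d are importable.")
```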

Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the model in bfloat16 on the GPU
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-2.7B")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-2.7B", device_map="cuda", torch_dtype=torch.bfloat16)

# Tokenize the prompt and move the tensors to the same device as the model
input_text = "What factors contributed to the fall of the Roman Empire?"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate up to 100 new tokens and decode the full sequence (prompt plus continuation)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

✨ Features

Zamba2-2.7B has three major improvements over Zamba1:

Mamba1 blocks have been replaced with Mamba2 blocks.
Instead of a single shared attention block, two shared attention blocks are used, interleaved in an ABAB pattern throughout the network.
A LoRA projector is applied to each shared MLP block, allowing the network to specialize the MLPs at each invocation of the shared layer across depth with only a minimal increase in total parameter count.
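The layout described above can be pictured with a small sketch. This is an illustration only, not Zyphra's implementation. The function name layer_schedule, the shared_every spacing, and the choice to place a shared block before a group of Mamba2 blocks are all assumptions, since the source does not state the exact interleaving. The sketch shows two shared blocks alternating in an ABAB pattern, with a separate LoRA index for each invocation.

```python
def layer_schedule(num_mamba_layers, shared_every):
    """Build an illustrative block order for a Zamba2-style hybrid.

    A shared block (alternating A, B, A, B, ...) is placed before every
    group of `shared_every` Mamba2 blocks. Each invocation of a shared
    block gets its own LoRA index, standing in for the per-invocation
    LoRA projector. The spacing and placement are hypothetical.
    """
    schedule = []
    shared_calls = 0
    for i in range(num_mamba_layers):
        if i % shared_every == 0:
            name = "shared_A" if shared_calls % 2 == 0 else "shared_B"
            schedule.append(f"{name}+lora_{shared_calls}")
            shared_calls += 1
        schedule.append(f"mamba2_{i}")
    return schedule


if __name__ == "__main__":
    for block in layer_schedule(num_mamba_layers=6, shared_every=3):
        print(block)
```

The weights behind shared_A and shared_B would be stored once each, while every lora_k entry holds its own small set of parameters.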

📦 Installation

The installation steps are included in the Quick Start section.

📚 Documentation

Model Details

Zamba2-2.7B uses and extends the original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention blocks (one in Zamba1, two in Zamba2). The attention weights are shared across invocations to minimize the parameter cost of the model. Concatenating the original model embeddings to the input of this attention block improves performance, likely because information is better maintained across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP. This adds expressivity to each block and lets each invocation of the shared block specialize slightly to its own position while keeping the additional parameter overhead small.
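To see why sharing a weight matrix and adding per-invocation LoRA projectors keeps the overhead small, a back-of-the-envelope parameter count helps. The sketch below is illustrative. The dimensions, rank, and number of invocations are made-up values, not Zamba2-2.7B's actual configuration, and the function names are hypothetical. It assumes the standard LoRA form, where a d_in x d_out weight gets a low-rank update built from a d_in x rank matrix and a rank x d_out matrix.

```python
def shared_with_lora_params(d_in, d_out, rank, invocations):
    """Parameters for one shared d_in x d_out weight plus one LoRA pair per invocation."""
    base = d_in * d_out                          # stored once, reused at every invocation
    lora = invocations * rank * (d_in + d_out)   # a (d_in x rank) and (rank x d_out) pair each time
    return base + lora


def unshared_params(d_in, d_out, invocations):
    """Parameters if every invocation had its own full d_in x d_out weight."""
    return invocations * d_in * d_out


if __name__ == "__main__":
    # Illustrative sizes only; not the real model configuration.
    d_in, d_out, rank, invocations = 1024, 4096, 8, 6
    print("shared + LoRA:", shared_with_lora_params(d_in, d_out, rank, invocations))
    print("unshared:     ", unshared_params(d_in, d_out, invocations))
```

With these illustrative numbers the shared variant needs 4,440,064 parameters against 25,165,824 for six independent matrices, so the LoRA projectors add under 6% on top of the single shared weight.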

Performance

Zamba2-2.7B achieves state-of-the-art performance among models with fewer than 3B parameters and is competitive with some significantly larger models. Because of its hybrid SSM architecture, Zamba2-2.7B has very low inference latency and fast generation, with a significantly smaller memory footprint than comparable transformer-based models. Its high performance and small inference compute and memory footprint make it a strong generalist model for on-device applications.
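One reason a hybrid model can use less memory at inference is that only its attention blocks keep a key-value cache that grows with sequence length, while state-space layers carry a state whose size does not depend on sequence length. The sketch below estimates the key-value cache size only. The layer counts, head counts, and head dimension are hypothetical values chosen for round numbers, not measurements or Zamba2-2.7B's configuration, and the SSM state is deliberately left out of the estimate.

```python
def kv_cache_bytes(attention_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Key-value cache size for one sequence.

    The factor 2 covers keys and values; bytes_per_value=2 matches bfloat16.
    """
    return 2 * attention_layers * kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical configurations for comparison only.
    full_transformer = kv_cache_bytes(attention_layers=32, kv_heads=32, head_dim=64, seq_len=4096)
    hybrid = kv_cache_bytes(attention_layers=6, kv_heads=32, head_dim=64, seq_len=4096)
    print(f"32 attention layers: {full_transformer / 2**20:.0f} MiB")
    print(f" 6 attention layers: {hybrid / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks in proportion to the number of attention invocations, from 1024 MiB to 192 MiB at 4,096 tokens.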

[Figures: Time to First Token (TTFT) and Output Generation benchmarks]

📄 License

This project is licensed under the Apache-2.0 license.

⚠️ Important Note

Zamba2-2.7B is a pretrained base model and does not have any moderation mechanism, so it may output toxic or otherwise harmful language. Also, do not expect good instruct or chat performance as this model was not fine-tuned for instruction following or chat.
