Elastic model: Llama-3.2-1B-Instruct. Fastest and most flexible models for self-serving.
Elastic models are designed to offer users the fastest and most flexible solutions for self-serving. They are produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA enables users to easily control model size, latency, and quality with a simple slider movement. For each base model, ANNA generates a series of optimized models (selected with the `mode` argument, as sketched after this list):
- XL: A mathematically equivalent neural network optimized by our DNN compiler.
- L: A near-lossless model with less than 1% degradation on corresponding benchmarks.
- M: A faster model with accuracy degradation less than 1.5%.
- S: The fastest model with accuracy degradation less than 2%.
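In code, the variant is chosen with the `mode` argument of `from_pretrained` (see the full example in Quick Start below); a minimal sketch, assuming the same checkpoint and HF token setup as in that example:

```python
import torch
from elastic_models.transformers import AutoModelForCausalLM

# Pick a point on the speed/quality slider: "XL", "L", "M", or "S".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    token="<YOUR_HF_TOKEN>",   # HF token, as in the Quick Start example below
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode="M",                  # e.g. "M": faster than "L"/"XL", <1.5% accuracy degradation
).to("cuda")
```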
Goals of elastic models
- Provide flexibility in cost-quality selection during inference.
- Offer clear quality and latency benchmarks.
- Provide an interface for HF libraries (transformers and diffusers) with a single line of code.
- Support a wide range of pre-compiled hardware, eliminating the need for JIT.
- Provide the best models and services for self-hosting.
Important Note
Specific quality degradation can vary from model to model. For example, an S model may have a 0.5% degradation.

Quick Start
Model Information
| Property | Details |
|---|---|
| Model Type | Elastic model based on Llama-3.2-1B-Instruct |
| Base Model | meta-llama/Llama-3.2-1B-Instruct |
| Base Model Relation | Quantized |
| Pipeline Tag | text2text-generation |
| Supported Languages | Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic |
| License | Apache-2.0 |
Installation
To work with our models, follow these steps (a quick check of the installed packages is sketched after them):
- Install the necessary packages:
pip install thestage
pip install elastic_models[nvidia] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple
pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
- Go to app.thestage.ai, log in, and generate an API token from your profile page.
- Set up the API token:
thestage config set --api-token <YOUR_API_TOKEN>
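The following optional snippet is a quick sanity check that the packages from the steps above are visible in the current environment (package distribution names are assumed to match the pip install names):

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed above; flash_attn is pinned to 2.7.3 in the instructions.
for pkg in ("thestage", "elastic_models", "flash_attn"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```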
Usage Examples
Basic Usage
To infer our models, replace the `transformers` import with `elastic_models.transformers`:
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# HF token for the gated base repository.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and the elastic model; `mode` selects the S/M/L/XL variant.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as with the transformers library.
prompt = "Describe basics of DNNs quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer on user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]
chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs.to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

# Strip the prompt tokens and decode only the newly generated ones.
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
Documentation
System Requirements
- GPUs: H100, L40s
- CPU: AMD, Intel
- Python: 3.10-3.12
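A small, optional runtime check against these requirements might look like the sketch below (only a convenience script; the GPU name substrings are an assumption about how `torch.cuda.get_device_name` reports H100 and L40s cards):

```python
import sys
import torch

# Python 3.10-3.12 and a CUDA GPU are required per the list above.
assert (3, 10) <= sys.version_info[:2] <= (3, 12), "Python 3.10-3.12 is required"
assert torch.cuda.is_available(), "A CUDA-capable GPU (H100 or L40s) is required"

gpu_name = torch.cuda.get_device_name(0)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}, GPU: {gpu_name}")
if not any(s in gpu_name.upper() for s in ("H100", "L40S")):
    print("Warning: this GPU is not in the pre-compiled list (H100, L40s)")
```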
Technical Details
Benchmarks
Benchmarking is crucial during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The W8A8, int8 column indicates that W8A8 quantization with the int8 data type was applied to all linear layers, using the same calibration data as for ANNA. The S model achieves similar speed but higher quality, as ANNA can improve quantization quality on sensitive layers.
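For intuition about that baseline column: W8A8 means that, for each linear layer, both the weights (W8) and the activations (A8) are stored as 8-bit integers, and the int32 matmul result is rescaled back to floating point. The snippet below is only an illustration of that arithmetic with per-tensor symmetric scales, not TheStage AI's actual calibration pipeline:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Per-tensor symmetric int8 quantization: x ≈ scale * q with q in [-127, 127]."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

# Toy linear layer y = x @ W.T with random weights and activations.
w = torch.randn(256, 256)
x = torch.randn(8, 256)

w_q, w_scale = quantize_int8(w)   # W8: int8 weights
x_q, x_scale = quantize_int8(x)   # A8: int8 activations

# Integer matmul accumulated in int32, then rescaled back to float.
y_int32 = x_q.to(torch.int32) @ w_q.to(torch.int32).T
y_approx = y_int32.to(torch.float32) * (x_scale * w_scale)

y_ref = x @ w.T
print("relative error:", ((y_approx - y_ref).norm() / y_ref.norm()).item())
```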
Quality benchmarks
| Metric/Model | S | M | L | XL | Original | W8A8, int8 |
|---|---|---|---|---|---|---|
| MMLU | 45.5 | 45.9 | 45.9 | 46.2 | 46.2 | 24 |
| PIQA | 73.1 | 73.7 | 74.2 | 74.3 | 74.3 | 55.8 |
| Arc Challenge | 34.5 | 35.9 | 36.0 | 35.8 | 35.8 | 20.3 |
| Winogrande | 60.4 | 59.7 | 60.8 | 59.5 | 59.5 | 50.3 |
- MMLU: Evaluates general knowledge across 57 subjects including science, humanities, engineering, etc., showing the model's ability to handle diverse academic topics.
- PIQA: Evaluates physical commonsense reasoning through questions about everyday physical interactions, demonstrating the model's understanding of real-world physics concepts.
- Arc Challenge: Evaluates grade-school level multiple-choice questions requiring reasoning, indicating the model's ability to solve complex reasoning tasks.
- Winogrande: Evaluates commonsense reasoning through sentence completion tasks, showing the model's capability to understand context and resolve ambiguity.
Latency benchmarks
100 input tokens / 300 output tokens; throughput in tok/s (a measurement sketch follows the table):
| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|---|---|---|---|---|---|---|
| H100 | 436 | 436 | 409 | 396 | 110 | 439 |
| L40s | 290 | 251 | 222 | 210 | 103 | 300 |
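These figures were measured by TheStage AI; the exact harness is not published, but a rough tok/s number in the same setting (100 input tokens, 300 generated tokens, greedy decoding as an assumption) can be reproduced with the same API as in the usage example:

```python
import time
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"
device = torch.device("cuda")

# Pass token=... as in the usage example if the base repository requires authentication.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="sdpa", mode="S"
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Build an input of exactly 100 tokens; the text itself does not matter for timing.
input_ids = tokenizer("hello " * 200, return_tensors="pt").input_ids[:, :100].to(device)

# Warm-up so one-time initialization does not skew the measurement.
with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=16, do_sample=False)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(
        input_ids, max_new_tokens=300, min_new_tokens=300, do_sample=False
    )
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```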
License
This project is licensed under the Apache-2.0 license.
Links