Elastic model: Llama-3.2-1B-Instruct. Fastest and most flexible models for self-serving.
Elastic models are designed to offer users the fastest and most flexible solutions for self-serving. They are produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA enables users to easily control model size, latency, and quality with a simple slider movement. For each base model, ANNA generates a series of optimized models (selected with the `mode` argument, as sketched after this list):
- XL: A mathematically equivalent neural network optimized by our DNN compiler.
- L: A near-lossless model with less than 1% degradation on corresponding benchmarks.
- M: A faster model with accuracy degradation less than 1.5%.
- S: The fastest model with accuracy degradation less than 2%.
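In code, the variant is chosen with the `mode` argument of `from_pretrained` (see the full example in Quick Start below); a minimal sketch, assuming the same checkpoint and HF token setup as in that example:

```python
import torch
from elastic_models.transformers import AutoModelForCausalLM

# Pick a point on the speed/quality slider: "XL", "L", "M", or "S".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    token="<YOUR_HF_TOKEN>",   # HF token, as in the Quick Start example below
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode="M",                  # e.g. "M": faster than "L"/"XL", <1.5% accuracy degradation
).to("cuda")
```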
Goals of elastic models
- Provide flexibility in cost-quality selection during inference.
- Offer clear quality and latency benchmarks.
- Provide an interface for HF libraries (transformers and diffusers) with a single line of code.
- Support a wide range of pre-compiled hardware, eliminating the need for JIT.
- Provide the best models and services for self-hosting.
Important Note
Specific quality degradation can vary from model to model. For example, an S model may have a 0.5% degradation.

Quick Start
Model Information
| Property | Details |
|---|---|
| Model Type | Elastic model based on Llama-3.2-1B-Instruct |
| Base Model | meta-llama/Llama-3.2-1B-Instruct |
| Base Model Relation | Quantized |
| Pipeline Tag | text2text-generation |
| Supported Languages | Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic |
| License | Apache-2.0 |
Installation
To work with our models, follow these steps (a quick check of the installed packages is sketched after them):
- Install the necessary packages:
pip install thestage
pip install elastic_models[nvidia] \
  --index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://pypi.org/simple
pip install flash_attn==2.7.3 --no-build-isolation
pip uninstall apex
- Go to app.thestage.ai, log in, and generate an API token from your profile page.
- Set up the API token:
thestage config set --api-token <YOUR_API_TOKEN>
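The following optional snippet is a quick sanity check that the packages from the steps above are visible in the current environment (package distribution names are assumed to match the pip install names):

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed above; flash_attn is pinned to 2.7.3 in the instructions.
for pkg in ("thestage", "elastic_models", "flash_attn"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```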
Usage Examples
Basic Usage
To infer our models, replace the `transformers` import with `elastic_models.transformers`:
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

# HF token for the gated base repository.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
hf_token = ''
device = torch.device("cuda")

# Create the tokenizer and the elastic model; `mode` selects the S/M/L/XL variant.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, token=hf_token
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    mode='S'
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Inference works the same way as with the transformers library.
prompt = "Describe basics of DNNs quantization."
messages = [
    {
        "role": "system",
        "content": "You are a search bot, answer on user text queries."
    },
    {
        "role": "user",
        "content": prompt
    }
]
chat_prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

inputs = tokenizer(chat_prompt, return_tensors="pt")
inputs.to(device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_length=500)

# Strip the prompt tokens and decode only the newly generated ones.
input_len = inputs['input_ids'].shape[1]
generate_ids = generate_ids[:, input_len:]
output = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(f"# Q:\n{prompt}\n")
print(f"# A:\n{output}\n")
Documentation
System Requirements
- GPUs: H100, L40s
- CPU: AMD, Intel
- Python: 3.10-3.12
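A small, optional runtime check against these requirements might look like the sketch below (only a convenience script; the GPU name substrings are an assumption about how `torch.cuda.get_device_name` reports H100 and L40s cards):

```python
import sys
import torch

# Python 3.10-3.12 and a CUDA GPU are required per the list above.
assert (3, 10) <= sys.version_info[:2] <= (3, 12), "Python 3.10-3.12 is required"
assert torch.cuda.is_available(), "A CUDA-capable GPU (H100 or L40s) is required"

gpu_name = torch.cuda.get_device_name(0)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}, GPU: {gpu_name}")
if not any(s in gpu_name.upper() for s in ("H100", "L40S")):
    print("Warning: this GPU is not in the pre-compiled list (H100, L40s)")
```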
Technical Details
Benchmarks
Benchmarking is crucial during model acceleration. We aim to provide clear performance metrics for models using our algorithms. The W8A8, int8 column indicates that W8A8 quantization with the int8 data type was applied to all linear layers, using the same calibration data as for ANNA. The S model achieves similar speed but higher quality, as ANNA can improve quantization quality on sensitive layers.
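For intuition about that baseline column: W8A8 means that, for each linear layer, both the weights (W8) and the activations (A8) are stored as 8-bit integers, and the int32 matmul result is rescaled back to floating point. The snippet below is only an illustration of that arithmetic with per-tensor symmetric scales, not TheStage AI's actual calibration pipeline:

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Per-tensor symmetric int8 quantization: x ≈ scale * q with q in [-127, 127]."""
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

# Toy linear layer y = x @ W.T with random weights and activations.
w = torch.randn(256, 256)
x = torch.randn(8, 256)

w_q, w_scale = quantize_int8(w)   # W8: int8 weights
x_q, x_scale = quantize_int8(x)   # A8: int8 activations

# Integer matmul accumulated in int32, then rescaled back to float.
y_int32 = x_q.to(torch.int32) @ w_q.to(torch.int32).T
y_approx = y_int32.to(torch.float32) * (x_scale * w_scale)

y_ref = x @ w.T
print("relative error:", ((y_approx - y_ref).norm() / y_ref.norm()).item())
```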
Quality benchmarks
| Metric/Model | S | M | L | XL | Original | W8A8, int8 |
|---|---|---|---|---|---|---|
| MMLU | 45.5 | 45.9 | 45.9 | 46.2 | 46.2 | 24 |
| PIQA | 73.1 | 73.7 | 74.2 | 74.3 | 74.3 | 55.8 |
| Arc Challenge | 34.5 | 35.9 | 36.0 | 35.8 | 35.8 | 20.3 |
| Winogrande | 60.4 | 59.7 | 60.8 | 59.5 | 59.5 | 50.3 |
- MMLU: Evaluates general knowledge across 57 subjects including science, humanities, engineering, etc., showing the model's ability to handle diverse academic topics.
- PIQA: Evaluates physical commonsense reasoning through questions about everyday physical interactions, demonstrating the model's understanding of real-world physics concepts.
- Arc Challenge: Evaluates grade-school level multiple-choice questions requiring reasoning, indicating the model's ability to solve complex reasoning tasks.
- Winogrande: Evaluates commonsense reasoning through sentence completion tasks, showing the model's capability to understand context and resolve ambiguity.
Latency benchmarks
100 input tokens / 300 output tokens; throughput in tok/s (a measurement sketch follows the table):
| GPU/Model | S | M | L | XL | Original | W8A8, int8 |
|---|---|---|---|---|---|---|
| H100 | 436 | 436 | 409 | 396 | 110 | 439 |
| L40s | 290 | 251 | 222 | 210 | 103 | 300 |
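These figures were measured by TheStage AI; the exact harness is not published, but a rough tok/s number in the same setting (100 input tokens, 300 generated tokens, greedy decoding as an assumption) can be reproduced with the same API as in the usage example:

```python
import time
import torch
from transformers import AutoTokenizer
from elastic_models.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"
device = torch.device("cuda")

# Pass token=... as in the usage example if the base repository requires authentication.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="sdpa", mode="S"
).to(device)
model.generation_config.pad_token_id = tokenizer.eos_token_id

# Build an input of exactly 100 tokens; the text itself does not matter for timing.
input_ids = tokenizer("hello " * 200, return_tensors="pt").input_ids[:, :100].to(device)

# Warm-up so one-time initialization does not skew the measurement.
with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=16, do_sample=False)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(
        input_ids, max_new_tokens=300, min_new_tokens=300, do_sample=False
    )
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```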
License
This project is licensed under the Apache-2.0 license.
Links